
How-To Tutorials - Data

1229 Articles
Aaron Lazar
24 Nov 2017
10 min read

A mid-autumn Shopper’s dream - What an Amazon fulfilled Thanksgiving would look like

I’d been preparing for Thanksgiving a good 3 weeks in advance. One reason is that I’d recently rented out a new apartment and the troops were heading over to my place this year. I obviously had to make sure everything went well and for that, trust me, there was no resting even for a minute! Thanksgiving is really about being thankful for the people and things in your life and spending quality time with family. This Thanksgiving I’m especially grateful to Amazon for making it the best experience ever! Read on to find out how Amazon made things awesome! Good times started two weeks ago when I was at the AmazonGo store with my friend, Sue. [embed]https://www.youtube.com/watch?v=NrmMk1Myrxc[/embed] In fact, this was the first time I had set foot in one of the stores. I wanted to see what was so cool about them and why everyone had been talking about them for so long! The store was pretty big and lived up to the A to Z concept, as far as I could see. The only odd thing was that I didn’t notice any queues or a billing counter. Sue glided around the floor with ease, as if she did this every day. I was more interested in seeing what was so special about this place. After she got her stuff, she headed straight for the door. I smiled to myself thinking how absent minded she was. So I called her back and reminded her “You haven’t gotten your products billed.” She smiled back at me and shrugged, “I don’t need to.” Before I could open my mouth to tell her off for stealing, she explained to me about the store. It’s something totally futuristic! Have you ever imagined not having to stand in a line to buy groceries? At the store, you just had to log in to your AmazonGo app on your phone, enter the store, grab your stuff and then leave. The sensors installed everywhere in the store automatically detected what you’d picked up and would bill you accordingly. They also used Computer Vision and Deep Learning to track people and their shopping carts. Now that’s something! 
And you even got a receipt! Well, it was my birthday last week and, knowing what an avid reader I was, my colleagues at the office gifted me a brand new Kindle. I loved every bit of it, but the best part was the X-ray feature. With X-ray, you could simply get information about a character, person, or term in a book. You could also scroll through the long lists of excerpts and click on one to go directly to that particular portion of the book! That’s really amazing, especially if you want to read a particular part of the book quickly. It came in handy at the right time - I downloaded a load of recipe books for the turkey. Another feather in the cap for Amazon! Talking about feathers in one’s cap, you won’t believe it, but Amazon actually got me rekognised at work a few days ago. Nah, that wasn’t a typo. I worked as a software developer/ML engineer in a startup and I’d been doing this for as long as I could remember. I recently built this cool mobile application that recognized faces and unlocked your phone even when you didn’t have something like Face ID on it, and the app had gotten us a million downloads in a month! It could also recognize and give you information about the ethnicity of a person if you captured their photograph with the phone’s camera. The trick was that I’d used the AmazonRekognition APIs for enhanced face detection in the application. Rekognition allows you to detect objects, scenes, text, and faces, using highly scalable, deep learning models. I also enhanced the application using the Polly API. Polly converts text into speech in whichever language you want and gives you the synthesized speech in the form of audio files. The app I built now converted input text into speech in 18 different languages, helping one converse with the person in front of them in that particular language, should they have a problem doing it in English. I got that long-awaited promotion right after! Ever wondered how I got the new apartment?
;) Since the folks were coming over to my place in a few days, I thought I’d get a new dinner set. You’d probably think I would need to sit down at my PC or probably pick up my phone to search for a set online, but I had better things to do. Thanks to Alexa, I simply needed to ask her to find one for me and she did it brilliantly. Now Alexa isn’t my girlfriend, although I would have loved that to be. Alexa is actually Amazon’s cloud-based voice service that provides customers with an engaging way of interacting with technology. Alexa is blessed with finely tuned ASR or Automatic Speech Recognition and NLU or Natural Language Understanding engines, that instantly recognize and respond to voice requests. I selected a pretty looking set and instantly bought it through my Prime account. With technology like this at my fingertips, the developer in me had left no time in exploring possibilities with Alexa. That’s when I found out about Lex, built on the same deep learning platform that Alexa works on, which allows developers to build conversational interfaces into their apps. With the dinner set out of the way, I sat back with my feet up on the table. I was awesome, baby! Oh crap! I forgot to buy the turkey, the potatoes, the wine and a whole load of other stuff. It was 3 AM and I started panicking. I remembered that mum always put the turkey in the fridge at least 3 days in advance. I had only 2! I didn’t even have the time to make it to the AmazonGo store. I was panicking again and called up Suzy to ask her if she could pick up the stuff for me. She sounded so calm over the phone when I narrated my horror to her. She simply told me to get the stuff from AmazonFresh. So I hastily disconnected the call and almost screamed to Alexa, “Alexa, find me a big bird!”, and before I realized what I had said, I was presented with this. 
[Image: Big Bird is one of the main protagonists in Sesame Street.] So I tried again, this time specifying what I actually needed! With AmazonDash integrating with AmazonFresh, I was able to get the turkey and other groceries delivered home in no time! What a blessing, indeed! A day before Thanksgiving, I was stuck in the office, working late on a new project. We usually tinkered around with a lot of ML and AI stuff. There was this project which needed the team to come up with a really innovative algorithm to perform a deep learning task. As the project lead, I was responsible for choosing the tech stack and I’m glad a little birdie had recently told me about AWS officially adopting MXNet as a Deep Learning Framework. MXNet made it a breeze to build ML applications that trained quickly and could run anywhere. Moreover, with the recent collaboration between Amazon and Microsoft, a new ML library called Gluon was born. Available in MXNet, Gluon made building ML models even easier and quicker, without compromising on performance. Need I say the project was successful? I got home that evening and sat down to pick a good flick or two to download from Amazon PrimeVideo. There’s always someone in the family who’d suggest we all watch a movie and I had to be prepared. With that done, I quickly showered and got to bed. It was going to be a long day the next day! At 4 AM my alarm rang and I was up! It was Thanksgiving, and what a wonderful day it was! I quickly got ready and prepared to start cooking. I got the bird out of the freezer and started to thaw it in cold water. It was a big bird so it was going to take some time. In the meantime, I cleaned up the house and then started working on the dressing. Apples, sausages, and cranberry. Yum! As I sliced up the sausages I realized that I had misjudged the quantity. I needed to get a couple more packets immediately!
I had to run to the grocery store right away or there would be a disaster! But it took me a few minutes to remember it was Thanksgiving, one of the craziest days to get out on the road. I could call the store delivery guy or probably Amazon Dash, but that would be illogical because he’d have to take the same congested roads to get to my home. I turned to Alexa for help, “Alexa, how do I get sausages delivered home in the next 30 minutes?”. And there I got my answer - Try Amazon PrimeAir. Now I don’t know about you, but having a drone deliver a couple of packs of sausages to my house is nothing short of thrilling! I sat it out near the window for the next 20 minutes, praying that the package wouldn’t be intercepted by some hungry birds! I couldn’t miss the sight of the pork flying towards my apartment. With the dressing and turkey baked and ready, things were shaping up much better than I had expected. The folks started rolling in by lunchtime. Mum and dad were both quite impressed with the way I had organized things. I was beaming and in my mind high-fived Amazon for helping me make everything possible with its amazing products and services designed to delight customers. It truly lives up to its slogan: Work hard. Have fun. Make history. If you are one of those folks who do this every day, behind the scenes, by building amazing products powered by machine learning and big data to make others' lives better, I want to thank you today for all your hard work. This Thanksgiving weekend, Packt's offering an unbelievable deal - Buy any book or video for just $10 or any three for $25! I know what I have my eyes on: Python Machine Learning - Second Edition by Sebastian Raschka and Vahid Mirjalili; Effective Amazon Machine Learning by Alexis Perrier; and OpenCV 3 - Advanced Image Detection and Reconstruction [Video] by Prof. Robert Laganiere. In the end, there’s nothing better than spending quality time with your family, enjoying a sumptuous meal, watching smiles all around and just being thankful for all you have. All I could say was, this Thanksgiving was truly Amazon fulfilled! :) Happy Thanksgiving folks!

Packt
03 Mar 2015
14 min read

Introducing Splunk

In this article by Betsy Page Sigman, author of the book Splunk Essentials, we introduce Splunk. Splunk, whose name was inspired by the process of exploring caves, or spelunking, helps analysts, operators, programmers, and many others explore data from their organizations by obtaining, analyzing, and reporting on it. This multinational company, cofounded by Michael Baum, Rob Das, and Erik Swan, has a core product called Splunk Enterprise. This manages searches, inserts, deletes, and filters, and analyzes big data that is generated by machines, as well as other types of data. They also have a free version that has most of the capabilities of Splunk Enterprise and is an excellent learning tool.

Understanding events, event types, and fields in Splunk

An understanding of events and event types is important before going further.

Events

In Splunk, an event is not just one of the many local user meetings that are set up between developers to help each other out (although those can be very useful), but also refers to a record of one activity that is recorded in a log file. Each event usually has:

  • A timestamp indicating the date and exact time the event was created
  • Information about what happened on the system that is being tracked

Event types

An event type is a way to allow users to categorize similar events. It is a field defined by the user. You can define an event type in several ways, and the easiest way is by using the SplunkWeb interface. One common reason for setting up an event type is to examine why a system has failed. Logins are often problematic for systems, and a search for failed logins can help pinpoint problems. For an interesting example of how to save a search on failed logins as an event type, visit http://docs.splunk.com/Documentation/Splunk/6.1.3/Knowledge/ClassifyAndGroupSimilarEvents#Save_a_search_as_a_new_event_type.

Why are events and event types so important in Splunk?
Because without events, there would be nothing to search, of course. And event types allow us to make meaningful searches easily and quickly according to our needs, as we'll see later.

Sourcetypes

Sourcetypes are also important to understand, as they help define the rules for an event. A sourcetype is one of the default fields that Splunk assigns to data as it comes into the system. It determines what type of data it is so that Splunk can format it appropriately as it indexes it. This also allows the user who wants to search the data to easily categorize it. Some of the common sourcetypes are listed as follows:

  • access_combined, for NCSA combined format HTTP web server logs
  • apache_error, for standard Apache web server error logs
  • cisco_syslog, for the standard syslog produced by Cisco network devices (including PIX firewalls, routers, and ACS), usually via remote syslog to a central log host
  • websphere_core, a core file export from WebSphere

(Source: http://docs.splunk.com/Documentation/Splunk/latest/Data/Whysourcetypesmatter)

Fields

Each event in Splunk is associated with a number of fields. The core fields of host, source, sourcetype, and timestamp are key to Splunk. These fields are extracted from events at multiple points in the data processing pipeline that Splunk uses, and each of these fields includes a name and a value. The name describes the field (such as the userid) and the value says what that field's value is (susansmith, for example). Some of these fields are default fields that are given because of where the event came from or what it is. Splunk uses these fields when it processes data, both at index time and at search time. For indexing, the default fields added include those of host, source, and sourcetype. When searching, Splunk is able to select from a bevy of fields that can either be defined by the user or are very basic, such as an action that results in a purchase (for a website event).
Fields are essential for doing the basic work of Splunk, that is, indexing and searching.

Getting data into Splunk

It's time to spring into action now and input some data into Splunk. Adding data is simple, easy, and quick. In this section, we will use some data and tutorials created by Splunk to learn how to add data.

Firstly, to obtain your data, visit the tutorial data at http://docs.splunk.com/Documentation/Splunk/6.1.5/SearchTutorial/GetthetutorialdataintoSplunk that is readily available on Splunk. Here, download the folder tutorialdata.zip. Note that this will be a fresh dataset that has been collected over the last 7 days. Download it but don't extract the data from it just yet.

You then need to log in to Splunk, using admin as the username and then by using your password. Once logged in, you will notice that toward the upper-right corner of your screen is the button Add Data, as shown in the following screenshot. Click on this button:

[Screenshot: Button to Add Data]

Once you have clicked on this button, you'll see a screen similar to the following screenshot:

[Screenshot: Add Data to Splunk by Choosing a Data Type or Data Source]

Notice here the different types of data that you can select, as well as the different data sources. Since the data we're going to use is a file, under Or Choose a Data Source, click on From files and directories.

Once you have clicked on this, you can then click on the radio button next to Skip preview, as indicated in the following screenshot, since you don't need to preview the data now. You then need to click on Continue:

[Screenshot: Preview data]

You can download the tutorial files at: http://docs.splunk.com/Documentation/Splunk/6.1.5/SearchTutorial/GetthetutorialdataintoSplunk

As shown in the next screenshot, click on Upload and index a file, find the tutorialdata.zip file you just downloaded (it is probably in your Downloads folder), and then click on More settings, filling it in as shown in the following screenshot.
(Note that you will need to select Segment in path under Host and type 1 under Segment Number.) Click on Save when you are done:

[Screenshot: Can specify source, additional settings, and source type]

Following this, you should see a screen similar to the following screenshot. Click on Start Searching; we will look at the data now:

[Screenshot: You should see this if your data has been successfully indexed into Splunk.]

You will now see a screen similar to the following screenshot. Notice that the number of events you have will be different, as will the time of the earliest event. At this point, click on Data Summary:

[Screenshot: The Search screen]

You should see the Data Summary screen like in the following screenshot. However, note that the Hosts shown here will not be the same as the ones you get. Take a quick look at what is on the Sources tab and the Sourcetypes tab. Then find the most recent data (in this case 127.0.0.1) and click on it.

[Screenshot: Data Summary, where you can see Hosts, Sources, and Sourcetypes]

After clicking on the most recent data, which in this case is bps-T341s, look at the events contained there. Later, when we use streaming data, we can see how the events at the top of this list change rapidly. Here, you will see a listing of events, similar to those shown in the following screenshot:

[Screenshot: Events lists for the host value]

You can click on the Splunk logo in the upper-left corner of the web page to return to the home page. Under Administrator at the top-right of the page, click on Logout.

Searching Twitter data

We will start here by doing a simple search of our Twitter index, which is automatically created by the app once you have enabled Twitter input (as explained previously). In our earlier searches, we used the default index (which the tutorial data was downloaded to), so we didn't have to specify the index we wanted to use. Here, we will use just the Twitter index, so we need to specify that in the search.
A simple search

Imagine that we wanted to search for tweets containing the word coffee. We could use the code presented here and place it in the search bar:

    index=twitter text=*coffee*

The preceding code searches only your Twitter index and finds all the places where the word coffee is mentioned. The asterisks are included before and after the text "coffee" because otherwise we would only get events where just "coffee" was tweeted, a rather rare occurrence, we expect. In fact, when we searched our indexed Twitter data without the asterisks around coffee, we got no results. (Note that the text field is not case sensitive, so tweets with either "coffee" or "Coffee" will be included in the search results.)

Examining the Twitter event

Before going further, it is useful to stop and closely examine the events that are collected as part of the search. The sample tweet shown in the following screenshot shows the large number of fields that are part of each tweet. The > was clicked to expand the event:

[Screenshot: A Twitter event]

There are several items to look closely at here:

  • _time: Splunk assigns a timestamp for every event. This is done in UTC (Coordinated Universal Time) format.
  • contributors: The value for this field is null, as are the values of many Twitter fields.
  • retweeted_status: Notice the {+} here; in the following event list, you will see there are a number of fields associated with this, which can be seen when the + is selected and the list is expanded. This is the case wherever you see a {+} in a list of fields:

[Screenshot: Various retweet fields]

In addition to those shown previously, there are many other fields associated with a tweet. The 140 character (maximum) text field that most people consider to be the tweet is actually a small part of the actual data collected.

The implied AND

If you want to search on more than one term, there is no need to add AND as it is already implied.
If, for example, you want to search for all tweets that include both the text "coffee" and the text "morning", then use:

    index=twitter text=*coffee* text=*morning*

If you don't specify text= for the second term and just put *morning*, Splunk assumes that you want to search for *morning* in any field. Therefore, you could get that word in another field in an event. This isn't very likely in this case, although coffee could conceivably be part of a user's name, such as "coffeelover". But if you were searching for other text strings, such as a computer term like log or error, such terms could be found in a number of fields. So specifying the field you are interested in would be very important.

The need to specify OR

Unlike AND, you must always specify the word OR. For example, to obtain all events that mention either coffee or morning, enter:

    index=twitter text=*coffee* OR text=*morning*

Finding other words used

Sometimes you might want to find out what other words are used in tweets about coffee. You can do that with the following search:

    index=twitter text=*coffee* | makemv text | mvexpand text | top 30 text

This search first searches for the word "coffee" in a text field, then creates a multivalued field from the tweet, and then expands it so that each word is treated as a separate piece of text. Then it takes the top 30 words that it finds.

You might be asking yourself how you would use this kind of information. This type of analysis would be of interest to a marketer, who might want to use words that appear to be associated with coffee in composing the script for an advertisement. The following screenshot shows the results that appear (1 of 2 pages). From this search, we can see that the words love, good, and cold might be words worth considering:

[Screenshot: Search of top 30 text fields found with *coffee*]

When you do a search like this, you will notice that there are a lot of filler words (a, to, for, and so on) that appear. You can do two things to remedy this.
You can increase the limit for top words so that you can see more of the words that come up, or you can rerun the search with the filler words excluded.

Note that "Coffee" (with a capital C) is listed (on the unshown second page) separately from "coffee". The reason for this is that while the search is not case sensitive (thus both "coffee" and "Coffee" are picked up when you search on "coffee"), the process of putting the text fields through the makemv and the mvexpand processes ends up distinguishing on the basis of case.

We could rerun the search, excluding some of the filler words, using the code shown here:

    index=twitter text=*coffee* | makemv text | mvexpand text | search NOT text="RT" AND NOT text="a" AND NOT text="to" AND NOT text="the" | top 30 text

Using a lookup table

Sometimes it is useful to use a lookup file to avoid having to use repetitive code. It would help us to have a list of all the small words that might be found often in a tweet just by the nature of each word's frequent use in language, so that we might eliminate them from our quest to find words that would be relevant for use in the creation of advertising. If we had a file of such small words, we could use a command indicating not to use any of these more common, irrelevant words when listing the top 30 words associated with our search topic of interest. Thus, for our search for words associated with the text "coffee", we would be interested in words like "dark", "flavorful", and "strong", but not words like "a", "the", and "then".

We can do this using a lookup command. There are three types of lookup commands, which are described as follows:

  • lookup: Matches a value of one field with a value of another, based on a .csv file with the two fields. Consider a lookup table named lutable that contains fields for machine_name and owner. Consider what happens when the following code snippet is used after a preceding search (indicated by . . . |):

        . . . | lookup lutable owner

    Splunk will use the lookup table to match the owner's name with its machine_name and add the machine_name to each event.

  • inputlookup: All fields in the .csv file are returned as results. If the following code snippet is used, both machine_name and owner would be searched:

        . . . | inputlookup lutable

  • outputlookup: Outputs search results to a lookup table. The following code saves the results of the preceding search directly into a table it creates:

        . . . | outputlookup newtable.csv

The command we will use here is inputlookup, because we want to reference a .csv file we can create that will include words that we want to filter out as we seek to find possible advertising words associated with coffee. Let's call the .csv file filtered_words.csv, and give it just a single text field, containing words like "is", "the", and "then". Let's rewrite the search to look like the following code:

    index=twitter text=*coffee* | makemv text | mvexpand text | search NOT [inputlookup filtered_words | fields text ] | top 30 text

Using the preceding code, Splunk will search our Twitter index for *coffee*, and then expand the text field so that individual words are separated out. Then it will look for words that do NOT match any of the words in our filtered_words.csv file, and finally output the top 30 most frequently found words among those.

As you can see, the lookup table can be very useful. To learn more about Splunk lookup tables, go to http://docs.splunk.com/Documentation/Splunk/6.1.5/SearchReference/Lookup.

Summary

In this article, we have learned more about how to use Splunk to create reports and dashboards. Splunk Enterprise Software, or Splunk, is an extremely powerful tool for searching, exploring, and visualizing data of all types. Splunk is becoming increasingly popular, as more and more businesses, both large and small, discover its ease and usefulness.
Analysts, managers, students, and others can quickly learn how to use the data from their systems, networks, web traffic, and social media to make attractive and informative reports. This is a straightforward, practical, and quick introduction to Splunk that should have you making reports and gaining insights from your data in no time.

Resources for Article

Further resources on this subject:

  • Lookups [article]
  • Working with Apps in Splunk [article]
  • Loading data, creating an app, and adding dashboards and reports in Splunk [article]
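As a rough illustration of the lookup-filtered search described in this article, the same steps can be sketched in plain Python. This is not how Splunk implements the search; it is a minimal sketch under stated assumptions. The sample tweets, the FILTERED_WORDS set (standing in for filtered_words.csv), and the top_words function are all hypothetical names invented for the example.

```python
from collections import Counter

# Hypothetical sample data standing in for the Twitter index.
TWEETS = [
    "RT I love coffee in the morning",
    "Coffee is good",
    "cold coffee is the best",
    "tea time",
]

# Hypothetical stand-in for filtered_words.csv (a single "text" column).
FILTERED_WORDS = {"RT", "a", "to", "the", "is", "in", "I"}


def top_words(tweets, term, filtered_words, limit=30):
    """Count the words that co-occur with `term`, skipping filtered words."""
    # Roughly text=*coffee*: case-insensitive substring match on the tweet.
    matching = [t for t in tweets if term.lower() in t.lower()]
    # Roughly makemv + mvexpand: treat every word as its own value.
    words = [w for t in matching for w in t.split()]
    # Roughly search NOT [inputlookup ...]: drop words found in the list.
    # Assumption of this sketch: the comparison is exact, so "Coffee" and
    # "coffee" stay separate, mirroring the case split the article observed.
    kept = [w for w in words if w not in filtered_words]
    # Roughly top 30 text: most frequent words first.
    return Counter(kept).most_common(limit)


if __name__ == "__main__":
    for word, count in top_words(TWEETS, "coffee", FILTERED_WORDS):
        print(word, count)
```

Keeping the filler words in one set, rather than chaining NOT clauses, is the same design choice the lookup table makes: the list can grow without the search itself changing.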

Amey Varangaonkar
22 Feb 2018
6 min read

How to combine data files within IBM SPSS Modeler

The following extract is taken from the book IBM SPSS Modeler Essentials, written by Keith McCormick and Jesus Salcedo. SPSS Modeler is one of the popularly used enterprise tools for data mining and predictive analytics.

In this article, we will explore how SPSS Modeler can be effectively used to combine different file types for efficient data modeling.

In many organizations, different pieces of information for the same individuals are held in separate locations. To be able to analyze such information within Modeler, the data files must be combined into one single file. The Merge node joins two or more data sources so that information held for an individual in different locations can be analyzed collectively. The following diagram shows how the Merge node can be used to combine two separate data files that contain different types of information:

[Diagram: combining two data files with the Merge node]

Like the Append node, the Merge node is found in the Record Ops palette. This node takes multiple data sources and creates a single source containing all or some of the input fields. Let's go through an example of how to use the Merge node to combine data files:

1. Open the Merge stream. The Merge stream contains the files we previously appended, as well as the main data file we were working with in earlier chapters.
2. Place a Merge node from the Record Ops palette on the canvas.
3. Connect the last Reclassify node to the Merge node.
4. Connect the Filter node to the Merge node.

Like the Append node, the order in which data sources are connected to the Merge node impacts the order in which the sources are displayed. The fields of the first source connected to the Merge node will appear first, followed by the fields of the second source connected to the Merge node, and so on.

5. Connect the Merge node to a Table node.
6. Edit the Merge node.
Since the Merge node can cope with a variety of different situations, the Merge tab allows you to specify the merging method. There are four methods for merging:

  • Order: Joins the first record in the first dataset with the first record in the second dataset, and so on. If any of the datasets run out of records, no further output records are produced. This method can be dangerous if any cases are missing from a file, or if files have been sorted differently.
  • Keys: The most commonly used method; records that have the same value in the field(s) defined as the key are merged. If multiple records contain the same value on the key field, all possible merges are returned.
  • Condition: Joins records from files that meet a specified condition.
  • Ranked condition: Specifies whether each row pairing in the primary dataset and all secondary datasets are to be merged; use the ranking expression to sort any multiple matches from low to high order.

Let's combine these files. To do this:

1. Set Merge Method to Keys. Fields contained in all input sources appear in the Possible keys list. To identify one or more fields as the key field(s), move the selected field into the Keys for merge list. In our case, there are two fields that appear in both files, ID and Year.
2. Select ID in the Possible keys list and place it into the Keys for merge list.

There are five major options for merging using a key field:

  • Include only matching records (inner join) merges only complete records, that is, records that are available in all datasets.
  • Include matching and non-matching records (full outer join) merges records that appear in any of the datasets; that is, the incomplete records are still retained. The undefined value ($null$) is added to the missing fields and included in the output.
  • Include matching and selected non-matching records (partial outer join) performs left and right outer joins. All records from the specified file are retained, along with only those records from the other file(s) that match records in the specified file on the key field(s). The Select... button allows you to designate which file is to contribute incomplete records.
  • Include records in first dataset not matching any others (anti-join) provides an easy way of identifying records in a dataset that do not have records with the same key values in any of the other datasets involved in the merge. This option retains only the records from the first dataset that match no records in the other datasets.
  • Combine duplicate key fields is the final option in this dialog, and it deals with the problem of duplicate field names (one from each dataset) when key fields are used. This option ensures that there is only one output field with a given name, and it is enabled by default.

The Filter tab

The Filter tab lists the data sources involved in the merge, and the ordering of the sources determines the field ordering of the merged data. Here, you can rename and remove fields. Earlier, we saw that the field Year appeared in both datasets; here we can remove one version of this field (we could also rename one version of the field to keep both). Click on the arrow next to the second Year field. The second Year field will no longer appear in the combined data file.

The Optimization tab

The Optimization tab provides two options that allow you to merge data more efficiently when one input dataset is significantly larger than the other datasets, or when the data is already presorted by all or some of the key fields that you are using to merge.

Click OK. Run the Table.

All of these files have now been combined. The resulting table should have 44 fields and 143,531 records.

We saw how the Merge node is used to join data files that contain different information for the same records.
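To make the difference between the key-merge options concrete, here is a minimal Python sketch. It is not Modeler code and does not reflect how the Merge node is implemented; the customers and purchases datasets, their field names, and the merge_by_key function are hypothetical. The sketch also assumes each key value appears at most once per dataset, so it does not cover the "all possible merges" case for duplicate keys, and it assumes the anti-join keeps only the first dataset's fields.

```python
# Hypothetical records keyed by ID; field names are illustrative only.
customers = {1: {"name": "Ann"}, 2: {"name": "Bob"}, 3: {"name": "Cy"}}
purchases = {2: {"total": 40}, 3: {"total": 15}, 4: {"total": 99}}


def merge_by_key(first, second, method="inner"):
    """Merge two {key: fields} datasets using one of four key-merge methods."""
    if method == "inner":        # only keys present in both datasets
        keys = first.keys() & second.keys()
    elif method == "full":       # keys present in either dataset
        keys = first.keys() | second.keys()
    elif method == "partial":    # every key from the first (selected) dataset
        keys = set(first)
    elif method == "anti":       # keys in the first dataset with no match
        keys = first.keys() - second.keys()
    else:
        raise ValueError(f"unknown method: {method}")

    # Field names from the datasets that contribute to the output.
    # Assumption: an anti-join outputs only the first dataset's fields.
    sources = [first] if method == "anti" else [first, second]
    fields = sorted({f for src in sources for rec in src.values() for f in rec})

    merged = {}
    for key in sorted(keys):
        row = {**first.get(key, {}), **second.get(key, {})}
        # Missing fields become None, standing in for Modeler's $null$.
        merged[key] = {f: row.get(f) for f in fields}
    return merged


if __name__ == "__main__":
    for method in ("inner", "full", "partial", "anti"):
        print(method, merge_by_key(customers, purchases, method))
```

Running the four methods side by side on the same small inputs shows which IDs survive each option and where undefined values appear, which is a quick way to sanity-check the choice before running a large merge.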
If you found this post useful, make sure to check out IBM SPSS Modeler Essentials for more information on leveraging SPSS Modeler to get faster, more efficient insights from your data.

Sugandha Lahoti
07 Dec 2017
6 min read

Top Research papers showcased at NIPS 2017 - Part 1

The 31st annual Conference on Neural Information Processing Systems (NIPS 2017) is under way in Long Beach, California, running from December 4-9, 2017. The six-day conference is hosting a number of invited talks, demonstrations, tutorials, and paper presentations on the latest in machine learning, deep learning, and AI research. This year the conference is bigger than ever, with a record-high 3,240 papers, of which 678 were selected, and a completely sold-out event. Researchers from Google, Microsoft, IBM, DeepMind, Facebook, and Amazon are among the prominent participants this year. Here is a quick roundup of some of the top research papers so far.

Generative Adversarial Networks

Generative Adversarial Networks (GANs) are a hot topic of research at the ongoing NIPS conference. GANs offer an easier way to train deep learning algorithms by using unlabelled data to cut the amount of data required for training. Here are a few research papers on GANs.

Regularization can stabilize training of GANs

Microsoft researchers have proposed a new regularization approach that yields a stable GAN training procedure at low computational cost. It addresses a fundamental limitation of GANs: a dimensional mismatch between the model distribution and the true distribution, which leaves their density ratio and the associated f-divergence undefined. Their paper, “Stabilizing Training of Generative Adversarial Networks through Regularization”, aims to turn GAN models into reliable building blocks for deep learning. They have also evaluated the approach on several datasets, including image generation tasks.

AdaGAN: Boosting GAN Performance

Training GANs can be hard. They can also suffer from the problem of missing modes, where the model is not able to produce examples in certain regions of the space.
Google researchers have developed an iterative procedure called AdaGAN in their paper “AdaGAN: Boosting Generative Models”. The approach is inspired by boosting algorithms, in which many potentially weak individual predictors are greedily aggregated to form a strong composite predictor. At every step, it adds a new component to a mixture model by running a GAN algorithm on a re-weighted sample. The paper also addresses the problem of missing modes.

Houdini: Generating Adversarial Examples

Generating adversarial examples is considered a critical milestone for evaluating and improving the robustness of learning machines. However, current methods are confined to classification tasks and are not designed to alter the performance measure of the problem at hand. To tackle this issue, Facebook researchers have published “Houdini: Fooling Deep Structured Prediction Models”, a novel and flexible approach for generating adversarial examples tailored specifically to the final performance measure of the task considered, even when that measure is combinatorial and non-decomposable.

Stochastic hard-attention for Memory Addressing in GANs

DeepMind researchers showcased a new method that uses stochastic hard-attention to retrieve memory content in generative models. Their paper, “Variational memory addressing in generative models”, was presented on the second day of the conference and is an advancement over the popular differentiable soft-attention mechanism. The new technique allows developers to apply variational inference to memory addressing. This leads to more precise memory lookups using target information, especially in models with large memory buffers and many confounding entries in the memory.

Image and Video Processing

There was also a lot of buzz around sophisticated models and techniques for image and video processing. Here is a quick glance at the top presentations.
Fader Networks: Image manipulation through disentanglement

Facebook researchers have introduced Fader Networks in their paper titled “Fader Networks: Manipulating Images by Sliding Attributes”. Fader Networks use an encoder-decoder architecture to reconstruct images by disentangling their salient information and the values of particular attributes directly in a latent space. Disentanglement makes it possible to manipulate these attributes to generate variations of pictures of faces while preserving their naturalness. This approach results in much simpler training schemes and scales to manipulating multiple attributes jointly.

Visual interaction networks for Video simulation

Another paper, titled “Visual interaction networks: Learning a physics simulator from video”, proposes a new neural-network model to learn about physical objects without prior knowledge. DeepMind’s Visual Interaction Network is used for video analysis and is able to infer the states of multiple physical objects from just a few frames of video. It then uses these to predict object positions many steps into the future. It can also deduce the locations of invisible objects.

Transfer, Reinforcement, and Continual Learning

A lot of research is going on in the fields of transfer, reinforcement, and continual learning to make deep learning models stable and powerful. Here are a few research papers presented in this domain.

Two new techniques for Transfer Learning

Currently, a large set of input/output (I/O) examples is required for learning any underlying input-output mapping. By leveraging information from related tasks, researchers at Microsoft have addressed the problem of data and computation efficiency in program induction. Their paper “Neural Program Meta-Induction” uses two approaches for cross-task knowledge transfer.
The first is portfolio adaptation, where a set of induction models is pretrained on a set of related tasks, and the best model is adapted to the new task using transfer learning. The second is meta program induction, a k-shot learning approach that lets a model generalize to new tasks without requiring any additional training.

Hybrid Reward Architecture to solve the problem of generalization in Reinforcement Learning

A new paper from Microsoft, “Hybrid Reward Architecture for Reinforcement Learning”, presents a method to address the generalization problem faced by typical deep RL methods. Hybrid Reward Architecture (HRA) takes a decomposed reward function as input and learns a separate value function for each component reward function. This is especially useful in domains where the optimal value function cannot easily be reduced to a low-dimensional representation. In the new approach, the overall value function is much smoother and can be approximated more easily by a low-dimensional representation, enabling more effective learning.

Gradient Episodic Memory to counter catastrophic forgetting in continual learning models

Continual learning is all about improving the ability of models to solve sequential tasks without forgetting previously acquired knowledge. In the paper “Gradient Episodic Memory for Continual Learning”, Facebook researchers have proposed a set of metrics to evaluate models over a continuous series of data. These metrics characterize models by their test accuracy and their ability to transfer knowledge across tasks. The researchers have also proposed a model for continual learning, called Gradient Episodic Memory (GEM), that reduces the problem of catastrophic forgetting, and have experimented with variants of the MNIST and CIFAR-100 datasets to demonstrate the performance of GEM compared to other methods.
In our next post, we will cover a selection of papers presented so far at NIPS 2017 in the areas of Predictive Modelling, Machine Translation, and more. For live content coverage, you can visit NIPS’ Facebook page.

Sugandha Lahoti
15 Dec 2017
7 min read

How Google's MapReduce works and why it matters for Big Data projects

The article below is a book extract from Java Data Analysis, written by John R. Hubbard. The book shows you how to get the most out of popular Java libraries and tools to perform efficient data analysis.

In this article, we will explore Google’s MapReduce framework for analyzing big data. How do you quickly sort a list of a billion elements? Or multiply two matrices, each with a million rows and a million columns? In implementing their PageRank algorithm, Google quickly discovered the need for a systematic framework for processing massive datasets. That could be done only by distributing the data and the processing over many storage units and processors. Implementing a single algorithm, such as PageRank, in that environment is difficult, and maintaining the implementation as the dataset grows is even more challenging.

The solution: MapReduce framework

The answer lay in separating the software into two levels: a framework that manages the big data access and parallel processing at a lower level, and a couple of user-written methods at an upper level. The independent user who writes the two methods need not be concerned with the details of the big data management at the lower level.

How does it function

Specifically, the data flows through a sequence of stages:

The input stage divides the input into chunks, usually 64MB or 128MB.
The mapping stage applies a user-defined map() function that generates from one key-value pair a larger collection of key-value pairs of a different type.
The partition/grouping stage applies hash sharding to those keys to group them.
The reduction stage applies a user-defined reduce() function to apply some specific algorithm to the data in the value of each key-value pair.
The output stage writes the output from the reduce() method.

The user's choice of map() and reduce() methods determines the outcome of the entire process; hence the name MapReduce.
This idea is a variation on the old algorithmic paradigm called divide and conquer. Think of the prototypical mergesort, where an array is sorted by repeatedly dividing it into two halves until the pieces have only one element, and then the pieces are systematically merged back together pairwise. MapReduce is actually a meta-algorithm: a framework within which specific algorithms can be implemented through its map() and reduce() methods. Extremely powerful, it has been used to sort a petabyte of data in only a few hours. Recall that a petabyte is 1000^5 = 10^15 bytes, which is a thousand terabytes or a million gigabytes.

Some examples of MapReduce applications

Here are a few examples of big data problems that can be solved with the MapReduce framework:

Given a repository of text files, find the frequency of each word. This is called the WordCount problem.
Given a repository of text files, find the number of words of each word length.
Given two matrices in a sparse matrix format, compute their product.
Factor a matrix given in sparse matrix format.
Given a symmetric graph whose nodes represent people and edges represent friendship, compile a list of common friends.
Given a symmetric graph whose nodes represent people and edges represent friendship, compute the average number of friends by age.
Given a repository of weather records, find the global minima and maxima for each year.
Sort a large list. Note that in most implementations of the MapReduce framework, this problem is trivial, because the framework automatically sorts the output from the map() function.
Reverse a graph.
Find a minimal spanning tree (MST) of a given weighted graph.
Join two large relational database tables.

The WordCount example

In this section, we present the MapReduce solution to the WordCount problem, sometimes called the Hello World example for MapReduce. The diagram in the figure below shows the data flow for the WordCount program.
On the left are two of the 80 files that are read into the program. During the mapping stage, each word, followed by the number 1, is copied into a temporary file, one pair per line. Notice that many words are duplicated many times. For example, image appears five times among the 80 files (including both files shown), so the string image 1 will appear five times in the temporary file. Each of the input files has about 110 words, so over 8,000 word-number pairs will be written to the temporary file. Note that this figure shows only a very small part of the data involved.

The output from the mapping stage includes every word that is input, as many times as it appears. The output from the grouping stage includes every one of those words, but without duplication. The grouping process reads all the words from the temporary file into a key-value hash table, where the key is the word and the value is a string of 1s, one for each occurrence of that word in the temporary file. Notice that the 1s written to the temporary file are not actually used. They are included simply because the MapReduce framework in general expects the map() function to generate key-value pairs. The reducing stage transcribes the contents of the hash table to an output file, replacing each string of 1s with the number of them. For example, the key-value pair ("book", "1 1 1 1") is written as book 4 in the output file.

Keep in mind that this is a toy example of the MapReduce process. The input consists of 80 text files containing 9073 words in all. So, the temporary file has 9073 lines, with one word per line. Only 2149 of those words are distinct, so the hash table has 2149 entries and the output file has 2149 lines, with one word per line.
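The three WordCount stages described above can be sketched on a single machine as follows. This is only an illustration under stated assumptions, not the book's listing and not a distributed implementation: the class name WordCountSketch and its methods are hypothetical, the "temporary file" is replaced by an in-memory list of pairs, words are lower-cased and split on anything that is not a letter a to z, and the code assumes Java 9 or later (for Map.entry and List.of).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-machine illustration of the map, group, and reduce stages.
final class WordCountSketch {

    // Mapping stage: emit one (word, 1) pair for every word in a line of text.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("[^a-z]+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Grouping stage: collect the 1s emitted for each distinct word.
    static Map<String, List<Integer>> group(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reducing stage: replace the list of 1s for a word with the number of them.
    static int reduce(String word, List<Integer> ones) {
        return ones.size();
    }

    // Runs the three stages in sequence over all input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : group(pairs).entrySet()) {
            counts.put(entry.getKey(), reduce(entry.getKey(), entry.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("The book and the image", "A book");
        wordCount(lines).forEach((word, count) -> System.out.println(word + " " + count));
    }
}
```

As in the description, reduce() never looks at the values themselves; it only counts how many 1s were grouped under each word.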
The main idea So, this is the main idea of the MapReduce meta-algorithm: provide a framework for processing massive datasets, a framework that allows the independent programmer to plug in specialized map() and reduce() methods that actually implement the required particular algorithm. If that particular algorithm is to count words, then write the map() method to extract each individual word from a specified file and write the key-value pair (word, 1) to wherever the specified writer will put them, and write the reduce() method to take a key-value pair such as (word, 1 1 1 1) and return the corresponding key-value pair as (word, 4) to wherever its specified writer will put it. These two methods are completely localized—they simply operate on key-value pairs. And, they are completely independent of the size of the dataset. The diagram below illustrates the general flow of data through an application of the MapReduce framework: The original dataset could be in various forms and locations: a few files in a local directory, a large collection of files distributed over several nodes on the same cluster, a database on a database system (relational or NoSQL), or data sources available on the World Wide Web. The MapReduce controller then carries out these five tasks: Split the data into smaller datasets, each of which can be easily accessed on a single machine. Simultaneously (that is, in parallel), run a copy of the user-supplied map() method, one on each dataset, producing a set of key-value pairs in a temporary file on that local machine. Redistribute the datasets among the machines, so that all instances of each key are in the same dataset. This is typically done by hashing the keys. Simultaneously (in parallel), run a copy of the user-supplied reduce() method, one on each of the temporary files, producing one output file on each machine. Combine the output files into a single result. 
If the reduce() method also sorts its output, then this last step could also include merging those outputs. The genius of the MapReduce framework is that it separates the data management (moving, partitioning, grouping, sorting, and so on) from the data crunching (counting, averaging, maximizing, and so on). The former is done with no attention required by the user. The latter is done in parallel, separately in each node, by invoking the two user-supplied methods map() and reduce(). Essentially, the only obligation of the user is to devise the correct implementations of these two methods that will solve the given problem. As we mentioned earlier, these examples are presented mainly to elucidate how the MapReduce algorithm works. Real-world implementations would, however, use MongoDB or Hadoop frameworks. If you enjoyed this excerpt, check out the book Java Data Analysis to get an understanding of the various data analysis techniques, and how to implement them using Java.  

Richard Gall
16 Mar 2018
2 min read

Prime numbers, modern encryption and their ancient Greek connection!

Prime numbers are incredibly important in a number of fields, from computer science to cybersecurity. But they are also incredibly mysterious; they are quite literally enigmas. They have been the subject of thousands of years of research and exploration, but they have still not been cracked: we are yet to find a formula that will generate prime numbers easily. Prime numbers are particularly integral to modern encryption. When a file is encrypted, the number used to do so is built using two primes. Without the key, decrypting it means working out the prime factors of that gargantuan number, a task that is almost impossible even with the most extensive computing power currently at our disposal. Prime numbers are also used in error-correcting codes and in mass storage and data transmission.

Did you know the Greeks were among the early champions of the ideas behind our modern encryption systems? Find out how to generate prime numbers manually using a method devised by the Greek mathematician Eratosthenes in this fun video from the Packt video course Fundamental Algorithms in Scala.

https://www.youtube.com/watch?v=cd8v-Jo8obs&t=56s
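For readers who prefer code to video, the method attributed to Eratosthenes (the sieve) can be sketched in a few lines. This is a minimal illustration written in Java rather than the Scala used in the course, it is not taken from the video, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the sieve of Eratosthenes.
final class SieveSketch {

    // Returns all primes p with 2 <= p <= limit, in increasing order.
    static List<Integer> primesUpTo(int limit) {
        List<Integer> primes = new ArrayList<>();
        if (limit < 2) {
            return primes;
        }
        boolean[] composite = new boolean[limit + 1];
        for (int n = 2; n <= limit; n++) {
            if (composite[n]) {
                continue; // n was crossed out as a multiple of a smaller prime
            }
            primes.add(n);
            // Cross out multiples of n, starting at n * n; smaller multiples
            // were already crossed out by smaller primes. long avoids overflow.
            for (long m = (long) n * n; m <= limit; m += n) {
                composite[(int) m] = true;
            }
        }
        return primes;
    }

    public static void main(String[] args) {
        System.out.println(primesUpTo(30)); // prints [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
    }
}
```

The sieve finds every prime up to a chosen limit, but it does not help with factoring the very large numbers used in encryption, which is why those remain hard to break.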
Richa Tripathi
27 Dec 2017
6 min read

Storing Apache Storm data in Elasticsearch

This article is an excerpt from Mastering Apache Storm by Ankit Jain. The book explores the real-time processing functionality offered by Apache Storm, such as parallelism, data partitioning, and more.

In this article, we are going to cover how to store the data processed by Apache Storm in Elasticsearch. Elasticsearch is an open source, distributed search engine platform developed on Lucene. It provides a multitenant-capable, full-text search engine capability.

Though Apache Storm is meant for real-time data processing, in most cases we need to store the processed data in a data store so that it can be used for further batch analysis and queried later.

We assume that Elasticsearch is running in your environment. If you don't have a running Elasticsearch cluster, please refer to https://www.elastic.co/guide/en/elasticsearch/reference/2.3/_installation.html to install Elasticsearch on any of your boxes. Go through the following steps to integrate Storm with Elasticsearch:

1. Create a Maven project using com.stormadvance for the groupID and storm_elasticsearch for the artifactID.

2. Add the following dependencies to the pom.xml file:

    <dependencies>
      <dependency>
        <groupId>org.elasticsearch</groupId>
        <artifactId>elasticsearch</artifactId>
        <version>2.4.4</version>
      </dependency>
      <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>3.8.1</version>
        <scope>test</scope>
      </dependency>
      <dependency>
        <groupId>org.apache.storm</groupId>
        <artifactId>storm-core</artifactId>
        <version>1.0.2</version>
        <scope>provided</scope>
      </dependency>
    </dependencies>

3. Create an ElasticSearchOperation class in the com.stormadvance.storm_elasticsearch package. The ElasticSearchOperation class contains the following method:

insert(Map<String, Object> data, String indexName, String indexMapping, String indexId): This method takes the record data, indexName, indexMapping, and indexId as input, and inserts the input record in Elasticsearch.

The following is the source code of the ElasticSearchOperation class:

    package com.stormadvance.storm_elasticsearch;

    import java.net.InetAddress;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.UUID;

    import org.elasticsearch.client.transport.TransportClient;
    import org.elasticsearch.common.settings.Settings;
    import org.elasticsearch.common.transport.InetSocketTransportAddress;

    public class ElasticSearchOperation {

        private TransportClient client;

        // Connects a transport client to every Elasticsearch node in esNodes (port 9300).
        public ElasticSearchOperation(List<String> esNodes) throws Exception {
            Settings settings = Settings.settingsBuilder()
                    .put("cluster.name", "elasticsearch").build();
            client = TransportClient.builder().settings(settings).build();
            for (String esNode : esNodes) {
                client.addTransportAddress(new InetSocketTransportAddress(
                        InetAddress.getByName(esNode), 9300));
            }
        }

        // Indexes one record under the given index name, mapping (type), and document ID.
        public void insert(Map<String, Object> data, String indexName,
                String indexMapping, String indexId) {
            client.prepareIndex(indexName, indexMapping, indexId)
                    .setSource(data).get();
        }

        // Small manual test: inserts a single record into a local node.
        public static void main(String[] s) {
            try {
                List<String> esNodes = new ArrayList<String>();
                esNodes.add("127.0.0.1");
                ElasticSearchOperation elasticSearchOperation =
                        new ElasticSearchOperation(esNodes);
                Map<String, Object> data = new HashMap<String, Object>();
                data.put("name", "name");
                data.put("add", "add");
                elasticSearchOperation.insert(data, "indexName", "indexMapping",
                        UUID.randomUUID().toString());
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

4. Create a SampleSpout class in the com.stormadvance.storm_elasticsearch package. This class generates random records and passes them to the next action (bolt) in the topology. The following is the format of the record generated by the SampleSpout class:

    ["john","anderson","abc"]

The following is the source code of the SampleSpout class:

    package com.stormadvance.storm_elasticsearch;

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Random;

    import org.apache.storm.spout.SpoutOutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichSpout;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Values;

    public class SampleSpout extends BaseRichSpout {

        private static final long serialVersionUID = 1L;
        private SpoutOutputCollector spoutOutputCollector;

        private static final Map<Integer, String> FIRSTNAMEMAP = new HashMap<Integer, String>();
        static {
            FIRSTNAMEMAP.put(0, "john");
            FIRSTNAMEMAP.put(1, "nick");
            FIRSTNAMEMAP.put(2, "mick");
            FIRSTNAMEMAP.put(3, "tom");
            FIRSTNAMEMAP.put(4, "jerry");
        }

        private static final Map<Integer, String> LASTNAME = new HashMap<Integer, String>();
        static {
            LASTNAME.put(0, "anderson");
            LASTNAME.put(1, "watson");
            LASTNAME.put(2, "ponting");
            LASTNAME.put(3, "dravid");
            LASTNAME.put(4, "lara");
        }

        private static final Map<Integer, String> COMPANYNAME = new HashMap<Integer, String>();
        static {
            COMPANYNAME.put(0, "abc");
            COMPANYNAME.put(1, "dfg");
            COMPANYNAME.put(2, "pqr");
            COMPANYNAME.put(3, "ecd");
            COMPANYNAME.put(4, "awe");
        }

        public void open(Map conf, TopologyContext context,
                SpoutOutputCollector spoutOutputCollector) {
            // Open the spout
            this.spoutOutputCollector = spoutOutputCollector;
        }

        public void nextTuple() {
            // Storm repeatedly calls this method to emit a continuous stream of tuples.
            final Random rand = new Random();
            // Generate a random number from 0 to 4; the same index is used for all three maps.
            int randomNumber = rand.nextInt(5);
            spoutOutputCollector.emit(new Values(
                    FIRSTNAMEMAP.get(randomNumber),
                    LASTNAME.get(randomNumber),
                    COMPANYNAME.get(randomNumber)));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // Emits the fields firstName, lastName, and companyName.
            declarer.declare(new Fields("firstName", "lastName", "companyName"));
        }
    }

5. Create an ESBolt class in the com.stormadvance.storm_elasticsearch package. This bolt receives the tuples emitted by the SampleSpout class, converts each one to a Map structure, and then calls the insert() method of the ElasticSearchOperation class to insert the record into Elasticsearch. The following is the source code of the ESBolt class:

    package com.stormadvance.storm_elasticsearch;

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.UUID;

    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.IBasicBolt;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.tuple.Tuple;

    public class ESBolt implements IBasicBolt {

        private static final long serialVersionUID = 2L;
        private ElasticSearchOperation elasticSearchOperation;
        private List<String> esNodes;

        /**
         * @param esNodes host names or IP addresses of the Elasticsearch nodes
         */
        public ESBolt(List<String> esNodes) {
            this.esNodes = esNodes;
        }

        public void execute(Tuple input, BasicOutputCollector collector) {
            // Copy the three tuple fields into a map and index it under a random ID.
            Map<String, Object> personalMap = new HashMap<String, Object>();
            personalMap.put("firstName", input.getValueByField("firstName"));
            personalMap.put("lastName", input.getValueByField("lastName"));
            personalMap.put("companyName", input.getValueByField("companyName"));
            elasticSearchOperation.insert(personalMap, "person", "personmapping",
                    UUID.randomUUID().toString());
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // This bolt emits nothing downstream.
        }

        public Map<String, Object> getComponentConfiguration() {
            return null;
        }

        public void prepare(Map stormConf, TopologyContext context) {
            try {
                // Create the ElasticSearchOperation instance on the worker.
                elasticSearchOperation = new ElasticSearchOperation(esNodes);
            } catch (Exception e) {
                // Pass the cause along so the connection failure is visible in the logs.
                throw new RuntimeException(e);
            }
        }

        public void cleanup() {
        }
    }

6. Create an ESTopology class in the com.stormadvance.storm_elasticsearch package. This class creates an instance of the spout and bolt classes and chains them together using a TopologyBuilder class. The following is the implementation of the main class:

    package com.stormadvance.storm_elasticsearch;

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.storm.Config;
    import org.apache.storm.LocalCluster;
    import org.apache.storm.generated.AlreadyAliveException;
    import org.apache.storm.generated.InvalidTopologyException;
    import org.apache.storm.topology.TopologyBuilder;

    public class ESTopology {

        public static void main(String[] args)
                throws AlreadyAliveException, InvalidTopologyException {
            TopologyBuilder builder = new TopologyBuilder();

            // ES node list
            List<String> esNodes = new ArrayList<String>();
            esNodes.add("10.191.209.14");

            // Set the spout class
            builder.setSpout("spout", new SampleSpout(), 2);
            // Set the ES bolt class
            builder.setBolt("bolt", new ESBolt(esNodes), 2)
                    .shuffleGrouping("spout");

            Config conf = new Config();
            conf.setDebug(true);

            // Create an instance of the LocalCluster class for
            // executing the topology in local mode.
            LocalCluster cluster = new LocalCluster();
            // ESTopology is the name of the submitted topology.
            cluster.submitTopology("ESTopology", conf, builder.createTopology());
            try {
                Thread.sleep(60000);
            } catch (Exception exception) {
                System.out.println("Thread interrupted exception : " + exception);
            }
            System.out.println("Stopped Called : ");
            // Kill the topology, using the same name it was submitted under.
            cluster.killTopology("ESTopology");
            // Shut down the Storm test cluster.
            cluster.shutdown();
        }
    }

To summarize, we covered how to store the data processed by Apache Storm in Elasticsearch by connecting to the Elasticsearch nodes from inside the Storm bolts. If you enjoyed this post, check out the book Mastering Apache Storm to learn more about the different types of real-time processing techniques used to create distributed applications.

Packt
07 Jan 2010
8 min read

Features and utilities in SQL Developer Data Modeler

Oracle SQL Developer Data Modeler is available as an independent product, providing a focused data modeling tool for data architects and designers. There is also a Data Modeler Viewer extension to SQL Developer, which allows users to open previously created data models and to create read-only models of their database schemas. SQL Developer Data Modeler is a vast tool, supporting the design of logical Entity Relationship Diagrams and relational models, with forward and reverse engineering capabilities between the two. It supports multi-dimensional, data flow, data type, and physical models, and allows files to be imported from a variety of sources and exported to a variety of destinations. It allows users to set naming conventions and verify designs using a set of predefined design rules.

Each of these topics is extensive, so in this two-part article by Sue Harper (author of Oracle SQL Developer 2.1) we'll review a few of the areas, illustrate how you can use them, and highlight a few key features, using the independent, standalone release of SQL Developer Data Modeler. We'll include a brief review of the integration points of the Data Modeler Viewer extension to SQL Developer. The product offers support for Oracle and non-Oracle databases. In the interest of time and space, we have elected to work only with the Oracle database.

Oracle SQL Developer Data Modeler

SQL Developer Data Modeler is a lightweight tool that gives application and database developers a quick and easy way of diagrammatically displaying their data structures, making changes, and submitting the new changes to update a schema. In this article, we will not attempt to teach data modeling (except to provide some generally accepted definitions). Instead, we will discuss how the product supports data modeling and a few of the features provided. There are a variety of books available on the subject that describe and define modeling best practice.
Feature overview

The Data Modeler supports a number of graphical models and a selection of text-based models. The graphical models are:

Logical: this is the entity relationship model or Entity Relationship Diagram (ERD), and comprises entities, attributes, and relationships.
Relational: this is the schema or database model and is comprised of tables, columns, views, and constraints. In SQL Developer Data Modeler, these models are database independent, and need to be associated with the physical model to support database-specific DDL.
Data Types: this is the model that supports modeling SQL99 structured types and viewing inheritance hierarchies. The data types modeled here are used in both the logical and relational models.
Multidimensional models: these models support fact, dimension, and summary classifications for multi-dimensional models.
Data Flow: these models support the definition of primitive, composite, and transformational tasks.

The following support these graphical models:

Domains: these allow you to define and reuse a data type with optional constraints or allowable values. You can use domains in the Logical and Relational models.
Physical: this model is associated with a relational model and defines the physical attributes for a specific database and version.
Business Information: this allows you to model or document the business details that support a design.

Tying these graphical and textual models together are a variety of utilities, which include:

Forward and reverse engineering between the Logical and Relational models
Import from various databases
Export, including DDL script generation, for various databases
Design Rules for verifying standards and completeness
Name templates, glossary, and abbreviation files for supporting naming standards

Integrated architecture

SQL Developer Data Modeler is made up of a number of layers, which have a tightly synchronized relationship.
The Logical model is thought of as the core of the product, providing the starting point for any design, and feeding details into other models. The following diagram shows an illustration of how the models relate to each other. The logical ERD provides the basis for one or more relational models, and each of these feeds into one or more physical models, which are in turn used for the DDL generation. You can create separate data types models and use the defined data type in either the logical or relational models. Both relational and logical models can have multiple subviews created, and each subview can have many displays created.

Getting started

SQL Developer Data Modeler is an independent product and, with the exception of the Data Modeler Viewer extension to SQL Developer 2.1, is not packaged with other Oracle tools. You can download it and install it in a directory of your choice, with no impact on other tools. To install, simply unzip the file.

Installing and setting up the environment

Getting started with SQL Developer Data Modeler is straightforward. Follow the links from the Data Modeler site on OTN, http://www.oracle.com/technology/products/database/datamodeler, to the download location. You are offered a choice of files to download:

For Microsoft Windows, a ZIP file with or without the JRE included
For Mac OS X, a ZIP file without the JRE included
For Linux, a ZIP file without the JRE included

For any of these ZIP files, extract the file contents and run datamodeler.exe, which is in the top-level /datamodeler folder, or in the /datamodeler/bin folder. For Linux, use the datamodeler.sh executable. If the file you choose does not include a JRE, you will be prompted on startup for the location of your installed JRE. The minimum supported release is JRE 1.6 update 6.0.
Oracle clients and JDBC drivers

If you are designing and building a model from scratch, or have access to the DDL script file for importing models, then you do not need access to a database. However, if you want to import from a database, you'll need to create a database connection. In this case, there is no need for an Oracle client in your development environment because you can use the thin JDBC driver to connect to the database. SQL Developer Data Modeler also supports TNS aliases. Therefore, if you have access to a tnsnames.ora file, or have other Oracle software installed in your environment, you can use the tnsnames.ora file to make the database connection if and when required.

Creating your first models

The Data Modeler browser starts with empty Logical and Relational models. This allows you to start a new design and build a model from scratch, whether a logical model with entities and attributes, or a relational model with tables and columns. The Data Modeler also supports importing metadata from a variety of sources:
Importing metadata from: DDL scripts, the data dictionary
Importing from other modeling tools: Oracle Designer, CA Erwin 4.x
Importing other formats: VAR file, XMLA (Microsoft, Hyperion)

These choices are available from the File | Import context menu.

Once you have created and saved your models, you can open them or share them with colleagues. To open an existing model, use the menu:
File | Open: browse to the location of the files, which then opens the full design with all of the saved models
File | Recent Designs: opens the full design, with all of the saved models, with no need to first search for the location
File | Import | Data Modeler Design: more granular, offering a choice of models saved in a set of models

Recent diagrams

Use File | Recent Diagrams to display a list of all diagrams you have recently worked on and saved.
Using this approach saves you from needing to browse to the location of the stored files.

Importing from the Data Dictionary

There are many ways to start using the tool, including simply starting to draw any one of the model types mentioned. In the screenshot shown earlier, we highlighted the File | Import | Data Dictionary option. Using this option allows you to import from Oracle 9i, Oracle 10g, Oracle Database 11g, Microsoft SQL Server 2000 and 2005, and IBM DB2 LUW Versions 7 and 8.

Creating a database connection

Before you can import from any database, you need to create a database connection for each database you'll connect to. Once it is created, you'll see all of the schemas in the database and the objects you have access to. Access the New Database Connection dialog from the File | Import wizard. If you have no connections, click on Add to create a new connection. For a Basic connection, you need to provide the Hostname of the database server, the Port, and the SID. The connection dialog also supports a TNS alias and an advanced JDBC URL. Before you can add connections for non-Oracle databases, you need to add the required JDBC drivers. To add these drivers, use Tools | General Options | Third Party JDBC Drivers.
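To make the Basic connection fields more concrete, here is a minimal sketch of how a hostname, port, and SID map onto an Oracle thin JDBC URL of the kind you could paste into the advanced JDBC URL field. The helper name oracle_thin_url, the default port 1521, and the sample host and SID values are assumptions made for illustration; the Data Modeler builds its own URL from the dialog fields, and this is not its implementation.

```python
def oracle_thin_url(host, port=1521, sid=None, service_name=None):
    """Build an Oracle thin-driver JDBC URL (hypothetical helper).

    Exactly one of `sid` or `service_name` must be given.
    """
    if (sid is None) == (service_name is None):
        raise ValueError("provide exactly one of sid or service_name")
    if sid is not None:
        # SID form: jdbc:oracle:thin:@host:port:SID
        return f"jdbc:oracle:thin:@{host}:{port}:{sid}"
    # Service-name form: jdbc:oracle:thin:@//host:port/service
    return f"jdbc:oracle:thin:@//{host}:{port}/{service_name}"


# Sample values below are placeholders, not taken from the article.
print(oracle_thin_url("dbhost", 1521, sid="ORCL"))
print(oracle_thin_url("dbhost", service_name="orcl.example.com"))
```

The SID form corresponds to the three Basic fields (Hostname, Port, SID); the service-name form is shown only as a common alternative.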

Sunith Shetty
31 Jan 2018
6 min read

How to run Spark in Mesos

This article is an excerpt from the book Learning Apache Spark 2 by Muhammad Asif Abbasi. In the book, you will learn how to perform big data analytics using Spark Streaming, machine learning techniques, and more. In this article, you will learn how to run Spark on the Mesos cluster manager.

What is Mesos?

Mesos is an open source cluster manager that started as a UC Berkeley research project in 2008 and is now widely used by a number of organizations. Spark supports Mesos, and Matei Zaharia gave a keynote at MesosCon in June 2016. A video of the keynote is available on YouTube.

Before you start

If you haven't installed Mesos previously, the getting started page on the Apache website gives a good walkthrough of installing Mesos on Windows, macOS, and Linux: https://mesos.apache.org/getting-started/

Once Mesos is installed, you need to start it up on your cluster.

Start the Mesos master:
./bin/mesos-master.sh --ip=[MasterIP] --work_dir=/var/lib/mesos

Start Mesos agents on all your worker nodes:
./bin/mesos-agent.sh --master=[MasterIP]:5050 --work_dir=/var/lib/mesos

Make sure Mesos is up and running with all your relevant worker nodes configured by opening the web UI: http://[MasterIP]:5050

Make sure that the Spark binary packages are available and accessible by Mesos. They can be placed on any Hadoop-accessible URI, for example:
HTTP via http://
S3 via s3n://
HDFS via hdfs://

You can also install Spark in the same location on all the Mesos slaves, and configure spark.mesos.executor.home to point to that location.
Running in Mesos

Mesos can have a single master or multiple masters, which means the master URL differs when submitting an application from Spark via Mesos:
Single master: mesos://sparkmaster:5050
Multiple masters (using ZooKeeper): mesos://zk://master1:2181,master2:2181/mesos

Modes of operation in Mesos

Mesos supports both the client and cluster modes of operation.

Client mode

Before running in client mode, you need to perform a couple of configuration steps:
In spark-env.sh, export MESOS_NATIVE_JAVA_LIBRARY=<path to libmesos.so [Linux] or libmesos.dylib [macOS]>
In spark-env.sh, export SPARK_EXECUTOR_URI=<URI of the zipped Spark file uploaded to an accessible location, e.g. HTTP, HDFS, S3>
Set spark.executor.uri to the URI of the zipped Spark file uploaded to an accessible location, e.g. HTTP, HDFS, S3

Batch applications

For batch applications, pass the Mesos URL as the master in your application program when creating your Spark context. For example:

import org.apache.spark.{SparkConf, SparkContext}

val sparkConf = new SparkConf()
  .setMaster("mesos://mesosmaster:5050")
  .setAppName("Batch Application")
  .set("spark.executor.uri", "Location to Spark binaries (Http, S3, or HDFS)")
val sc = new SparkContext(sparkConf)

If you are using spark-submit, you can configure the URI in the conf/spark-defaults.conf file using spark.executor.uri.

Interactive applications

When you are running one of the provided Spark shells for interactive querying, you can pass the master argument, for example:
./bin/spark-shell --master mesos://mesosmaster:5050

Cluster mode

Just as with YARN, you can run Spark on Mesos in cluster mode, which means the driver is launched inside the cluster, and the client can disconnect after submitting the application and get the results from the Mesos web UI.

Steps to use the cluster mode

Start the MesosClusterDispatcher in your cluster:
./sbin/start-mesos-dispatcher.sh --master mesos://mesosmaster:5050
This will generally start the dispatcher at port 7077.
From the client, submit a job to the Mesos cluster with spark-submit, specifying the dispatcher URL. For example:

./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master mesos://dispatcher:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 2G \
  --total-executor-cores 10 \
  s3n://path/to/examples.jar

Like Spark itself, Spark on Mesos has many properties that can be set to optimize processing. Refer to the Spark Configuration page (http://spark.apache.org/docs/latest/configuration.html) for more information.

Mesos run modes

Spark can run on Mesos in two modes:
Coarse-grained (default mode): Spark acquires a long-running Mesos task on each machine. This offers a much lower startup cost, but the resources remain allocated to Spark for the complete duration of the application.
Fine-grained (deprecated): A separate Mesos task is created for each Spark task. The benefit is that each application receives cores according to its requirements, but the additional bootstrapping overhead might act as a deterrent for interactive applications.

Key Spark on Mesos configuration properties

While Spark has a number of properties that can be configured to optimize Spark processing, some of these properties are specific to Mesos. We'll look at a few of those key properties here.

Property name | Meaning / default value
spark.mesos.coarse | Setting it to true (the default) runs Spark on Mesos in coarse-grained mode. Setting it to false runs it in fine-grained mode.
spark.mesos.extra.cores | This is more of an advertisement than an allocation, intended to improve parallelism. An executor will pretend that it has extra cores, resulting in the driver sending it more work. Default: 0.
spark.mesos.mesosExecutor.cores | Only works in fine-grained mode. This specifies how many cores should be given to each Mesos executor.
spark.mesos.executor.home | Identifies the directory of the Spark installation for the executors in Mesos. As discussed, you can point the executors at Spark using spark.executor.uri; if you have not set that, you can specify the location using this property.
spark.mesos.executor.memoryOverhead | The amount of additional memory (in MB) to be allocated per executor.
spark.mesos.uris | A comma-separated list of URIs to be downloaded when the driver or executor is launched by Mesos.
spark.mesos.principal | The name of the principal used by Spark to authenticate itself with Mesos.

You can find other configuration properties on the Spark documentation page (http://spark.apache.org/docs/latest/running-on-mesos.html#spark-properties).

To summarize, this article covered what you need to get started with running Spark on Mesos. To learn more about Spark SQL, Spark Streaming, and machine learning with Spark, refer to the book Learning Apache Spark 2.
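The two master URL shapes from the Running in Mesos section, and the way they feed into a spark-submit call, can be sketched as follows. This is only an illustration: the helper names mesos_master_url and spark_submit_args are hypothetical, the default ZooKeeper path /mesos is taken from the example URL in the article, and the host names are the article's placeholders.

```python
def mesos_master_url(hosts, zookeeper=False, zk_path="/mesos"):
    """Build a Spark master URL for Mesos (hypothetical helper).

    hosts: list of "host:port" strings.
    zookeeper: True when the hosts are ZooKeeper nodes fronting several masters.
    """
    if not hosts:
        raise ValueError("at least one host is required")
    if zookeeper:
        # Multiple masters: mesos://zk://host1:2181,host2:2181/mesos
        return "mesos://zk://" + ",".join(hosts) + zk_path
    if len(hosts) != 1:
        raise ValueError("a direct master URL takes exactly one host")
    # Single master: mesos://host:5050
    return "mesos://" + hosts[0]


def spark_submit_args(master_url, main_class, app_jar, deploy_mode="cluster"):
    """Assemble a spark-submit argument list (flags as used in the article)."""
    return [
        "./bin/spark-submit",
        "--class", main_class,
        "--master", master_url,
        "--deploy-mode", deploy_mode,
        app_jar,
    ]


single = mesos_master_url(["sparkmaster:5050"])
multi = mesos_master_url(["master1:2181", "master2:2181"], zookeeper=True)
print(single)
print(multi)
print(" ".join(spark_submit_args("mesos://dispatcher:7077",
                                 "org.apache.spark.examples.SparkPi",
                                 "s3n://path/to/examples.jar")))
```

Keeping the ZooKeeper case behind an explicit flag avoids guessing the URL shape from the number of hosts, since a ZooKeeper ensemble can also consist of a single node.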

Packt
07 Jul 2016
12 min read

Recommendation Systems

In this article, Pradeepta Mishra, the author of R Data Mining Blueprints, notes that in the age of the Internet, not everything available online is useful for everyone. Different companies and entities use different approaches to find relevant content for their audiences. People started building algorithms to construct a relevance score, and based on that score, recommendations can be built and suggested to users. Examples from day-to-day life are everywhere: every time I look at an image on Google, 3-4 other images are recommended to me. Every time I look for some videos on YouTube, 10 more videos are recommended to me. Every time I visit Amazon to buy some products, 5-6 products are recommended to me. And every time I read one blog or article, a few more articles and blogs are recommended to me. This is evidence of algorithmic forces at play, recommending certain things based on users' preferences or choices, since users' time is precious and the content available over the Internet is practically unlimited. Hence, a recommendation engine helps organizations customize their offerings based on user preferences so that users need not spend time exploring to find what they need. In this article, the reader will learn how to implement product recommendation using R.

Practical project

The dataset contains a sample of 5000 users from the anonymous ratings data from the Jester Online Joke Recommender System collected between April 1999 and May 2003 (Goldberg, Roeder, Gupta, and Perkins 2001). The dataset contains ratings for 100 jokes on a scale from -10 to 10. All users in the dataset have rated 36 or more jokes.
Let's load the recommenderlab library and the Jester5K dataset: > library("recommenderlab") > data(Jester5k) > Jester5k@data@Dimnames[2] [[1]] [1] "j1" "j2" "j3" "j4" "j5" "j6" "j7" "j8" "j9" [10] "j10" "j11" "j12" "j13" "j14" "j15" "j16" "j17" "j18" [19] "j19" "j20" "j21" "j22" "j23" "j24" "j25" "j26" "j27" [28] "j28" "j29" "j30" "j31" "j32" "j33" "j34" "j35" "j36" [37] "j37" "j38" "j39" "j40" "j41" "j42" "j43" "j44" "j45" [46] "j46" "j47" "j48" "j49" "j50" "j51" "j52" "j53" "j54" [55] "j55" "j56" "j57" "j58" "j59" "j60" "j61" "j62" "j63" [64] "j64" "j65" "j66" "j67" "j68" "j69" "j70" "j71" "j72" [73] "j73" "j74" "j75" "j76" "j77" "j78" "j79" "j80" "j81" [82] "j82" "j83" "j84" "j85" "j86" "j87" "j88" "j89" "j90" [91] "j91" "j92" "j93" "j94" "j95" "j96" "j97" "j98" "j99" [100] "j100" The following image shows the distribution of real ratings given by 2000 users. > data<-sample(Jester5k,2000) > hist(getRatings(data),breaks=100,col="blue") The input dataset contains the individual ratings; the normalization function reduces the individual rating bias by centering the row (which is a standard z-score transformation), subtracting each element from the mean, and then dividing by standard deviation. The following graph shows normalized ratings for the preceding dataset: > hist(getRatings(normalize(data)),breaks=100,col="blue4") To create a recommender system: A recommendation engine is created using the recommender() function. A new recommendation algorithm can be added by the user using the recommenderRegistry$get_entries() function: > recommenderRegistry$get_entries(dataType = "realRatingMatrix") $IBCF_realRatingMatrix Recommender method: IBCF Description: Recommender based on item-based collaborative filtering (real data). Parameters: k method normalize normalize_sim_matrix alpha na_as_zero minRating 1 30 Cosine center FALSE 0.5 FALSE NA $POPULAR_realRatingMatrix Recommender method: POPULAR Description: Recommender based on item popularity (real data). 
Parameters: None $RANDOM_realRatingMatrix Recommender method: RANDOM Description: Produce random recommendations (real ratings). Parameters: None $SVD_realRatingMatrix Recommender method: SVD Description: Recommender based on SVD approximation with column-mean imputation (real data). Parameters: k maxiter normalize minRating 1 10 100 center NA $SVDF_realRatingMatrix Recommender method: SVDF Description: Recommender based on Funk SVD with gradient descend (real data). Parameters: k gamma lambda min_epochs max_epochs min_improvement normalize 1 10 0.015 0.001 50 200 1e-06 center minRating verbose 1 NA FALSE $UBCF_realRatingMatrix Recommender method: UBCF Description: Recommender based on user-based collaborative filtering (real data). Parameters: method nn sample normalize minRating 1 cosine 25 FALSE center NA The preceding registry command helps in identifying the methods available in the recommenderlab parameters for the model. There are six different methods for implementing recommender systems, such as popular, item-based, user-based, PCA, random, and SVD. Let's start the recommendation engine using the popular method: > rc <- Recommender(Jester5k, method = "POPULAR") > rc Recommender of type 'POPULAR' for 'realRatingMatrix' learned using 5000 users. > names(getModel(rc)) [1] "topN" "ratings" [3] "minRating" "normalize" [5] "aggregationRatings" "aggregationPopularity" [7] "minRating" "verbose" > getModel(rc)$topN Recommendations as 'topNList' with n = 100 for 1 users. The objects such as top N, verbose, aggregation popularity, and so on, can be printed using names of the getmodel()command: recom <- predict(rc, Jester5k, n=5) recom To generate a recommendation, we can use the predict function against the same dataset and validate the accuracy of the predictive model. Here we are generating the top 5 recommended jokes to each of the users. 
The result of the prediction is as follows: > head(as(recom,"list")) $u2841 [1] "j89" "j72" "j76" "j88" "j83" $u15547 [1] "j89" "j93" "j76" "j88" "j91" $u15221 character(0) $u15573 character(0) $u21505 [1] "j89" "j72" "j93" "j76" "j88" $u15994 character(0) For the same Jester5K dataset, let's try to implement item-based collaborative filtering (IBCF): > rc <- Recommender(Jester5k, method = "IBCF") > rc Recommender of type 'IBCF' for 'realRatingMatrix' learned using 5000 users. > recom <- predict(rc, Jester5k, n=5) > recom Recommendations as 'topNList' with n = 5 for 5000 users. > head(as(recom,"list")) $u2841 [1] "j85" "j86" "j74" "j84" "j80" $u15547 [1] "j91" "j87" "j88" "j89" "j93" $u15221 character(0) $u15573 character(0) $u21505 [1] "j78" "j80" "j73" "j77" "j92" $u15994 character(0) The Principal component analysis (PCA) method is not applicable for real-rating-based datasets; this is because getting a correlation matrix and subsequent eigenvector and eigenvalue calculations would not be accurate. Hence we will not show its application. Next we are going to show how the random method works: > rc <- Recommender(Jester5k, method = "RANDOM") > rc Recommender of type 'RANDOM' for 'ratingMatrix' learned using 5000 users. > recom <- predict(rc, Jester5k, n=5) > recom Recommendations as 'topNList' with n = 5 for 5000 users. > head(as(recom,"list")) [[1]] [1] "j90" "j74" "j86" "j78" "j85" [[2]] [1] "j87" "j88" "j74" "j92" "j79" [[3]] character(0) [[4]] character(0) [[5]] [1] "j95" "j86" "j93" "j78" "j83" [[6]] character(0) In the recommendation engine, the SVD approach is used to predict the missing ratings so that a recommendation can be generated. Using the singular value decomposition (SVD) method, the following recommendation can be generated: > rc <- Recommender(Jester5k, method = "SVD") > rc Recommender of type 'SVD' for 'realRatingMatrix' learned using 5000 users. 
> recom <- predict(rc, Jester5k, n=5) > recom Recommendations as 'topNList' with n = 5 for 5000 users. > head(as(recom,"list")) $u2841 [1] "j74" "j71" "j84" "j79" "j80" $u15547 [1] "j89" "j93" "j76" "j81" "j88" $u15221 character(0) $u15573 character(0) $u21505 [1] "j80" "j73" "j100" "j72" "j78" $u15994 character(0) The result from user-based collaborative filtering is shown as follows: > rc <- Recommender(Jester5k, method = "UBCF") > rc Recommender of type 'UBCF' for 'realRatingMatrix' learned using 5000 users. > recom <- predict(rc, Jester5k, n=5) > recom Recommendations as 'topNList' with n = 5 for 5000 users. > head(as(recom,"list")) $u2841 [1] "j81" "j78" "j83" "j80" "j73" $u15547 [1] "j96" "j87" "j89" "j76" "j93" $u15221 character(0) $u15573 character(0) $u21505 [1] "j100" "j81" "j83" "j92" "j96" $u15994 character(0) Now let's compare the results obtained from all the five different algorithms except PCA (because PCA requires a binary dataset; it does not accept a real ratings matrix). 
Table 4: Comparison of results between different recommendation algorithms Popular IBCF Random method SVD UBCF > head(as(recom,"list")) > head(as(recom,"list")) > head(as(recom,"list")) > head(as(recom,"list")) > head(as(recom,"list")) $u2841 $u2841 [[1]] $u2841 $u2841 [1] "j89" "j72" "j76" "j88" "j83" [1] "j85" "j86" "j74" "j84" "j80" [1] "j90" "j74" "j86" "j78" "j85" [1] "j74" "j71" "j84" "j79" "j80" [1] "j81" "j78" "j83" "j80" "j73"           $u15547 $u15547 [[2]] $u15547 $u15547 [1] "j89" "j93" "j76" "j88" "j91" [1] "j91" "j87" "j88" "j89" "j93" [1] "j87" "j88" "j74" "j92" "j79" [1] "j89" "j93" "j76" "j81" "j88" [1] "j96" "j87" "j89" "j76" "j93"           $u15221 $u15221 [[3]] $u15221 $u15221 character(0) character(0) character(0) character(0) character(0)           $u15573 $u15573 [[4]] $u15573 $u15573 character(0) character(0) character(0) character(0) character(0)           $u21505 $u21505 [[5]] $u21505 $u21505 [1] "j89" "j72" "j93" "j76" "j88" [1] "j78" "j80" "j73" "j77" "j92" [1] "j95" "j86" "j93" "j78" "j83" [1] "j80"   "j73" "j100" "j72" "j78" [1] "j100" "j81" "j83" "j92" "j96"           $u15994 $u15994 [[6]] $u15994 $u15994 character(0) character(0) character(0) character(0) character(0)             One thing is clear from the above table. For users 15573 and 15221, none of the five methods generate recommendation. Hence it is important to look at methods to evaluate the recommendation results. To validate the accuracy of the model, let's implement accuracy measures and compare the accuracies of all the models. For the evaluation of the model results, the dataset is divided into 90% for training and 10% for testing the algorithm. The definition of a good rating is updated as 5: > e <- evaluationScheme(Jester5k, method="split", + train=0.9,given=15, goodRating=5) > e Evaluation scheme with 15 items given Method: 'split' with 1 run(s). 
Training set proportion: 0.900 Good ratings: >=5.000000 Data set: 5000 x 100 rating matrix of class 'realRatingMatrix' with 362106 ratings. The following script is used to build the collaborative filtering model and apply it on a new dataset for predicting the ratings. Then the prediction accuracy is computed. The error matrix is shown as follows: > #User based collaborative filtering > r1 <- Recommender(getData(e, "train"), "UBCF") > #Item based collaborative filtering > r2 <- Recommender(getData(e, "train"), "IBCF") > #PCA based collaborative filtering > #r3 <- Recommender(getData(e, "train"), "PCA") > #POPULAR based collaborative filtering > r4 <- Recommender(getData(e, "train"), "POPULAR") > #RANDOM based collaborative filtering > r5 <- Recommender(getData(e, "train"), "RANDOM") > #SVD based collaborative filtering > r6 <- Recommender(getData(e, "train"), "SVD") > #Predicted Ratings > p1 <- predict(r1, getData(e, "known"), type="ratings") > p2 <- predict(r2, getData(e, "known"), type="ratings") > #p3 <- predict(r3, getData(e, "known"), type="ratings") > p4 <- predict(r4, getData(e, "known"), type="ratings") > p5 <- predict(r5, getData(e, "known"), type="ratings") > p6 <- predict(r6, getData(e, "known"), type="ratings") > #calculate the error between the prediction and > #the unknown part of the test data > error <- rbind( + calcPredictionAccuracy(p1, getData(e, "unknown")), + calcPredictionAccuracy(p2, getData(e, "unknown")), + #calcPredictionAccuracy(p3, getData(e, "unknown")), + calcPredictionAccuracy(p4, getData(e, "unknown")), + calcPredictionAccuracy(p5, getData(e, "unknown")), + calcPredictionAccuracy(p6, getData(e, "unknown")) + ) > rownames(error) <- c("UBCF","IBCF","POPULAR","RANDOM","SVD") > error RMSE MSE MAE UBCF 4.485571 20.12034 3.511709 IBCF 4.606355 21.21851 3.466738 POPULAR 4.509973 20.33985 3.548478 RANDOM 7.917373 62.68480 6.464369 SVD 4.653111 21.65144 3.679550 From the preceding result, UBCF has the lowest error in comparison to other 
recommendation methods. Here, to evaluate the results of the predictive model, we are using the k-fold cross-validation method. k is assumed to have been taken as 4: > #Evaluation of a top-N recommender algorithm > scheme <- evaluationScheme(Jester5k, method="cross", k=4, + given=3,goodRating=5) > scheme Evaluation scheme with 3 items given Method: 'cross-validation' with 4 run(s). Good ratings: >=5.000000 Data set: 5000 x 100 rating matrix of class 'realRatingMatrix' with 362106 ratings. The result of the models from the evaluation scheme shows the runtime versus prediction time by different cross-validation results for different models. The result is shown as follows: > results <- evaluate(scheme, method="POPULAR", n=c(1,3,5,10,15,20)) POPULAR run fold/sample [model time/prediction time] 1 [0.14sec/2.27sec] 2 [0.16sec/2.2sec] 3 [0.14sec/2.24sec] 4 [0.14sec/2.23sec] > results <- evaluate(scheme, method="IBCF", n=c(1,3,5,10,15,20)) IBCF run fold/sample [model time/prediction time] 1 [0.4sec/0.38sec] 2 [0.41sec/0.37sec] 3 [0.42sec/0.38sec] 4 [0.43sec/0.37sec] > results <- evaluate(scheme, method="UBCF", n=c(1,3,5,10,15,20)) UBCF run fold/sample [model time/prediction time] 1 [0.13sec/6.31sec] 2 [0.14sec/6.47sec] 3 [0.15sec/6.21sec] 4 [0.13sec/6.18sec] > results <- evaluate(scheme, method="RANDOM", n=c(1,3,5,10,15,20)) RANDOM run fold/sample [model time/prediction time] 1 [0sec/0.27sec] 2 [0sec/0.26sec] 3 [0sec/0.27sec] 4 [0sec/0.26sec] > results <- evaluate(scheme, method="SVD", n=c(1,3,5,10,15,20)) SVD run fold/sample [model time/prediction time] 1 [0.36sec/0.36sec] 2 [0.35sec/0.36sec] 3 [0.33sec/0.36sec] 4 [0.36sec/0.36sec] The confusion matrix displays the level of accuracy provided by each of the models. 
We can estimate accuracy measures such as precision, recall, TPR, and FPR; the result is shown here:

> getConfusionMatrix(results)[[1]]
       TP      FP      FN      TN precision     recall        TPR         FPR
1  0.2736  0.7264 17.2968 78.7032 0.2736000 0.01656597 0.01656597 0.008934588
3  0.8144  2.1856 16.7560 77.2440 0.2714667 0.05212659 0.05212659 0.027200530
5  1.3120  3.6880 16.2584 75.7416 0.2624000 0.08516269 0.08516269 0.046201487
10 2.6056  7.3944 14.9648 72.0352 0.2605600 0.16691259 0.16691259 0.092274243
15 3.7768 11.2232 13.7936 68.2064 0.2517867 0.24036802 0.24036802 0.139945095
20 4.8136 15.1864 12.7568 64.2432 0.2406800 0.30082509 0.30082509 0.189489883

Association rules are another method for a recommendation engine, used for building product recommendations in a retail/e-commerce scenario.

Summary

In this article, we discussed ways of recommending products to users based on similarities in their purchase patterns, content, item-to-item comparison, and so on. As far as accuracy is concerned, user-based collaborative filtering gave the lowest RMSE here when a real-rating matrix was used as the input. Even so, choosing a method for a specific use case is difficult, so it is recommended to apply all six methods. The best one should be selected automatically, and the recommendations should also be updated automatically.
Packt
30 Mar 2010
6 min read

Slowly Changing Dimension (SCD) Type 6

The Example

We will apply SCDs to maintain the history of the Product dimension, specifically the history of changes to a product's product group. The PRODUCT_SK column is the surrogate key of the Product dimension table.

PRODUCT_SK | PRODUCT_CODE | PRODUCT_NAME | PRODUCT_GROUP_CODE | PRODUCT_GROUP_NAME
1 | 11 | PENCIL | 1 | WRITING SUPPLY
2 | 22 | PEN | 1 | WRITING SUPPLY
3 | 33 | TONER | 2 | PRINTING SUPPLY
4 | 44 | NOTEBOOK | 4 | NON ELECTRONIC SUPPLY

SCD Type 1

We will apply SCD Type 1 to the PENCIL product in the Product dimension table. Let's say PENCIL changes its product group to 4. Applying SCD Type 1 simply updates the product group on the existing PENCIL row. We keep no record of its previous product group; in other words, we do not maintain its product group history. The PENCIL row (PRODUCT_SK 1) now shows the updated product group.

PRODUCT_SK | PRODUCT_CODE | PRODUCT_NAME | PRODUCT_GROUP_CODE | PRODUCT_GROUP_NAME
1 | 11 | PENCIL | 4 | NON ELECTRONIC SUPPLY
2 | 22 | PEN | 1 | WRITING SUPPLY
3 | 33 | TONER | 2 | PRINTING SUPPLY
4 | 44 | NOTEBOOK | 4 | NON ELECTRONIC SUPPLY

SCD Type 2

SCD Type 2 is essentially the opposite of Type 1. When we apply SCD Type 2, we never update or delete any existing product group. To apply SCD Type 2 we need an effective date and an expiry date. An expiry date of 31-Dec-99 means the row is not expired; it is the most current version of the product.

PRODUCT_SK | PRODUCT_CODE | PRODUCT_NAME | PRODUCT_GROUP_CODE | PRODUCT_GROUP_NAME | EFFECTIVE_DATE | EXPIRY_DATE
1 | 11 | PENCIL | 1 | WRITING SUPPLY | 1-Jan-09 | 31-Dec-99
2 | 22 | PEN | 1 | WRITING SUPPLY | 1-Jan-09 | 31-Dec-99
3 | 33 | TONER | 2 | PRINTING SUPPLY | 1-Jan-09 | 31-Dec-99
4 | 44 | NOTEBOOK | 4 | NON ELECTRONIC SUPPLY | 1-Jan-09 | 31-Dec-99

Assuming the product group change of PENCIL is effective 1 April 2010, we update the expiry date of its existing row to 31 March 2010, one day before the effective date of the change, and insert a new row that represents its new, current version.
PRODUCT_SK | PRODUCT_CODE | PRODUCT_NAME | PRODUCT_GROUP_CODE | PRODUCT_GROUP_NAME | EFFECTIVE_DATE | EXPIRY_DATE
1 | 11 | PENCIL | 1 | WRITING SUPPLY | 1-Jan-09 | 31-Mar-10
2 | 22 | PEN | 1 | WRITING SUPPLY | 1-Jan-09 | 31-Dec-99
3 | 33 | TONER | 2 | PRINTING SUPPLY | 1-Jan-09 | 31-Dec-99
4 | 44 | NOTEBOOK | 4 | NON ELECTRONIC SUPPLY | 1-Jan-09 | 31-Dec-99
5 | 11 | PENCIL | 4 | NON ELECTRONIC SUPPLY | 1-Apr-10 | 31-Dec-99

SCD Type 3

With SCD Type 3 we maintain history, but in one record only. We have one column for each version of the product group, so you need as many columns as the number of versions you want to keep. One of the most common SCD Type 3 applications is to maintain two versions of the product group: the original version and the current version. When there has been no product group change yet, the current product group is the same as the original product group.

PRODUCT_SK | PRODUCT_CODE | PRODUCT_NAME | PRODUCT_GROUP_CODE | PRODUCT_GROUP_NAME | EFFECTIVE_DATE | EXPIRY_DATE | CURRENT_PRODUCT_GROUP_CODE | CURRENT_PRODUCT_GROUP_NAME
1 | 11 | PENCIL | 1 | WRITING SUPPLY | 1-Jan-09 | 31-Dec-99 | 1 | WRITING SUPPLY
2 | 22 | PEN | 1 | WRITING SUPPLY | 1-Jan-09 | 31-Dec-99 | 1 | WRITING SUPPLY
3 | 33 | TONER | 2 | PRINTING SUPPLY | 1-Jan-09 | 31-Dec-99 | 2 | PRINTING SUPPLY
4 | 44 | NOTEBOOK | 4 | NON ELECTRONIC SUPPLY | 1-Jan-09 | 31-Dec-99 | 4 | NON ELECTRONIC SUPPLY

When the pencil's product group changes, let's say on 1 April 2010, we expire its original product group by changing the expiry date to a day earlier (31 March 2010), and replace its current product group with the new product group.
PRODUCT_SK | PRODUCT_CODE | PRODUCT_NAME | PRODUCT_GROUP_CODE | PRODUCT_GROUP_NAME | EFFECTIVE_DATE | EXPIRY_DATE | CURRENT_PRODUCT_GROUP_CODE | CURRENT_PRODUCT_GROUP_NAME
1 | 11 | PENCIL | 1 | WRITING SUPPLY | 1-Jan-09 | 31-Mar-10 | 4 | NON ELECTRONIC SUPPLY
2 | 22 | PEN | 1 | WRITING SUPPLY | 1-Jan-09 | 31-Dec-99 | 1 | WRITING SUPPLY
3 | 33 | TONER | 2 | PRINTING SUPPLY | 1-Jan-09 | 31-Dec-99 | 2 | PRINTING SUPPLY
4 | 44 | NOTEBOOK | 4 | NON ELECTRONIC SUPPLY | 1-Jan-09 | 31-Dec-99 | 4 | NON ELECTRONIC SUPPLY

When its product group changes again in the future, we will replace just the current product group with the new product group. The expiry date does not change; it is updated only once, the first time the product group changes.

SCD Type 6

SCD Type 6 combines the three basic types. Type 6 is particularly applicable if you want to maintain complete history and would also like an easy way to work with the current version. Let's apply Type 6 instead of Type 3 only. We have already applied Type 3 by having two versions of the product group. When the pencil's product group changes, we update its existing current product group (that is the Type 1 update). We also apply Type 2 by adding a new row.

PRODUCT_SK | PRODUCT_CODE | PRODUCT_NAME | PRODUCT_GROUP_CODE | PRODUCT_GROUP_NAME | EFFECTIVE_DATE | EXPIRY_DATE | CURRENT_PRODUCT_GROUP_CODE | CURRENT_PRODUCT_GROUP_NAME
1 | 11 | PENCIL | 1 | WRITING SUPPLY | 1-Jan-09 | 31-Mar-10 | 4 | NON ELECTRONIC SUPPLY
2 | 22 | PEN | 1 | WRITING SUPPLY | 1-Jan-09 | 31-Dec-99 | 1 | WRITING SUPPLY
3 | 33 | TONER | 2 | PRINTING SUPPLY | 1-Jan-09 | 31-Dec-99 | 2 | PRINTING SUPPLY
4 | 44 | NOTEBOOK | 4 | NON ELECTRONIC SUPPLY | 1-Jan-09 | 31-Dec-99 | 4 | NON ELECTRONIC SUPPLY
5 | 11 | PENCIL | 4 | NON ELECTRONIC SUPPLY | 1-Apr-10 | 31-Dec-99 | 4 | NON ELECTRONIC SUPPLY

On the pencil's next product group change (1 July 2010), we will again apply all three SCD types.
PRODUCT_SK | PRODUCT_CODE | PRODUCT_NAME | PRODUCT_GROUP_CODE | PRODUCT_GROUP_NAME | EFFECTIVE_DATE | EXPIRY_DATE | CURRENT_PRODUCT_GROUP_CODE | CURRENT_PRODUCT_GROUP_NAME
1 | 11 | PENCIL | 1 | WRITING SUPPLY | 1-Jan-09 | 31-Mar-10 | 5 | LEGACY SUPPLY
2 | 22 | PEN | 1 | WRITING SUPPLY | 1-Jan-09 | 31-Dec-99 | 1 | WRITING SUPPLY
3 | 33 | TONER | 2 | PRINTING SUPPLY | 1-Jan-09 | 31-Dec-99 | 2 | PRINTING SUPPLY
4 | 44 | NOTEBOOK | 4 | NON ELECTRONIC SUPPLY | 1-Jan-09 | 31-Dec-99 | 4 | NON ELECTRONIC SUPPLY
5 | 11 | PENCIL | 4 | NON ELECTRONIC SUPPLY | 1-Apr-10 | 30-Jun-10 | 5 | LEGACY SUPPLY
6 | 11 | PENCIL | 5 | LEGACY SUPPLY | 1-Jul-10 | 31-Dec-99 | 5 | LEGACY SUPPLY

QUERY

Let's next see how Type 6 in the Product dimension works with a sales fact. (In real sales fact data you will have other dimensions as well, meaning the fact table will have more surrogate key columns than just the product surrogate key.) If our interest is in the current version, our SQL query will use the current product group column. An example SQL query looks like this:

SELECT current_product_group_name, SUM(sales_amt)
FROM sales_fact s, product_dim p
WHERE s.product_sk = p.product_sk
AND product_name = 'PENCIL'
GROUP BY current_product_group_name

Because every PENCIL row carries the same current product group (LEGACY SUPPLY), the query returns a single row with the total PENCIL sales.

The reason for applying SCD Type 2 is to have a complete history that tracks changes. SQL queries that take dimension history into account use the product group column:

SELECT product_group_name, SUM(sales_amt)
FROM sales_fact s, product_dim p
WHERE s.product_sk = p.product_sk
AND product_name = 'PENCIL'
GROUP BY product_group_name

This query returns up to one row per historical product group (WRITING SUPPLY, NON ELECTRONIC SUPPLY, and LEGACY SUPPLY), each with the sales recorded while that group was in effect.

SUMMARY

This article discusses what SCD Type 6 is, when to apply it, and how it works. The name Type 6 comes from the 'sum' of the three basic SCD types (6 = 1 + 2 + 3).
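The Type 6 mechanics described above can be tried end to end with a small in-memory database. The following is a minimal sketch using Python's built-in sqlite3 module, not the article's own implementation: the function names, the ISO date strings, the 9999-12-31 open-ended expiry marker (standing in for 31-Dec-99), and the sales amounts are all assumptions made for illustration. Only the PENCIL product is loaded, so its surrogate keys come out as 1, 2, and 3 rather than 1, 5, and 6.

```python
import sqlite3
from datetime import date, timedelta

OPEN_END = "9999-12-31"  # assumed marker for "not expired" (article: 31-Dec-99)

def create_tables(conn):
    conn.execute("""CREATE TABLE product_dim (
        product_sk INTEGER PRIMARY KEY AUTOINCREMENT,
        product_code INTEGER, product_name TEXT,
        product_group_code INTEGER, product_group_name TEXT,
        effective_date TEXT, expiry_date TEXT,
        current_product_group_code INTEGER,
        current_product_group_name TEXT)""")
    conn.execute("CREATE TABLE sales_fact (product_sk INTEGER, sales_amt REAL)")

def add_version(conn, code, name, group_code, group_name, effective):
    # A new row starts open-ended and is its own "current" version.
    conn.execute(
        "INSERT INTO product_dim (product_code, product_name, "
        "product_group_code, product_group_name, effective_date, expiry_date, "
        "current_product_group_code, current_product_group_name) "
        "VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (code, name, group_code, group_name, effective, OPEN_END,
         group_code, group_name))

def apply_type6_change(conn, code, group_code, group_name, change_date):
    day_before = (date.fromisoformat(change_date) - timedelta(days=1)).isoformat()
    name = conn.execute(
        "SELECT product_name FROM product_dim WHERE product_code = ? LIMIT 1",
        (code,)).fetchone()[0]
    # Type 2: expire the open row, then insert the new version.
    conn.execute(
        "UPDATE product_dim SET expiry_date = ? "
        "WHERE product_code = ? AND expiry_date = ?",
        (day_before, code, OPEN_END))
    add_version(conn, code, name, group_code, group_name, change_date)
    # Type 1 on the Type 3 columns: overwrite "current" on every version.
    conn.execute(
        "UPDATE product_dim SET current_product_group_code = ?, "
        "current_product_group_name = ? WHERE product_code = ?",
        (group_code, group_name, code))

conn = sqlite3.connect(":memory:")
create_tables(conn)
add_version(conn, 11, "PENCIL", 1, "WRITING SUPPLY", "2009-01-01")
apply_type6_change(conn, 11, 4, "NON ELECTRONIC SUPPLY", "2010-04-01")
apply_type6_change(conn, 11, 5, "LEGACY SUPPLY", "2010-07-01")

# Hypothetical sales, one per PENCIL version (surrogate keys 1, 2, 3).
conn.executemany("INSERT INTO sales_fact VALUES (?, ?)",
                 [(1, 100.0), (2, 50.0), (3, 25.0)])

versions = conn.execute(
    "SELECT product_sk, product_group_name, effective_date, expiry_date, "
    "current_product_group_name FROM product_dim ORDER BY product_sk").fetchall()

current_totals = conn.execute(
    "SELECT p.current_product_group_name, SUM(s.sales_amt) "
    "FROM sales_fact s JOIN product_dim p ON s.product_sk = p.product_sk "
    "WHERE p.product_name = 'PENCIL' "
    "GROUP BY p.current_product_group_name").fetchall()

history_totals = conn.execute(
    "SELECT p.product_group_name, SUM(s.sales_amt) "
    "FROM sales_fact s JOIN product_dim p ON s.product_sk = p.product_sk "
    "WHERE p.product_name = 'PENCIL' "
    "GROUP BY p.product_group_name ORDER BY p.product_group_name").fetchall()

print(versions)
print(current_totals)
print(history_totals)
```

Grouping by the current column collapses all PENCIL sales into one LEGACY SUPPLY row, while grouping by the historical column splits them across the three product groups, which is the practical difference between the two queries in the QUERY section.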

Pravin Dhandre
05 Mar 2018
8 min read

How to Compute Interpolation in SciPy

Note: This article is an excerpt from the book SciPy Recipes, co-authored by L. Felipe Martins, Ruben Oliva Ramos and V Kishore Ayyadevara. The book provides numerous recipes for mastering common tasks related to SciPy and associated libraries such as NumPy, pandas, and matplotlib.

In today's tutorial, we will see how to compute polynomial and univariate interpolations using SciPy, with detailed steps and instructions. In this recipe, we will look at how to compute a polynomial interpolation of data by applying the methods discussed in detail in the How to do it... section.

Getting ready
We will need to follow some instructions and install the prerequisites.

How to do it…
Let's get started. In the following steps, we will explain how to compute a polynomial interpolation and the things we need to know. The parameters listed here match those of SciPy's griddata function (scipy.interpolate.griddata):

points: ndarray of floats, shape (n, D). Data point coordinates. It can be either an array of shape (n, D) or a tuple of ndim arrays.
values: ndarray of float or complex, shape (n,). Data values.
xi: 2D ndarray of float or tuple of 1D arrays, shape (M, D). Points at which to interpolate data.
method: {'linear', 'nearest', 'cubic'}; optional. The method of interpolation, one of the following:
nearest: Returns the value at the data point closest to the point of interpolation. See NearestNDInterpolator for more details.
linear: Tessellates the input point set to n-dimensional simplices, and interpolates linearly on each simplex. See LinearNDInterpolator for more details.
cubic (1D): Returns the value determined from a cubic spline.
cubic (2D): Returns the value determined from a piecewise cubic, continuously differentiable (C1), and approximately curvature-minimizing polynomial surface. See CloughTocher2DInterpolator for more details.
fill_value: float; optional.
It is the value used to fill in for requested points outside of the convex hull of the input points. If it is not provided, then the default is nan. This option has no effect on the nearest method.
rescale: bool; optional. Rescale points to the unit cube before performing interpolation. This is useful if some of the input dimensions have non-commensurable units and differ by many orders of magnitude.

How it works…
One can see that the exact result is reproduced by all of the methods to some degree, but for this smooth function, the piecewise cubic interpolant gives the best results. The following code renders one random 4 x 4 grid with each of matplotlib's imshow interpolation methods, so that they can be compared side by side:

    import matplotlib.pyplot as plt
    import numpy as np

    methods = [None, 'none', 'nearest', 'bilinear', 'bicubic', 'spline16',
               'spline36', 'hanning', 'hamming', 'hermite', 'kaiser', 'quadric',
               'catrom', 'gaussian', 'bessel', 'mitchell', 'sinc', 'lanczos']

    # Fixing random state for reproducibility
    np.random.seed(19680801)
    grid = np.random.rand(4, 4)

    # One subplot per interpolation method, with axis ticks hidden
    fig, axes = plt.subplots(3, 6, figsize=(12, 6),
                             subplot_kw={'xticks': [], 'yticks': []})
    fig.subplots_adjust(hspace=0.3, wspace=0.05)

    for ax, interp_method in zip(axes.flat, methods):
        ax.imshow(grid, interpolation=interp_method, cmap='viridis')
        ax.set_title(str(interp_method))

    plt.show()

This is the result of the execution:

Univariate interpolation
In the next section, we will look at how to solve univariate interpolation.

Getting ready
We will need to follow some instructions and install the prerequisites.

How to do it…
The following table summarizes the different univariate interpolation modes coded in SciPy, together with the processes that we may use to resolve them:

Finding a cubic spline that interpolates a set of data
In this recipe, we will look at how to find a cubic spline that interpolates a set of data, using the CubicSpline class.

Getting ready
We will need to follow some instructions and install the prerequisites.

How to do it…
CubicSpline (from scipy.interpolate) takes the following parameters:

x: array_like, shape (n,).
A 1D array containing values of the independent variable. The values must be real, finite, and in strictly increasing order.
y: array_like. An array containing values of the dependent variable. It can have an arbitrary number of dimensions, but the length along axis must match the length of x. The values must be finite.
axis: int; optional. The axis along which y is assumed to be varying, meaning that for x[i], the corresponding values are np.take(y, i, axis=axis). The default is 0.
bc_type: string or two-tuple; optional. Boundary condition type. Two additional equations, given by the boundary conditions, are required to determine all coefficients of the polynomials on each segment. Refer to: https://docs.scipy.org/doc/scipy-0.19.1/reference/generated/scipy.interpolate.CubicSpline.html#r59.

If bc_type is a string, then the specified condition will be applied at both ends of the spline. The available conditions are:
not-a-knot (default): The first and second segment at a curve end are the same polynomial. This is a good default when there is no information about boundary conditions.
periodic: The interpolated function is assumed to be periodic with period x[-1] - x[0]. The first and last value of y must be identical: y[0] == y[-1]. This boundary condition will result in y'[0] == y'[-1] and y''[0] == y''[-1].
clamped: The first derivatives at the curve ends are zero. Assuming there is a 1D y, bc_type=((1, 0.0), (1, 0.0)) is the same condition.
natural: The second derivatives at the curve ends are zero. Assuming there is a 1D y, bc_type=((2, 0.0), (2, 0.0)) is the same condition.

If bc_type is a two-tuple, the first and the second value will be applied at the curve's start and end respectively. Each tuple value can be one of the previously mentioned strings (except periodic) or a tuple (order, deriv_values), allowing us to specify arbitrary derivatives at the curve ends:
order: The derivative order; it is 1 or 2.
deriv_value: An array_like containing derivative values.
The shape must be the same as y, excluding the axis dimension. For example, if y is 1D, then deriv_value must be a scalar. If y is 3D with shape (n0, n1, n2) and axis=2, then deriv_value must be 2D and have the shape (n0, n1).
extrapolate: {bool, 'periodic', None}; optional. If bool, it determines whether to extrapolate to out-of-bounds points based on the first and last intervals, or to return NaNs. If 'periodic', periodic extrapolation is used. If None (default), extrapolate is set to 'periodic' for bc_type='periodic' and to True otherwise.

How it works...
We have the following example:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.interpolate import CubicSpline

    # Interpolate ten samples of sin(x) with the default not-a-knot spline
    x = np.arange(10)
    y = np.sin(x)
    cs = CubicSpline(x, y)
    xs = np.arange(-0.5, 9.6, 0.1)

    # Plot the data, the true function, the spline S and its derivatives
    plt.figure(figsize=(6.5, 4))
    plt.plot(x, y, 'o', label='data')
    plt.plot(xs, np.sin(xs), label='true')
    plt.plot(xs, cs(xs), label="S")
    plt.plot(xs, cs(xs, 1), label="S'")
    plt.plot(xs, cs(xs, 2), label="S''")
    plt.plot(xs, cs(xs, 3), label="S'''")
    plt.xlim(-0.5, 9.5)
    plt.legend(loc='lower left', ncol=2)
    plt.show()

We can see the result here:

In the next example, a periodic spline interpolates points on the unit circle:

    theta = 2 * np.pi * np.linspace(0, 1, 5)
    y = np.c_[np.cos(theta), np.sin(theta)]
    cs = CubicSpline(theta, y, bc_type='periodic')
    print("ds/dx={:.1f} ds/dy={:.1f}".format(cs(0, 1)[0], cs(0, 1)[1]))
    # prints: ds/dx=0.0 ds/dy=1.0

    xs = 2 * np.pi * np.linspace(0, 1, 100)
    plt.figure(figsize=(6.5, 4))
    plt.plot(y[:, 0], y[:, 1], 'o', label='data')
    plt.plot(np.cos(xs), np.sin(xs), label='true')
    plt.plot(cs(xs)[:, 0], cs(xs)[:, 1], label='spline')
    plt.gca().set_aspect('equal')  # act on the current axes, not a new one
    plt.legend(loc='center')
    plt.show()

In the following screenshot, we can see the final result:

Defining a B-spline for a given set of control points
In the next section, we will look at how to define B-splines from a given set of control data.

Getting ready
We need to follow some instructions and install the prerequisites.
How to do it…
A univariate spline in the B-spline basis is defined as follows:

$S(x) = \sum_{j=0}^{n-1} c_j B_{j,k;t}(x)$

where $B_{j,k;t}$ are the B-spline basis functions of degree k and knots t.

We can use the following parameters:

How it works ...
Here, we construct a quadratic spline function on the base interval 2 <= x <= 4 and compare it with the naive way of evaluating the spline. The listing that follows uses splrep and splev: it fits a cubic spline through ten samples of sin(x), then fits a second, weighted least-squares spline with a single interior knot, and plots both:

    from scipy import interpolate
    import numpy as np
    import matplotlib.pyplot as plt

    # sampling
    x = np.linspace(0, 10, 10)
    y = np.sin(x)

    # spline through all the sampled points
    tck = interpolate.splrep(x, y)
    x2 = np.linspace(0, 10, 200)
    y2 = interpolate.splev(x2, tck)

    # spline with a single interior knot; using all the middle points
    # as knots (knots = x[1:-1]) is not working yet
    knots = np.array([x[1]])
    # low weights on the interior samples, full weight on the two end points
    weights = np.concatenate(([1], np.ones(x.shape[0] - 2) * .01, [1]))
    tck = interpolate.splrep(x, y, t=knots, w=weights)
    x3 = np.linspace(0, 10, 200)
    y3 = interpolate.splev(x3, tck)

    # plot
    plt.plot(x, y, 'go', x2, y2, 'b', x3, y3, 'r')
    plt.show()

Note that outside of the base interval, results differ. This is because BSpline extrapolates the first and last polynomial pieces of the B-spline functions active on the base interval.

This is the result of solving the problem:

We have seen how to compute interpolating functions using the polynomial and univariate interpolation routines in SciPy. If you found this tutorial useful, do check out the book SciPy Recipes to get quick recipes for performing other mathematical operations such as differential equations, K-means, and the Discrete Fourier Transform.
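The quadratic spline on the base interval 2 <= x <= 4 that the recipe describes can be sketched with scipy.interpolate.BSpline. This is a minimal illustration, not the book's own listing: the knots and coefficients are assumed values chosen for the example, and basis and naive_spline are hypothetical helper names implementing the Cox-de Boor recursion for the basis functions in the formula above.

```python
import numpy as np
from scipy.interpolate import BSpline


def basis(x, k, i, t):
    """Cox-de Boor recursion: i-th B-spline basis function of degree k."""
    if k == 0:
        return 1.0 if t[i] <= x < t[i + 1] else 0.0
    left = 0.0
    if t[i + k] != t[i]:
        left = (x - t[i]) / (t[i + k] - t[i]) * basis(x, k - 1, i, t)
    right = 0.0
    if t[i + k + 1] != t[i + 1]:
        right = ((t[i + k + 1] - x) / (t[i + k + 1] - t[i + 1])
                 * basis(x, k - 1, i + 1, t))
    return left + right


def naive_spline(x, t, c, k):
    """Evaluate S(x) = sum_j c_j * B_{j,k;t}(x) term by term."""
    n = len(t) - k - 1
    return sum(c[j] * basis(x, k, j, t) for j in range(n))


k = 2                        # quadratic spline
t = [0, 1, 2, 3, 4, 5, 6]    # assumed knots; base interval is t[k]..t[n] = 2..4
c = [-1, 2, 0, -1]           # assumed coefficients, n = len(t) - k - 1 = 4
spl = BSpline(t, c, k)

# Inside the base interval the two evaluations should agree.
inside = np.linspace(2.0, 3.9, 20)
max_diff = max(abs(float(spl(x)) - naive_spline(x, t, c, k)) for x in inside)
print(max_diff)
```

Restricting the comparison to points inside the base interval is deliberate: outside it, BSpline extrapolates the end polynomial pieces, while the naive sum simply adds up whichever basis functions are non-zero there.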

Sugandha Lahoti
25 Jan 2018
9 min read

How to execute a search query in ElasticSearch

Note: This post is an excerpt from the book Elasticsearch 5.x Cookbook, authored by Alberto Paro. It has over 170 advanced recipes to search, analyze, deploy, manage, and monitor data effectively with Elasticsearch 5.x.

In this article, we see how to execute and view a search operation in Elasticsearch. Elasticsearch was born as a search engine. Its main purpose is to process queries and give results. In this article, we'll see that a search in Elasticsearch is not limited to matching documents; it can also calculate additional information required to improve the search quality. All the code in this article is available on PacktPub or GitHub, including the scripts that initialize all the required data.

Getting ready
You will need an up-and-running Elasticsearch installation. To execute curl via a command line, you will also need to install curl for your operating system. To correctly execute the following commands, you will need an index populated with the chapter_05/populate_query.sh script available in the online code.
The mapping used in all the queries and searches in this article is the following:

    {
      "mappings": {
        "test-type": {
          "properties": {
            "pos": { "type": "integer", "store": "yes" },
            "uuid": { "store": "yes", "type": "keyword" },
            "parsedtext": { "term_vector": "with_positions_offsets", "store": "yes", "type": "text" },
            "name": { "term_vector": "with_positions_offsets", "store": "yes", "fielddata": true, "type": "text", "fields": { "raw": { "type": "keyword" } } },
            "title": { "term_vector": "with_positions_offsets", "store": "yes", "type": "text", "fielddata": true, "fields": { "raw": { "type": "keyword" } } }
          }
        },
        "test-type2": { "_parent": { "type": "test-type" } }
      }
    }

How to do it
To execute the search and view the results, we will perform the following steps:

1. From the command line, we can execute a search as follows:

    curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search' -d '{"query":{"match_all":{}}}'

In this case, we have used a match_all query, which means return all the documents.

2. If everything works, the command will return the following:

    {
      "took" : 2,
      "timed_out" : false,
      "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 },
      "hits" : {
        "total" : 3,
        "max_score" : 1.0,
        "hits" : [
          { "_index" : "test-index", "_type" : "test-type", "_id" : "1", "_score" : 1.0, "_source" : {"position": 1, "parsedtext": "Joe Testere nice guy", "name": "Joe Tester", "uuid": "11111"} },
          { "_index" : "test-index", "_type" : "test-type", "_id" : "2", "_score" : 1.0, "_source" : {"position": 2, "parsedtext": "Bill Testere nice guy", "name": "Bill Baloney", "uuid": "22222"} },
          { "_index" : "test-index", "_type" : "test-type", "_id" : "3", "_score" : 1.0, "_source" : {"position": 3, "parsedtext": "Bill is notn nice guy", "name": "Bill Clinton", "uuid": "33333"} }
        ]
      }
    }

These results contain a lot of information:
took is the time, in milliseconds, required to execute the query.
timed_out indicates whether a timeout occurred during the search.
This is related to the timeout parameter of the search. If a timeout occurs, you will get partial or no results.
_shards is the status of the shards, divided into:
total, which is the number of shards.
successful, which is the number of shards in which the query was successful.
failed, which is the number of shards in which the query failed because some error or exception occurred during the query.
hits are the results, which are composed of the following:
total is the number of documents that match the query.
max_score is the match score of the first document. It is usually one if no match scoring was computed, for example in sorting or filtering.
hits, which is a list of result documents.

The resulting document has a lot of fields that are always available and others that depend on search parameters. The most important fields are as follows:
_index: The index that contains the document
_type: The type of the document
_id: The ID of the document
_source: The document source (this is the default field returned, but it can be disabled)
_score: The query score of the document
sort: If the document is sorted, the values that are used for sorting
highlight: Highlighted segments, if highlighting was requested
fields: Some fields can be retrieved without needing to fetch all the source objects

How it works
The HTTP method to execute a search is GET (although POST also works); the REST endpoints are as follows:

    http://<server>/_search
    http://<server>/<index_name(s)>/_search
    http://<server>/<index_name(s)>/<type_name(s)>/_search

Note: Not all HTTP clients allow you to send data via a GET call, so the best practice, if you need to send body data, is to use the POST call.

Multiple indices and types are comma-separated. If an index or a type is defined, the search is limited only to them. One or more aliases can be used as index names.
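As a rough illustration of the three endpoint forms and of the response fields described above, here is a small Python sketch. The helper names search_url and summarize are hypothetical, the rule that a type requires an index is an assumption made for this sketch, and the response dictionary is a trimmed copy of the sample output shown earlier rather than a live result.

```python
def search_url(server, indices=(), types=()):
    """Build a _search endpoint from comma-separated index and type names."""
    if types and not indices:
        # Assumption for this sketch: a type is only used together with an index.
        raise ValueError("a type must be given together with an index")
    parts = [server.rstrip("/")]
    if indices:
        parts.append(",".join(indices))
    if types:
        parts.append(",".join(types))
    parts.append("_search")
    return "/".join(parts)


def summarize(response):
    """Pull the most commonly used values out of a search response."""
    hits = response["hits"]
    return {
        "timed_out": response["timed_out"],
        "failed_shards": response["_shards"]["failed"],
        "total": hits["total"],
        "ids": [hit["_id"] for hit in hits["hits"]],
    }


# Trimmed copy of the sample response from the article.
sample = {
    "took": 2,
    "timed_out": False,
    "_shards": {"total": 5, "successful": 5, "failed": 0},
    "hits": {
        "total": 3,
        "max_score": 1.0,
        "hits": [
            {"_index": "test-index", "_type": "test-type", "_id": "1", "_score": 1.0},
            {"_index": "test-index", "_type": "test-type", "_id": "2", "_score": 1.0},
            {"_index": "test-index", "_type": "test-type", "_id": "3", "_score": 1.0},
        ],
    },
}

print(search_url("http://127.0.0.1:9200", ["test-index"], ["test-type"]))
print(summarize(sample))
```

Checking timed_out and the failed shard count before trusting total is a cheap way to notice partial results.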
The core query is usually contained in the body of the GET/POST call, but a lot of options can also be expressed as URI query parameters, such as the following:

q: This is the query string to do simple string queries, as follows:
    curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?q=uuid:11111'
df: This is the default field to be used within the query, as follows:
    curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?df=uuid&q=11111'
from (the default value is 0): The start index of the hits.
size (the default value is 10): The number of hits to be returned.
analyzer: The default analyzer to be used.
default_operator (the default value is OR): This can be set to AND or OR.
explain: This allows the user to return information about how the score is calculated, as follows:
    curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?q=parsedtext:joe&explain=true'
stored_fields: This allows the user to define fields that must be returned, as follows:
    curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?q=parsedtext:joe&stored_fields=name'
sort (the default value is score): This allows the user to change the order of the documents. Sorting is ascending by default; if you need to change the order, add desc to the field, as follows:
    curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?sort=name.raw:desc'
timeout (not active by default): This defines the timeout for the search. Elasticsearch tries to collect results until the timeout. If a timeout is fired, all the hits accumulated so far are returned.
search_type: This defines the search strategy. A reference is available in the online Elasticsearch documentation at https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-search-type.html.
track_scores (the default value is false): If true, this tracks the score and allows it to be returned with the hits. It's used in conjunction with sort, because sorting by default prevents the return of a match score.
pretty (the default value is false): If true, the results will be pretty printed.

Generally, the query, contained in the body of the search, is a JSON object. The body of the search is the core of Elasticsearch's search functionalities; the list of search capabilities grows with every release. For the current version (5.x) of Elasticsearch, the available parameters are as follows:

query: This contains the query to be executed. Later in this chapter, we will see how to create different kinds of queries to cover several scenarios.
from and size: These allow the user to control pagination. The from parameter defines the start position of the hits to be returned (default 0) and size defines how many hits to return (default 10).
Note: The pagination is applied to the currently returned search results. Firing the same query can bring different results if a lot of records have the same score or a new document is ingested. If you need to process all the result documents without repetition, you need to execute scan or scroll queries.
sort: This allows the user to change the order of the matched documents.
post_filter: This allows the user to filter out the query results without affecting the aggregation count. It's usually used for filtering by facet values.
_source: This allows the user to control the returned source. It can be disabled (false), partially returned (obj.*) or use multiple exclude/include rules. This functionality can be used instead of fields to return values (for complete coverage of this, take a look at the online Elasticsearch reference at http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-source-filtering.html).
fielddata_fields: This allows the user to return a field data representation of the field.
stored_fields: This controls the fields to be returned.
Note: Returning only the required fields reduces the network and memory usage, improving the performance.
The suggested way to retrieve custom fields is to use the _source filtering function because it doesn't need to use Elasticsearch's extra resources.
aggregations/aggs: These control the aggregation layer analytics. These will be discussed in the next chapter.
index_boost: This allows the user to define the per-index boost value. It is used to increase/decrease the score of results in boosted indices.
highlighting: This allows the user to define fields and settings to be used for calculating a query abstract.
version (the default value is false): This adds the version of a document to the results.
rescore: This allows the user to define an extra query to be used in the score to improve the quality of the results. The rescore query is executed on the hits that match the first query and filter.
min_score: If this is given, all the result documents that have a score lower than this value are rejected.
explain: This returns information on how the TF/IDF score is calculated for a particular document.
script_fields: This defines a script that computes extra fields via scripting to be returned with a hit.
suggest: If given a query and a field, this returns the most significant terms related to this query. This parameter allows the user to implement Google-like "did you mean" functionality.
search_type: This defines how Elasticsearch should process a query.
scroll: This controls the scrolling in scroll/scan queries. The scroll allows the user to have an Elasticsearch equivalent of a DBMS cursor.
_name: This returns, for every hit, the named queries that it matched. It's very useful if you have a Boolean query and you want the name of the matched query.
search_after: This allows the user to skip results using the most efficient way of scrolling.
preference: This allows the user to select which shard/s to use for executing the query.

We saw how to execute a search in Elasticsearch and also learned how it works.
To learn more about how to perform other operations in Elasticsearch, check out the book Elasticsearch 5.x Cookbook.
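To make the request body parameters above more concrete, the following Python sketch assembles a search body that combines query, from/size pagination, sort, and _source filtering. The helper build_search_body and its page and page_size arguments are hypothetical names introduced for this illustration; the field names come from the mapping used in this article. Nothing is sent to a server here, the body is only printed as JSON.

```python
import json


def build_search_body(query=None, page=0, page_size=10, sort=None, source=None):
    """Assemble a search request body; pages are zero-based."""
    body = {
        # Fall back to match_all, as in the article's first example.
        "query": query if query is not None else {"match_all": {}},
        # from is the offset of the first hit, size the number of hits.
        "from": page * page_size,
        "size": page_size,
    }
    if sort is not None:
        body["sort"] = sort
    if source is not None:
        body["_source"] = source
    return body


body = build_search_body(
    query={"match": {"parsedtext": "joe"}},
    page=2,
    page_size=10,
    sort=[{"name.raw": {"order": "desc"}}],
    source=["name", "uuid"],
)
print(json.dumps(body, indent=2))
```

The printed JSON could then be passed as the -d argument of a curl call like the ones shown earlier, preferably with POST when the client cannot send a body with GET.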
Fatema Patrawala
10 Apr 2019
10 min read

Online Safety vs Free Speech: UK’s "Online Harms" white paper divides the internet and puts tech companies in government crosshairs

The internet is an integral part of everyday life for so many people. It has definitely added a new dimension to the spaces of imagination in which we all live. But it seems the problems of the offline world have moved there, too. As the internet continues to grow and transform our lives, often for the better, we should not ignore the very real harms which people face online every day. Lawmakers around the world are taking decisive action to make people safer online.

On Monday, Europe drafted the EU Regulation on preventing the dissemination of terrorist content online. Last week, the Australian parliament passed legislation to crack down on violent videos on social media. Recently Sen. Elizabeth Warren, a US 2020 presidential hopeful, proposed building strong antitrust laws and breaking up big tech companies like Amazon, Google, Facebook and Apple. On 3 April, Warren introduced the Corporate Executive Accountability Act, a new piece of legislation that would make it easier to criminally charge company executives when Americans' personal data is breached. Last year, the German parliament enacted the NetzDG law, requiring large social media sites to remove posts that violate certain provisions of the German code, including broad prohibitions on "defamation of religion," "hate speech," and "insult."

And here's yet another tech regulation announcement: on Monday, the UK government announced a white paper on online harms. The Department for Digital, Culture, Media and Sport (DCMS) has proposed an independent watchdog that will write a "code of practice" for tech companies. According to Jeremy Wright, Secretary of State for Digital, Culture, Media and Sport, and Sajid Javid, Home Secretary, “nearly nine in ten UK adults and 99% of 12 to 15 year olds are online. Two thirds of adults in the UK are concerned about content online, and close to half say they have seen hateful content in the past year.
The tragic recent events in New Zealand show just how quickly horrific terrorist and extremist content can spread online.” They further emphasized that such harmful behaviours and content must not be allowed to undermine the significant benefits that the digital revolution can offer. The white paper therefore puts forward ambitious plans for a new system of accountability and oversight for tech companies, moving far beyond self-regulation. It includes a new regulatory framework for online safety which will clarify companies' responsibilities to keep UK users safer online, with the most robust action to counter illegal content and activity.

The paper suggests three major steps for tech regulation:
establishing an independent regulator that can write a "code of practice" for social networks and internet companies
giving the regulator enforcement powers, including the ability to fine companies that break the rules
considering additional enforcement powers, such as the ability to fine company executives and force internet service providers to block sites that break the rules

Outlining the proposals, Culture Secretary Jeremy Wright discussed the size of the fines with the BBC: "If you look at the fines available to the Information Commissioner around the GDPR rules, that could be up to 4% of company's turnover... we think we should be looking at something comparable here."

What kinds of 'online harms' are cited in the paper?
The paper covers a range of issues that are clearly defined in law, such as spreading terrorist content, child sex abuse, so-called revenge pornography, hate crimes, harassment and the sale of illegal goods. It also covers harmful behaviour that has a less clear legal definition, such as cyber-bullying, trolling and the spread of fake news and disinformation.
On online child sexual exploitation and abuse (CSEA), the paper cites over 18.4 million referrals of child sexual abuse material by US tech companies to the National Center for Missing and Exploited Children (NCMEC) in 2018. Of those, 113,948 were UK-related referrals, up from 82,109 in 2017. In the third quarter of 2018, Facebook reported removing 8.7 million pieces of content globally for breaching policies on child nudity and sexual exploitation.

Another type of online harm occurs when terrorists use online services to spread their vile propaganda and mobilise support. The paper emphasizes that terrorist content online threatens the UK's national security and the safety of the public. As an example, it notes that the five terrorist attacks in the UK during 2017 all had an online element. Online terrorist content remains a feature of contemporary radicalisation. It is seen across terrorist investigations, including cases where suspects have become very quickly radicalised to the point of planning attacks. This is partly a result of the continued availability and deliberately attractive format of the terrorist material they are accessing online.

The paper further suggests that social networks must tackle material that advocates self-harm and suicide, which became a prominent issue after 14-year-old Molly Russell took her own life in 2017. After she died, her family found distressing material about depression and suicide on her Instagram account. Molly's father, Ian Russell, holds the social media giant partly responsible for her death. Home Secretary Sajid Javid said tech giants and social media companies had a moral duty "to protect the young people they profit from". He added: “Despite our repeated calls to action, harmful and illegal content - including child abuse and terrorism - is still too readily available online.”

What does the new proposal suggest to tackle online harm?
The paper calls for an independent regulator to hold internet companies to account.
It does not specify whether a new body will be established or an existing one will be handed new powers. The regulator will define a "code of best practice" that social networks and internet companies must adhere to. It applies to tech companies like Facebook, Twitter and Google, and the rules would also apply to messaging services such as WhatsApp and Snapchat and to cloud storage services. The regulator will have the power to fine companies and publish notices naming and shaming those that break the rules. The paper says the government is also considering fines for individual company executives and making search engines remove links to offending websites, and is consulting on blocking harmful websites.

Another area discussed in the paper is developing a culture of transparency, trust and accountability as a critical element of the new regulatory framework. The regulator will have the power to require annual transparency reports from companies in scope, outlining the prevalence of harmful content on their platforms and what measures they are taking to address it. These reports will be published online by the regulator, so that users can make informed decisions about online use. Additionally, it suggests the spread of fake news could be tackled by forcing social networks to employ fact-checkers and promote legitimate news sources.

How it plans to deploy technology as part of the solution
The paper mentions that companies should invest in the development of safety technologies to reduce the burden on users to stay safe online. In November 2018, for example, the Home Secretary co-hosted a hackathon with five major technology companies to develop a new tool to identify online grooming. The government proposes to license this tool for free to other companies, and plans more such innovative and collaborative efforts with them.
The government also plans to work with the industry and civil society to develop a safety by design framework, linking up with existing legal obligations around data protection by design and secure by design principles. This will make it easier for startups and small businesses to embed safety during the development or update of products and services. It also plans to understand how AI can best be used to detect, measure and counter online harms, while ensuring its deployment remains safe and ethical. A new project led by the Alan Turing Institute is setting out to address this issue. The 'Hate Speech: Measures and Counter-measures' project will use a mix of natural language processing techniques and qualitative analyses to create tools which identify and categorize different strengths and types of online hate speech. Other plans include launching online safety apps that will combine state-of-the-art machine-learning technology to track children's activity on their smartphones with the ability for children to self-report their emotional state.

Why is the white paper receiving critical comments?
The paper seems to be a welcome step towards sane internet regulation and looks sensible at first glance. But in places it has been regarded as too ambitious, and in others as unrealistically feeble. It reflects the conflicting political pressures under which it has been generated. TechUK, an umbrella group representing the UK's technology industry, said the government must be "clear about how trade-offs are balanced between harm prevention and fundamental rights". Jim Killock, executive director of Open Rights Group, said the government's proposals would "create state regulation of the speech of millions of British citizens". Matthew Lesh, head of research at free market think tank the Adam Smith Institute, went further, saying "The government should be ashamed of themselves for leading the western world in internet censorship.
The proposals are a historic attack on freedom of speech and the free press. At a time when Britain is criticising violations of freedom of expression in states like Iran, China and Russia, we should not be undermining our freedom at home."

No one doubts the harm done by child sexual abuse or terrorist propaganda online, but these things are already illegal. The difficulty is enforcement, which the white paper does nothing to address. Effective enforcement would demand a great deal of money and human time. The present system relies on a mixture of human reporting and algorithms. The algorithms can be fooled without too much trouble: 300,000 of the 1.5m copies of the Christchurch terrorist videos that were uploaded to Facebook within 24 hours of the crime were undetected by automated systems.

Apart from this, there is criticism of the vision of the white paper, which says it wants "A free, open and secure internet with freedom of expression online" "where companies take effective steps to keep their users safe". But it does not explain how it is going to protect free expression, which seems to be in contradiction with the proposed regulation.

https://twitter.com/jimkillock/status/1115253155007205377

Beyond this, there is a conceptual problem. Much of the harm done on and by social media does not come from deliberate criminality, but from ordinary people released from the constraints of civility. It is here that the white paper fails most seriously. It talks about material, such as “intimidation, disinformation, the advocacy of self-harm”, that is harmful but not illegal, yet proposes to regulate it in the same way as material which is both. Even leaving aside politically motivated disinformation, this is an area where much deeper and clearer thought is needed.

https://twitter.com/guy_herbert/status/1115180765128667137

There is no doubt that some forms of disinformation do serious harm both to individuals and to society as a whole.
And regulating the internet is necessary, but it won’t be easy or cheap. Too much of this white paper looks like an attempt to find cheap and easy solutions to really hard questions.

Tech companies in EU to face strict regulation on Terrorist content: One hour take down limit; Upload filters and private Terms of Service
Tech regulation to an extent of sentence jail: Australia’s ‘Sharing of Abhorrent Violent Material Bill’ to Warren’s ‘Corporate Executive Accountability Act’
How social media enabled and amplified the Christchurch terrorist attack

Preetham Sreenivas
29 Sep 2016
10 min read

Deep Learning with Torch

Torch is a scientific computing framework built on top of Lua[JIT]. The nn package and the ecosystem around it provide a very powerful framework for building deep learning models, striking a perfect balance between speed and flexibility. It is used at Facebook AI Research (FAIR), Twitter Cortex, DeepMind, Yann LeCun's group at NYU, Fei-Fei Li's at Stanford, and many more industrial and academic labs. If you are like me, and don't like writing equations for backpropagation every time you want to try a simple model, Torch is a great solution. With Torch, you can also do pretty much anything you can imagine, whether that is writing custom loss functions, dreaming up an arbitrary acyclic graph network, using multiple GPUs, or loading models pre-trained on ImageNet from the Caffe model zoo (yes, you can load models trained in Caffe with a single line). Without further ado, let's jump right into the awesome world of deep learning.

Prerequisites

• Some knowledge of deep learning: A Primer, Bengio's deep learning book, Hinton's Coursera course.
• A bit of Lua. Its syntax is very C-like and can be picked up fairly quickly if you know Python or JavaScript: Learn Lua in 15 minutes, Torch For Numpy Users.
• A machine with Torch installed, since this is intended to be hands-on.

On Ubuntu 12+ and Mac OS X, installing Torch looks like this:

    # in a terminal, run the commands WITHOUT sudo
    $ git clone https://github.com/torch/distro.git ~/torch --recursive
    $ cd ~/torch; bash install-deps;
    $ ./install.sh
    # On Linux with bash
    $ source ~/.bashrc
    # On OSX or in Linux with no bash.
    $ source ~/.profile

Once you’ve installed Torch, you can run a Torch script using:

    $ th script.lua
    # alternatively you can fire up a terminal torch interpreter using th -i
    $ th -i
    # and run multiple scripts one by one, the variables will be accessible to other scripts
    > dofile 'script1.lua'
    > dofile 'script2.lua'
    > print(variable) -- variable from either of these scripts.
The sections below are very code intensive, but you can run these commands from Torch's terminal interpreter.

    $ th -i

Building a Model: The Basics

A module is the basic building block of any Torch model. It has forward and backward methods for the forward and backward passes of backpropagation. You can combine modules using containers, and of course, calling forward and backward on a container propagates inputs and gradients correctly.

    -- A simple mlp model with sigmoids
    require 'nn'
    linear1 = nn.Linear(100,10) -- A linear layer Module
    linear2 = nn.Linear(10,2)
    -- You can combine modules using containers, sequential is the most used one
    model = nn.Sequential() -- A container
    model:add(linear1)
    model:add(nn.Sigmoid())
    model:add(linear2)
    model:add(nn.Sigmoid())
    -- the forward step
    input = torch.rand(100)
    target = torch.rand(2)
    output = model:forward(input)

Now we need a criterion to measure how well our model is performing, in other words, a loss function. nn.Criterion is the abstract class that all loss functions inherit from. It provides forward and backward methods, computing the loss and the gradients respectively. Torch provides most of the commonly used criterions out of the box. It isn't much of an effort to write your own either.

    criterion = nn.MSECriterion() -- mean squared error criterion
    loss = criterion:forward(output,target)
    gradientsAtOutput = criterion:backward(output,target)
    -- To perform the backprop step, we need to pass these gradients to the backward
    -- method of the model
    model:zeroGradParameters() -- clear the gradient buffers before accumulating into them
    gradAtInput = model:backward(input,gradientsAtOutput)
    lr = 0.1 -- learning rate for our model
    model:updateParameters(lr) -- updates the parameters using the lr parameter.

The updateParameters method just subtracts the gradients, scaled by the learning rate, from the model parameters. This is vanilla stochastic gradient descent. Typically, the updates we do are more complex. For example, if we want to use momentum, we need to keep track of the updates we did in the previous steps.
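To make the momentum remark concrete, here is a minimal sketch in plain Lua (no Torch needed) of one common formulation of SGD with momentum. The helper names sgdMomentumStep and minimizeQuadratic and the toy objective f(x) = x^2 are made up for this illustration; this is not how the optim package is implemented.

```lua
-- Illustrative only: SGD with momentum on a plain Lua table of numbers.
-- sgdMomentumStep is a hypothetical helper, not part of Torch.
function sgdMomentumStep(params, grads, velocity, lr, momentum)
  for i = 1, #params do
    -- the velocity keeps a decaying sum of past gradients
    velocity[i] = momentum * velocity[i] + grads[i]
    params[i] = params[i] - lr * velocity[i]
  end
end

-- Toy problem: minimize f(x) = x^2, whose gradient is 2x, starting at x = 1.
function minimizeQuadratic(steps, lr, momentum)
  local params, velocity = {1.0}, {0.0}
  for _ = 1, steps do
    local grads = {2 * params[1]}
    sgdMomentumStep(params, grads, velocity, lr, momentum)
  end
  return params[1]
end

print(minimizeQuadratic(200, 0.1, 0.9)) -- should be very close to 0
```

The velocity table is the extra state that has to survive between updates, which is why an optimizer with momentum needs a state table in addition to the parameters.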
There are many fancier optimization schemes, such as RMSProp, Adam, Adagrad, and L-BFGS, that do more complex things like adapting the learning rate, the momentum factor, and so on. The optim package provides optimization routines out of the box.

Dataset

We'll use the German Traffic Sign Recognition Benchmark (GTSRB) dataset. This dataset has 43 classes of traffic signs of varying sizes, illuminations and occlusions. There are roughly 39,000 training images and 12,600 test images. Traffic signs in each of the images are not centered and they have a 10% border around them. I have included a shell script for downloading the data along with the code for this tutorial in this github repo.[1]

    git clone https://github.com/preethamsp/tutorial.gtsrb.torch.git
    cd tutorial.gtsrb.torch/datasets
    bash download_gtsrb.sh

[1] The code in the repo is much more polished than the snippets in the tutorial. It is modular and allows you to change the model and/or datasets easily.

Model

Let's build a downsized vgg style model with what we've learned.

    function createModel()
      require 'nn'
      nbClasses = 43 -- global: reused later when building the confusion matrix
      local net = nn.Sequential()
      --[[building block: adds a convolution layer, batch norm layer
          and a relu activation to the net]]--
      local function ConvBNReLU(nInputPlane, nOutputPlane)
        -- kernel size = (3,3), stride = (1,1), padding = (1,1)
        net:add(nn.SpatialConvolution(nInputPlane, nOutputPlane, 3,3, 1,1, 1,1))
        net:add(nn.SpatialBatchNormalization(nOutputPlane,1e-3))
        net:add(nn.ReLU(true))
      end
      ConvBNReLU(3,32)
      ConvBNReLU(32,32)
      net:add(nn.SpatialMaxPooling(2,2,2,2))
      net:add(nn.Dropout(0.2))
      ConvBNReLU(32,64)
      ConvBNReLU(64,64)
      net:add(nn.SpatialMaxPooling(2,2,2,2))
      net:add(nn.Dropout(0.2))
      ConvBNReLU(64,128)
      ConvBNReLU(128,128)
      net:add(nn.SpatialMaxPooling(2,2,2,2))
      net:add(nn.Dropout(0.2))
      net:add(nn.View(128*6*6))
      net:add(nn.Dropout(0.5))
      net:add(nn.Linear(128*6*6,512))
      net:add(nn.BatchNormalization(512))
      net:add(nn.ReLU(true))
      net:add(nn.Linear(512,nbClasses))
      net:add(nn.LogSoftMax())
      return net
    end

The first layer contains three input channels because we're going to pass RGB images (three channels). For grayscale images, the first layer would have one input channel. I encourage you to play around and modify the network.[2]

[2] You'll find that the model code (models/vgg_small.lua) in the repo is different. It is designed to allow you to experiment quickly.

There are a bunch of new modules that need some elaboration. The Dropout module randomly deactivates a neuron with some probability. It is known to help generalization by preventing co-adaptation between neurons; that is, a neuron should now depend less on its peers, forcing it to learn a bit more. BatchNormalization is a very recent development. It is known to speed up convergence by normalizing the outputs of a layer to a unit Gaussian using the statistics of a batch.

Let’s use this model and train it. In the interest of brevity, I'll use the following data-loading constructs directly. The code describing them is in datasets/gtsrb.lua.

    DataGen:trainGenerator(batchSize)
    DataGen:valGenerator(batchSize)

These provide iterators over batches of train and test data respectively.
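As a rough illustration of the normalization that BatchNormalization performs, the sketch below standardizes one batch of scalar activations in plain Lua. It assumes a single feature and leaves out the learnable scale and shift as well as the running statistics used at test time; batchNormalize and its eps default are hypothetical names for this example, not Torch's API.

```lua
-- Illustrative only: normalize a batch of scalar activations to
-- zero mean and (approximately) unit variance.
function batchNormalize(batch, eps)
  eps = eps or 1e-5 -- small constant that avoids division by zero
  local n = #batch

  local mean = 0
  for i = 1, n do mean = mean + batch[i] end
  mean = mean / n

  local var = 0
  for i = 1, n do var = var + (batch[i] - mean) ^ 2 end
  var = var / n -- biased (population) variance of the batch

  local out = {}
  for i = 1, n do
    out[i] = (batch[i] - mean) / math.sqrt(var + eps)
  end
  return out
end

local normalized = batchNormalize({1, 2, 3, 4})
for i, v in ipairs(normalized) do print(i, v) end
```

Because the statistics come from the batch itself, the same input can be normalized differently depending on what else is in the batch, which is one reason such layers behave differently in training and evaluation modes.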
Using optim to train the model

Using stochastic gradient descent (sgd) from the optim package to minimize a function f looks like this:

    optim.sgd(feval, params, optimState)

Where:

• feval: A user-defined function that respects the API: f, df/dparams = feval(params)
• params: The current parameter vector (a 1D torch.Tensor)
• optimState: A table of parameters and state variables, dependent upon the algorithm

Since we are optimizing the loss of the neural network, params should be the weights and other parameters of the network. We get these as a flattened 1D tensor using model:getParameters. It also returns a tensor containing the gradients of these parameters. This is useful in creating the feval function above.

    require 'optim'
    model = createModel()
    criterion = nn.ClassNLLCriterion() -- criterion we are optimizing: negative log-likelihood loss
    params, gradParams = model:getParameters()
    function feval()
      -- criterion.output stores the latest output of criterion
      return criterion.output, gradParams
    end

We need to create an optimState table and initialize it with the configuration of our optimizer, such as the learning rate and momentum:

    optimState = {
      learningRate = 0.01,
      momentum = 0.9,
      dampening = 0.0,
      nesterov = true,
    }

Now, an update to the model should do the following:

1. Compute the output of the model using model:forward().
2. Compute the loss and the gradients at the output layer using criterion:forward() and criterion:backward() respectively.
3. Clear the old gradients with model:zeroGradParameters(), then update the gradients of the model parameters using model:backward().
4. Update the model using optim.sgd.

    -- Forward pass
    output = model:forward(input)
    loss = criterion:forward(output, target)
    -- Backward pass
    model:zeroGradParameters() -- clear grads from previous update
    critGrad = criterion:backward(output, target)
    model:backward(input, critGrad)
    -- Updates
    optim.sgd(feval, params, optimState)

Note: The order above should be respected, as backward assumes forward was run just before it. Changing this order might result in gradients not being computed correctly.
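The feval contract can be sketched without Torch at all. In the plain-Lua toy below, toySgd is a hypothetical stand-in for an optimizer that follows the same calling convention (it is not the optim implementation, and it ignores momentum): it asks feval for the loss and the gradient at the current parameters, then updates the parameters in place. Unlike the feval above, this one computes the loss and gradient itself each time it is called.

```lua
-- Illustrative only: a toy optimizer with an optim-style calling convention.
-- toySgd is a hypothetical helper; params is a plain Lua table of numbers.
function toySgd(feval, params, state)
  local loss, grads = feval(params)
  for i = 1, #params do
    params[i] = params[i] - state.learningRate * grads[i]
  end
  return params, loss
end

-- Fit y = w * x to a single data point (x = 2, y = 6) with squared error.
local x, y = 2, 6
function feval(params)
  local err = params[1] * x - y
  local loss = err * err
  local grads = {2 * err * x} -- d(loss)/dw
  return loss, grads
end

weights = {0}
for _ = 1, 50 do
  toySgd(feval, weights, {learningRate = 0.05})
end
print(weights[1]) -- approaches 3, the exact solution of 2 * w = 6
```

Keeping the loss computation behind a function lets the optimizer stay generic: it never needs to know what model or criterion produced the numbers.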
Putting it all together

Let's put it all together and write a function that trains the model for an epoch. We'll create a loop that iterates over the train data in batches and updates the model.

    require 'optim'
    model = createModel()
    criterion = nn.ClassNLLCriterion()
    dataGen = DataGen('datasets/GTSRB/') -- Data generator
    params, gradParams = model:getParameters()
    batchSize = 32
    optimState = {
      learningRate = 0.01,
      momentum = 0.9,
      dampening = 0.0,
      nesterov = true,
    }

    function train()
      -- Dropout and BN behave differently during training and testing
      -- So, switch to training mode
      model:training()
      local function feval()
        return criterion.output, gradParams
      end
      for input, target in dataGen:trainGenerator(batchSize) do
        -- Forward pass
        local output = model:forward(input)
        local loss = criterion:forward(output, target)
        -- Backward pass
        model:zeroGradParameters() -- clear grads from previous update
        local critGrad = criterion:backward(output, target)
        model:backward(input, critGrad)
        -- Updates
        optim.sgd(feval, params, optimState)
      end
    end

The test function is extremely similar, except that we don't need to update the parameters:

    confusion = optim.ConfusionMatrix(nbClasses) -- to calculate accuracies

    function test()
      model:evaluate() -- switch to evaluate mode
      confusion:zero() -- clear confusion matrix
      for input, target in dataGen:valGenerator(batchSize) do
        local output = model:forward(input)
        confusion:batchAdd(output, target)
      end
      confusion:updateValids()
      local test_acc = confusion.totalValid * 100
      print(('Test accuracy: %.2f'):format(test_acc))
    end

Now that everything is set, you can train your network and print the test accuracies:

    max_epoch = 20
    for i = 1, max_epoch do
      train()
      test()
    end

An epoch takes around 30 seconds on a TitanX and gives about 97.7% accuracy after 20 epochs. This is a very basic model and honestly I haven't tried optimizing the parameters much. There are a lot of things that can be done to crank up the accuracy. Try different processing procedures.
Experiment with the net structure. Try different weight initializations and learning rate schedules. Build an ensemble of different models; for example, train multiple models and take a majority vote. You can have a look at the state of the art on this dataset here. They achieve upwards of 99.5% accuracy using a clever method to boost the geometric variation of CNNs.

Conclusion

We looked at how to build a basic mlp in Torch. We then moved on to building a Convolutional Neural Network and trained it to solve a real-world problem of traffic sign recognition. For a beginner, Torch/Lua might not be as easy. But once you get the hang of it, you have access to a deep learning framework which is very flexible yet fast. You will be able to easily reproduce the latest research or try new stuff, unlike in rigid frameworks like keras or nolearn. I encourage you to give it a fair try if you are going anywhere near deep learning.

Resources

• Torch Cheat Sheet
• Awesome Torch
• Torch Blog
• Facebook's Resnet Code
• Oxford's ML Course Practicals
• Learn torch from Github repos

About the author

Preetham Sreenivas is a data scientist at Fractal Analytics. Prior to that, he was a software engineer at Directi.