Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Events
Videos
Audiobooks
Packt Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds

How-To Tutorials

7018 Articles
article-image-building-surveys-using-xcode
Packt
19 Jan 2016
14 min read
Save for later

Building Surveys using Xcode

Packt
19 Jan 2016
14 min read
In this article by Dhanushram Balachandran and Edward Cessna author of book Getting Started with ResearchKit, you can find the Softwareitis.xcodeproj project in the Chapter_3/Softwareitis folder of the RKBook GitHub repository (https://github.com/dhanushram/RKBook/tree/master/Chapter_3/Softwareitis). (For more resources related to this topic, see here.) Now that you have learned about the results of tasks from the previous section, we can modify the Softwareitis project to incorporate processing of the task results. In the TableViewController.swift file, let's update the rows data structure to include the reference for processResultsMethod: as shown in the following: //Array of dictionaries. Each dictionary contains [ rowTitle : (didSelectRowMethod, processResultsMethod) ] var rows : [ [String : ( didSelectRowMethod:()->(), processResultsMethod:(ORKTaskResult?)->() )] ] = [] Update the ORKTaskViewControllerDelegate method taskViewController(taskViewController:, didFinishWithReason:, error:) in TableViewController to call processResultsMethod, as shown in the following: func taskViewController(taskViewController: ORKTaskViewController, didFinishWithReason reason: ORKTaskViewControllerFinishReason, error: NSError?) { if let indexPath = tappedIndexPath { //1 let rowDict = rows[indexPath.row] if let tuple = rowDict.values.first { //2 tuple.processResultsMethod(taskViewController.result) } } dismissViewControllerAnimated(true, completion: nil) } Retrieves the dictionary of the tapped row and its associated tuple containing the didSelectRowMethod and processResultsMethod references from rows. Invokes the processResultsMethod with taskViewController.result as the parameter. Now, we are ready to create our first survey. In Survey.swift, under the Surveys folder, you will find two methods defined in the TableViewController extension: showSurvey() and processSurveyResults(). These are the methods that we will be using to create the survey and process the results. Instruction step Instruction step is used to show instruction or introductory content to the user at the beginning or middle of a task. It does not produce any result as its an informational step. We can create an instruction step using the ORKInstructionStep object. It has title and detailText properties to set the appropriate content. It also has the image property to show an image. The ORKCompletionStep is a special type of ORKInstructionStep used to show the completion of a task. The ORKCompletionStep shows an animation to indicate the completion of the task along with title and detailText, similar to ORKInstructionStep. In creating our first Softwareitis survey, let's use the following two steps to show the information: func showSurvey() { //1 let instStep = ORKInstructionStep(identifier: "Instruction Step") instStep.title = "Softwareitis Survey" instStep.detailText = "This survey demonstrates different question types." //2 let completionStep = ORKCompletionStep(identifier: "Completion Step") completionStep.title = "Thank you for taking this survey!" //3 let task = ORKOrderedTask(identifier: "first survey", steps: [instStep, completionStep]) //4 let taskViewController = ORKTaskViewController(task: task, taskRunUUID: nil) taskViewController.delegate = self presentViewController(taskViewController, animated: true, completion: nil) } The explanation of the preceding code is as follows: Creates an ORKInstructionStep object with an identifier "Instruction Step" and sets its title and detailText properties. Creates an ORKCompletionStep object with an identifier "Completion Step" and sets its title property. Creates an ORKOrderedTask object with the instruction and completion step as its parameters. Creates an ORKTaskViewController object with the ordered task that was previously created and presents it to the user. Let's update the processSurveyResults method to process the results of the instruction step and the completion step as shown in the following: func processSurveyResults(taskResult: ORKTaskResult?) { if let taskResultValue = taskResult { //1 print("Task Run UUID : " + taskResultValue.taskRunUUID.UUIDString) print("Survey started at : (taskResultValue.startDate!) Ended at : (taskResultValue.endDate!)") //2 if let instStepResult = taskResultValue.stepResultForStepIdentifier("Instruction Step") { print("Instruction Step started at : (instStepResult.startDate!) Ended at : (instStepResult.endDate!)") } //3 if let compStepResult = taskResultValue.stepResultForStepIdentifier("Completion Step") { print("Completion Step started at : (compStepResult.startDate!) Ended at : (compStepResult.endDate!)") } } } The explanation of the preceding code is given in the following: As mentioned at the beginning, each task run is associated with a UUID. This UUID is available in the taskRunUUID property, which is printed in the first line. The second line prints the start and end date of the task. These are useful user analytics data with regards to how much time the user took to finish the survey. Obtains the ORKStepResult object corresponding to the instruction step using the stepResultForStepIdentifier method of the ORKTaskResult object. Prints the start and end date of the step result, which shows the amount of time for which the instruction step was shown before the user pressed the Get Started or Cancel buttons. Note that, as mentioned earlier, ORKInstructionStep does not produce any results. Therefore, the results property of the ORKStepResult object will be nil. You can use a breakpoint to stop the execution at this line of code and verify it. Obtains the ORKStepResult object corresponding to the completion step. Similar to the instruction step, this prints the start and end date of the step. The preceding code produces screens as shown in the following image: After the Done button is pressed in the completion step, Xcode prints the output that is similar to the following: Task Run UUID : 0A343E5A-A5CD-4E7C-88C6-893E2B10E7F7 Survey started at : 2015-08-11 00:41:03 +0000     Ended at : 2015-08-11 00:41:07 +0000Instruction Step started at : 2015-08-11 00:41:03 +0000   Ended at : 2015-08-11 00:41:05 +0000Completion Step started at : 2015-08-11 00:41:05 +0000   Ended at : 2015-08-11 00:41:07 +0000 Question step Question steps make up the body of a survey. ResearchKit supports question steps with various answer types such as boolean (Yes or No), numeric input, date selection, and so on. Let's first create a question step with the simplest boolean answer type by inserting the following line of code in showSurvey(): let question1 = ORKQuestionStep(identifier: "question 1", title: "Have you ever been diagnosed with Softwareitis?", answer: ORKAnswerFormat.booleanAnswerFormat()) The preceding code creates a ORKQuestionStep object with identifier question 1, title with the question, and an ORKBooleanAnswerFormat object created using the booleanAnswerFormat() class method of ORKAnswerFormat. The answer type for a question is determined by the type of the ORKAnswerFormat object that is passed in the answer parameter. The ORKAnswerFormat has several subclasses such as ORKBooleanAnswerFormat, ORKNumericAnswerFormat, and so on. Here, we are using ORKBooleanAnswerFormat. Don't forget to insert the created question step in the ORKOrderedTask steps parameter by updating the following line: let task = ORKOrderedTask(identifier: "first survey", steps: [instStep, question1, completionStep]) When you run the preceding changes in Xcode and start the survey, you will see the question step with the Yes or No options. We have now successfully added a boolean question step to our survey, as shown in the following image: Now, its time to process the results of this question step. The result is produced in an ORKBooleanQuestionResult object. Insert the following lines of code in processSurveyResults(): //1 if let question1Result = taskResultValue.stepResultForStepIdentifier("question 1")?.results?.first as? ORKBooleanQuestionResult { //2 if question1Result.booleanAnswer != nil { let answerString = question1Result.booleanAnswer!.boolValue ? "Yes" : "No" print("Answer to question 1 is (answerString)") } else { print("question 1 was skipped") } } The explanation of the preceding code is as follows: Obtains the ORKBooleanQuestionResult object by first obtaining the step result using the stepResultForStepIdentifier method, accessing its results property, and finally obtaining the only ORKBooleanQuestionResult object available in the results array. The booleanAnswer property of ORKBooleanQuestionResult contains the user's answer. We will print the answer if booleanAnswer is non-nil. If booleanAnswer is nil, it indicates that the user has skipped answering the question by pressing the Skip this question button. You can disable the skipping-of-a-question step by setting its optional property to false. We can add the numeric and scale type question steps using the following lines of code in showSurvey(): //1 let question2 = ORKQuestionStep(identifier: "question 2", title: "How many apps do you download per week?", answer: ORKAnswerFormat.integerAnswerFormatWithUnit("Apps per week")) //2 let answerFormat3 = ORKNumericAnswerFormat.scaleAnswerFormatWithMaximumValue(10, minimumValue: 0, defaultValue: 5, step: 1, vertical: false, maximumValueDescription: nil, minimumValueDescription: nil) let question3 = ORKQuestionStep(identifier: "question 3", title: "How many apps do you download per week (range)?", answer: answerFormat3) The explanation of the preceding code is as follows: Creates ORKQuestionStep with the ORKNumericAnswerFormat object, created using the integerAnswerFormatWithUnit method with Apps per week as the unit. Feel free to refer to the ORKNumericAnswerFormat documentation for decimal answer format and other validation options that you can use. First creates ORKScaleAnswerFormat with minimum and maximum values and step. Note that the number of step increments required to go from minimumValue to maximumValue cannot exceed 10. For example, maximum value of 100 and minimum value of 0 with a step of 1 is not valid and ResearchKit will raise an exception. The step needs to be at least 10. In the second line, ORKScaleAnswerFormat is fed in the ORKQuestionStep object. The following lines in processSurveyResults() process the results from the number and the scale questions: //1 if let question2Result = taskResultValue.stepResultForStepIdentifier("question 2")?.results?.first as? ORKNumericQuestionResult { if question2Result.numericAnswer != nil { print("Answer to question 2 is (question2Result.numericAnswer!)") } else { print("question 2 was skipped") } } //2 if let question3Result = taskResultValue.stepResultForStepIdentifier("question 3")?.results?.first as? ORKScaleQuestionResult { if question3Result.scaleAnswer != nil { print("Answer to question 3 is (question3Result.scaleAnswer!)") } else { print("question 3 was skipped") } } The explanation of the preceding code is as follows: Question step with ORKNumericAnswerFormat generates the result with the ORKNumericQuestionResult object. The numericAnswer property of ORKNumericQuestionResult contains the answer value if the question is not skipped by the user. The scaleAnswer property of ORKScaleQuestionResult contains the answer for a scale question. As you can see in the following image, the numeric type question generates a free form text field to enter the value, while scale type generates a slider: Let's look at a slightly complicated question type with ORKTextChoiceAnswerFormat. In order to use this answer format, we need to create the ORKTextChoice objects before hand. Each text choice object provides the necessary data to act as a choice in a single choice or multiple choice question. The following lines in showSurvey() create a single choice question with three options: //1 let textChoice1 = ORKTextChoice(text: "Games", detailText: nil, value: 1, exclusive: false) let textChoice2 = ORKTextChoice(text: "Lifestyle", detailText: nil, value: 2, exclusive: false) let textChoice3 = ORKTextChoice(text: "Utility", detailText: nil, value: 3, exclusive: false) //2 let answerFormat4 = ORKNumericAnswerFormat.choiceAnswerFormatWithStyle(ORKChoiceAnswerStyle.SingleChoice, textChoices: [textChoice1, textChoice2, textChoice3]) let question4 = ORKQuestionStep(identifier: "question 4", title: "Which category of apps do you download the most?", answer: answerFormat4) The explanation of the preceding code is as follows: Creates text choice objects with text and value. When a choice is selected, the object in the value property is returned in the corresponding ORKChoiceQuestionResult object. The exclusive property is used in multiple choice questions context. Refer to the documentation for its use. First, creates an ORKChoiceAnswerFormat object with the text choices that were previously created and specifies a single choice type using the ORKChoiceAnswerStyle enum. You can easily change this question to multiple choice question by changing the ORKChoiceAnswerStyle enum to multiple choice. Then, an ORKQuestionStep object is created using the answer format object. Processing the results from a single or multiple choice question is shown in the following. Needless to say, this code goes in the processSurveyResults() method: //1 if let question4Result = taskResultValue.stepResultForStepIdentifier("question 4")?.results?.first as? ORKChoiceQuestionResult { //2 if question4Result.choiceAnswers != nil { print("Answer to question 4 is (question4Result.choiceAnswers!)") } else { print("question 4 was skipped") } } The explanation of the preceding code is as follows: The result for a single or multiple choice question is returned in an ORKChoiceQuestionResult object. The choiceAnswers property holds the array of values for the chosen options. The following image shows the generated choice question UI for the preceding code: There are several other question types, which operate in a very similar manner like the ones we discussed so far. You can find them in the documentations of ORKAnswerFormat and ORKResult classes. The Softwareitis project has implementation of two additional types: date format and time interval format. Using custom tasks, you can create surveys that can skip the display of certain questions based on the answers that the users have provided so far. For example, in a smoking habits survey, if the user chooses "I do not smoke" option, then the ability to not display the "How many cigarettes per day?" question. Form step A form step allows you to combine several related questions in a single scrollable page and reduces the number of the Next button taps for the user. The ORKFormStep object is used to create the form step. The questions in the form are represented using the ORKFormItem objects. The ORKFormItem is similar to ORKQuestionStep, in which it takes the same parameters (title and answer format). Let's create a new survey with a form step by creating a form.swift extension file and adding the form entry to the rows array in TableViewController.swift, as shown in the following: func setupTableViewRows() { rows += [ ["Survey" : (didSelectRowMethod: self.showSurvey, processResultsMethod: self.processSurveyResults)], //1 ["Form" : (didSelectRowMethod: self.showForm, processResultsMethod: self.processFormResults)] ] } The explanation of the preceding code is as follows: The "Form" entry added to the rows array to create a new form survey with the showForm() method to show the form survey and the processFormResults() method to process the results from the form. The following code shows the showForm() method in Form.swift file: func showForm() { //1 let instStep = ORKInstructionStep(identifier: "Instruction Step") instStep.title = "Softwareitis Form Type Survey" instStep.detailText = "This survey demonstrates a form type step." //2 let question1 = ORKFormItem(identifier: "question 1", text: "Have you ever been diagnosed with Softwareitis?", answerFormat: ORKAnswerFormat.booleanAnswerFormat()) let question2 = ORKFormItem(identifier: "question 2", text: "How many apps do you download per week?", answerFormat: ORKAnswerFormat.integerAnswerFormatWithUnit("Apps per week")) //3 let formStep = ORKFormStep(identifier: "form step", title: "Softwareitis Survey", text: nil) formStep.formItems = [question1, question2] //1 let completionStep = ORKCompletionStep(identifier: "Completion Step") completionStep.title = "Thank you for taking this survey!" //4 let task = ORKOrderedTask(identifier: "survey with form", steps: [instStep, formStep, completionStep]) let taskViewController = ORKTaskViewController(task: task, taskRunUUID: nil) taskViewController.delegate = self presentViewController(taskViewController, animated: true, completion: nil) } The explanation of the preceding code is as follows: Creates an instruction and a completion step, similar to the earlier survey. Creates two ORKFormItem objects using the questions from the earlier survey. Notice the similarity with the ORKQuestionStep constructors. Creates ORKFormStep object with an identifier form step and sets the formItems property of the ORKFormStep object with the ORKFormItem objects that are created earlier. Creates an ordered task using the instruction, form, and completion steps and presents it to the user using a new ORKTaskViewController object. The results are processed using the following processFormResults() method: func processFormResults(taskResult: ORKTaskResult?) { if let taskResultValue = taskResult { //1 if let formStepResult = taskResultValue.stepResultForStepIdentifier("form step"), formItemResults = formStepResult.results { //2 for result in formItemResults { //3 switch result { case let booleanResult as ORKBooleanQuestionResult: if booleanResult.booleanAnswer != nil { let answerString = booleanResult.booleanAnswer!.boolValue ? "Yes" : "No" print("Answer to (booleanResult.identifier) is (answerString)") } else { print("(booleanResult.identifier) was skipped") } case let numericResult as ORKNumericQuestionResult: if numericResult.numericAnswer != nil { print("Answer to (numericResult.identifier) is (numericResult.numericAnswer!)") } else { print("(numericResult.identifier) was skipped") } default: break } } } } } The explanation of the preceding code is as follows: Obtains the ORKStepResult object of the form step and unwraps the form item results from the results property. Iterates through each of the formItemResults, each of which will be the result for a question in the form. The switch statement detects the different types of question results and accesses the appropriate property that contains the answer. The following image shows the form step: Considerations for real world surveys Many clinical research studies that are conducted using a pen and paper tend to have well established surveys. When you try to convert these surveys to ResearchKit, they may not convert perfectly. Some questions and answer choices may have to be reworded so that they can fit on a phone screen. You are advised to work closely with the clinical researchers so that the changes in the surveys still produce comparable results with their pen and paper counterparts. Another aspect to consider is to eliminate some of the survey questions if the answers can be found elsewhere in the user's device. For example, age, blood type, and so on, can be obtained from HealthKit if the user has already set them. This will help in improving the user experience of your app. Summary Here we have learned to build surveys using Xcode. Resources for Article: Further resources on this subject: Signing up to be an iOS developer[article] Code Sharing Between iOS and Android[article] Creating a New iOS Social Project[article]
Read more
  • 0
  • 0
  • 16980

article-image-working-with-pandas-dataframes
Sugandha Lahoti
23 Feb 2018
15 min read
Save for later

Working with pandas DataFrames

Sugandha Lahoti
23 Feb 2018
15 min read
[box type="note" align="" class="" width=""]This article is an excerpt from the book Python Data Analysis - Second Edition written by Armando Fandango. From this book, you will learn how to process and manipulate data with Python for complex data analysis and modeling. Code bundle for this article is hosted on GitHub.[/box] The popular open source Python library, pandas is named after panel data (an econometric term) and Python data analysis. We shall learn about basic panda functionalities, data structures, and operations in this article. The official pandas documentation insists on naming the project pandas in all lowercase letters. The other convention the pandas project insists on, is the import pandas as pd import statement. We will follow these conventions in this text. In this tutorial, we will install and explore pandas. We will also acquaint ourselves with the a central pandas data structure–DataFrame. Installing and exploring pandas The minimal dependency set requirements for pandas is given as follows: NumPy: This is the fundamental numerical array package that we installed and covered extensively in the preceding chapters python-dateutil: This is a date handling library pytz: This handles time zone definitions This list is the bare minimum; a longer list of optional dependencies can be located at http://pandas.pydata.org/pandas-docs/stable/install.html. We can install pandas via PyPI with pip or easy_install, using a binary installer, with the aid of our operating system package manager, or from the source by checking out the code. The binary installers can be downloaded from http://pandas.pydata.org/getpandas.html. The command to install pandas with pip is as follows: $ pip3 install pandas rpy2 rpy2 is an interface to R and is required because rpy is being deprecated. You may have to prepend the preceding command with sudo if your user account doesn't have sufficient rights. The pandas DataFrames A pandas DataFrame is a labeled two-dimensional data structure and is similar in spirit to a worksheet in Google Sheets or Microsoft Excel, or a relational database table. The columns in pandas DataFrame can be of different types. A similar concept, by the way, was invented originally in the R programming language. (For more information, refer to http://www.r-tutor.com/r-introduction/data-frame). A DataFrame can be created in the following ways: Using another DataFrame. Using a NumPy array or a composite of arrays that has a two-dimensional shape. Likewise, we can create a DataFrame out of another pandas data structure called Series. We will learn about Series in the following section. A DataFrame can also be produced from a file, such as a CSV file. From a dictionary of one-dimensional structures, such as one-dimensional NumPy arrays, lists, dicts, or pandas Series. As an example, we will use data that can be retrieved from http://www.exploredata.net/Downloads/WHO-Data-Set. The original data file is quite large and has many columns, so we will use an edited file instead, which only contains the first nine columns and is called WHO_first9cols.csv; the file is in the code bundle of this book. These are the first two lines, including the header: Country,CountryID,Continent,Adolescent fertility rate (%),Adult literacy rate (%),Gross national income per capita (PPP international $),Net primary school enrolment ratio female (%),Net primary school enrolment ratio male (%),Population (in thousands) totalAfghanistan,1,1,151,28,,,,26088 In the next steps, we will take a look at pandas DataFrames and its attributes: To kick off, load the data file into a DataFrame and print it on the screen: from pandas.io.parsers import read_csv df = read_csv("WHO_first9cols.csv") print("Dataframe", df) The printout is a summary of the DataFrame. It is too long to be displayed entirely, so we will just grab the last few lines: 199 21732.0 200 11696.0 201 13228.0 [202 rows x 9 columns] The DataFrame has an attribute that holds its shape as a tuple, similar to ndarray. Query the number of rows of a DataFrame as follows: print("Shape", df.shape) print("Length", len(df)) The values we obtain comply with the printout of the preceding step: Shape (202, 9) Length 202 Check the column header and data types with the other attributes: print("Column Headers", df.columns) print("Data types", df.dtypes) We receive the column headers in a special data structure: Column Headers Index([u'Country', u'CountryID', u'Continent', u'Adolescent fertility rate (%)', u'Adult literacy rate (%)', u'Gross national income per capita (PPP international $)', u'Net primary school enrolment ratio female (%)', u'Net primary school enrolment ratio male (%)', u'Population (in thousands) total'], dtype='object') The data types are printed as follows: 4. The pandas DataFrame has an index, which is like the primary key of relational database tables. We can either specify the index or have pandas create it automatically. The index can be accessed with a corresponding property, as follows: Print("Index", df.index) An index helps us search for items quickly, just like the index in this book. In our case, the index is a wrapper around an array starting at 0, with an increment of one for each row: Sometimes, we wish to iterate over the underlying data of a DataFrame. Iterating over column values can be inefficient if we utilize the pandas iterators. It's much better to extract the underlying NumPy arrays and work with those. The pandas DataFrame has an attribute that can aid with this as well: print("Values", df.values) Please note that some values are designated nan in the output, for 'not a number'. These values come from empty fields in the input datafile: The preceding code is available in Python Notebook ch-03.ipynb, available in the code bundle of this book. Querying data in pandas Since a pandas DataFrame is structured in a similar way to a relational database, we can view operations that read data from a DataFrame as a query. In this example, we will retrieve the annual sunspot data from Quandl. We can either use the Quandl API or download the data manually as a CSV file from http://www.quandl.com/SIDC/SUNSPOTS_A-Sunspot-Numbers-Annual. If you want to install the API, you can do so by downloading installers from https://pypi.python.org/pypi/Quandl or by running the following command: $ pip3 install Quandl Using the API is free, but is limited to 50 API calls per day. If you require more API calls, you will have to request an authentication key. The code in this tutorial is not using a key. It should be simple to change the code to either use a key or read a downloaded CSV file. If you have difficulties, search through the Python docs at https://docs.python.org/2/. Without further preamble, let's take a look at how to query data in a pandas DataFrame: As a first step, we obviously have to download the data. After importing the Quandl API, get the data as follows: import quandl # Data from http://www.quandl.com/SIDC/SUNSPOTS_A-Sunspot-Numbers-Annual # PyPi url https://pypi.python.org/pypi/Quandl sunspots = quandl.get("SIDC/SUNSPOTS_A") The head() and tail() methods have a purpose similar to that of the Unix commands with the same name. Select the first n and last n records of a DataFrame, where n is an integer parameter: print("Head 2", sunspots.head(2) ) print("Tail 2", sunspots.tail(2)) This gives us the first two and last two rows of the sunspot data (for the sake of brevity we have not shown all the columns here; your output will have all the columns from the dataset): Head 2          Number Year 1700-12-31      5 1701-12-31     11 [2 rows x 1 columns] Tail 2          Number Year 2012-12-31 57.7 2013-12-31 64.9 [2 rows x 1 columns] Please note that we only have one column holding the number of sunspots per year. The dates are a part of the DataFrame index. The following is the query for the last value using the last date: last_date = sunspots.index[-1] print("Last value", sunspots.loc[last_date]) You can check the following output with the result from the previous step: Last value Number        64.9 Name: 2013-12-31 00:00:00, dtype: float64 Query the date with date strings in the YYYYMMDD format as follows: print("Values slice by date:n", sunspots["20020101": "20131231"]) This gives the records from 2002 through to 2013: Values slice by date                             Number Year 2002-12-31     104.0 [TRUNCATED] 2013-12-31       64.9 [12 rows x 1 columns] A list of indices can be used to query as well: print("Slice from a list of indices:n", sunspots.iloc[[2, 4, -4, -2]]) The preceding code selects the following rows: Slice from a list of indices                              Number Year 1702-12-31       16.0 1704-12-31       36.0 2010-12-31       16.0 2012-12-31       57.7 [4 rows x 1 columns] To select scalar values, we have two options. The second option given here should be faster. Two integers are required, the first for the row and the second for the column: print("Scalar with Iloc:", sunspots.iloc[0, 0]) print("Scalar with iat", sunspots.iat[1, 0]) This gives us the first and second values of the dataset as scalars: Scalar with Iloc 5.0 Scalar with iat 11.0 Querying with Booleans works much like the Where clause of SQL. The following code queries for values larger than the arithmetic mean. Note that there is a difference between when we perform the query on the whole DataFrame and when we perform it on a single column: print("Boolean selection", sunspots[sunspots > sunspots.mean()]) print("Boolean selection with column label:n", sunspots[sunspots['Number of Observations'] > sunspots['Number of Observations'].mean()]) The notable difference is that the first query yields all the rows, with some rows not conforming to the condition that has a value of NaN. The second query returns only the rows where the value is larger than the mean: Boolean selection                             Number Year 1700-12-31          NaN [TRUNCATED] 1759-12-31       54.0 ... [314 rows x 1 columns] Boolean selection with column label                              Number Year 1705-12-31       58.0 [TRUNCATED] 1870-12-31     139.1 ... [127 rows x 1 columns] The preceding example code is in the ch_03.ipynb file of this book's code bundle. Data aggregation with pandas DataFrames Data aggregation is a term used in the field of relational databases. In a database query, we can group data by the value in a column or columns. We can then perform various operations on each of these groups. The pandas DataFrame has similar capabilities. We will generate data held in a Python dict and then use this data to create a pandas DataFrame. We will then practice the pandas aggregation features: Seed the NumPy random generator to make sure that the generated data will not differ between repeated program runs. The data will have four columns: Weather (a string) Food (also a string) Price (a random float) Number (a random integer between one and nine) The use case is that we have the results of some sort of consumer-purchase research, combined with weather and market pricing, where we calculate the average of prices and keep a track of the sample size and parameters: import pandas as pd from numpy.random import seed from numpy.random import rand from numpy.random import rand_int import numpy as np seed(42) df = pd.DataFrame({'Weather' : ['cold', 'hot', 'cold','hot', 'cold', 'hot', 'cold'], 'Food' : ['soup', 'soup', 'icecream', 'chocolate', 'icecream', 'icecream', 'soup'], 'Price' : 10 * rand(7), 'Number' : rand_int(1, 9,)}) print(df) You should get an output similar to the following: Please note that the column labels come from the lexically ordered keys of the Python dict. Lexical or lexicographical order is based on the alphabetic order of characters in a string. Group the data by the Weather column and then iterate through the groups as follows: weather_group = df.groupby('Weather') i = 0 for name, group in weather_group: i = i + 1 print("Group", i, name) print(group) We have two types of weather, hot and cold, so we get two groups: The weather_group variable is a special pandas object that we get as a result of the groupby() method. This object has aggregation methods, which are demonstrated as follows: print("Weather group firstn", weather_group.first()) print("Weather group lastn", weather_group.last()) print("Weather group meann", weather_group.mean()) The preceding code snippet prints the first row, last row, and mean of each group: Just as in a database query, we are allowed to group on multiple columns. The groups attribute will then tell us the groups that are formed, as well as the rows in each group: wf_group = df.groupby(['Weather', 'Food']) print("WF Groups", wf_group.groups) For each possible combination of weather and food values, a new group is created. The membership of each row is indicated by their index values as follows: WF Groups {('hot', 'chocolate'): [3], ('cold', 'icecream'): [2, 4], ('hot', 'icecream'): [5], ('hot', 'soup'): [1], ('cold', 'soup'): [0, 6] 5. Apply a list of NumPy functions on groups with the agg() method: print("WF Aggregatedn", wf_group.agg([np.mean, np.median])) Obviously, we could apply even more functions, but it would look messier than the following output: Concatenating and appending DataFrames The pandas DataFrame allows operations that are similar to the inner and outer joins of database tables. We can append and concatenate rows as well. To practice appending and concatenating of rows, we will reuse the DataFrame from the previous section. Let's select the first three rows: print("df :3n", df[:3]) Check that these are indeed the first three rows: df :3 Food Number        Price Weather 0           soup              8 3.745401       cold 1           soup              5 9.507143         hot 2 icecream              4 7.319939       cold The concat() function concatenates DataFrames. For example, we can concatenate a DataFrame that consists of three rows to the rest of the rows, in order to recreate the original DataFrame: print("Concat Back togethern", pd.concat([df[:3], df[3:]])) The concatenation output appears as follows: Concat Back together Food Number Price Weather 0 soup 8 3.745401 cold 1 soup 5 9.507143 hot 2 icecream 4 7.319939 cold 3 chocolate 8 5.986585 hot 4 icecream 8 1.560186 cold 5 icecream 3 1.559945 hot 6 soup 6 0.580836 cold [7 rows x 4 columns] To append rows, use the append() function: print("Appending rowsn", df[:3].append(df[5:])) The result is a DataFrame with the first three rows of the original DataFrame and the last two rows appended to it: Appending rows Food Number Price Weather 0 soup 8 3.745401 cold 1 soup 5 9.507143 hot 2 icecream 4 7.319939 cold 5 icecream 3 1.559945 hot 6 soup 6 0.580836 cold [5 rows x 4 columns] Joining DataFrames To demonstrate joining, we will use two CSV files-dest.csv and tips.csv. The use case behind it is that we are running a taxi company. Every time a passenger is dropped off at his or her destination, we add a row to the dest.csv file with the employee number of the driver and the destination: EmpNr,Dest5,The Hague3,Amsterdam9,Rotterdam Sometimes drivers get a tip, so we want that registered in the tips.csv file (if this doesn't seem realistic, please feel free to come up with your own story): EmpNr,Amount5,109,57,2.5 Database-like joins in pandas can be done with either the merge() function or the join() DataFrame method. The join() method joins onto indices by default, which might not be what you want. In SQL a relational database query language we have the inner join, left outer join, right outer join, and full outer join. An inner join selects rows from two tables, if and only if values match, for columns specified in the join condition. Outer joins do not require a match, and can potentially return more rows. More information on joins can be found at http://en.wikipedia.org/wiki/Join_%28SQL%29. All these join types are supported by pandas, but we will only take a look at inner joins and full outer joins: A join on the employee number with the merge() function is performed as follows: print("Merge() on keyn", pd.merge(dests, tips, on='EmpNr')) This gives an inner join as the outcome: Merge() on key EmpNr            Dest           Amount 0 5 The Hague 10 1 9 Rotterdam 5 [2 rows x 3 columns] Joining with the join() method requires providing suffixes for the left and right operands: print("Dests join() tipsn", dests.join(tips, lsuffix='Dest', rsuffix='Tips')) This method call joins index values so that the result is different from an SQL inner join: Dests join() tips EmpNrDest Dest EmpNrTips Amount 0 5 The Hague 5 10.0 1 3 Amsterdam 9 5.0 2 9 Rotterdam 7 2.5 [3 rows x 4 columns] An even more explicit way to execute an inner join with merge() is as follows: print("Inner join with merge()n", pd.merge(dests, tips, how='inner')) The output is as follows: Inner join with merge() EmpNr            Dest           Amount 0 5 The Hague 10 1 9 Rotterdam 5 [2 rows x 3 columns] To make this a full outer join requires only a small change: print("Outer joinn", pd.merge(dests, tips, how='outer')) The outer join adds rows with NaN values: Outer join EmpNr            Dest            Amount 0 5 The Hague 10.0 1 3 Amsterdam NaN 2 9 Rotterdam 5.0 3 7 NaN            2.5 [4 rows x 3 columns] In a relational database query, these values would have been set to NULL. The demo code is in the ch-03.ipynb file of this book's code bundle. We learnt how to perform various data manipulation techniques such as aggregating, concatenating, appending, cleaning, and handling missing values, with pandas. If you found this post useful, check out the book Python Data Analysis - Second Edition to learn advanced topics such as signal processing, textual data analysis, machine learning, and more.  
Read more
  • 0
  • 0
  • 16966

article-image-r-statistical-package-interfacing-python
Janu Verma
17 Nov 2016
8 min read
Save for later

The R Statistical Package Interfacing with Python

Janu Verma
17 Nov 2016
8 min read
One of my coding hobbies is to explore different Python packages and libraries. In this post, I'll talk about the package rpy2, which is used to call R inside python. Being an avid user of R and a huge supporter of R graphical packages, I had always desired to call R inside my Python code to be able to produce beautiful visualizations. The R framework offers machinery for a variety of statistical and data mining tasks. Let's review the basics of R before we delve into R-Python interfacing. R is a statistical language which is free, is open source, and has comprehensive support for various statistical, data mining, and visualization tasks. Quick-R describes it as: "R is an elegant and comprehensive statistical and graphical programming language." R is one of the fastest growing languages, mainly due to the surge in interest in statistical learning and data science. The Data Science Specialization on Coursera has all courses taught in R. There are R packages for machine learning, graphics, text mining, bioinformatics, topics modeling, interactive visualizations, markdown, and many others. In this post, I'll give a quick introduction to R. The motivation is to acquire some knowledge of R to be able to follow the discussion on R-Python interfacing. Installing R R can be downloaded from one of the Comprehensive R Archive Network (CRAN) mirror sites. Running R To run R interactively on the command line, type r. Launch the standard GUI (which should have been included in the download) and type R code in it. RStudio is the most popular IDE for R. It is recommended, though not required, to install RStudio and run R on it. To write a file with R code, create a file with the .r extension (for example, myFirstCode.r). And run the code by typing the following on the terminal: Rscript file.r Basics of R The most fundamental data structure in R is a vector; actually everything in R is a vector (even numbers are 1-dimensional vectors). This is one of the strangest things about R. Vectors contain elements of the same type. A vector is created by using the c() function. a = c(1,2,5,9,11) a [1] 1 2 5 9 11 strings = c("aa", "apple", "beta", "down") strings [1] "aa" "apple" "beta" "down" The elements in a vector are indexed, but the indexing starts at 1 instead of 0, as in most major languages (for example, python). strings[1] [1] "aa" The fact that everything in R is a vector and that the indexing starts at 1 are the main reasons for people's initial frustration with R (I forget this all the time). Data Frames A lot of R packages expect data as a data frame, which are essentially matrices but the columns can be accessed by names. The columns can be of different types. Data frames are useful outside of R also. The Python package Pandas was written primarily to implement data frames and to do analysis on them. In R, data frames are created (from vectors) as follows: students = c("Anne", "Bret", "Carl", "Daron", "Emily") scores = c(7,3,4,9,8) grades = c('B', 'D', 'C', 'A', 'A') results = data.frame(students, scores, grades) results students scores grades 1 Anne 7 B 2 Bret 3 D 3 Carl 4 C 4 Daron 9 A 5 Emily 8 A The elements of a data frame can be accessed as: results$students [1] Anne Bret Carl Daron Emily Levels: Anne Bret Carl Daron Emily This gives a vector, the elements of which can be called by indexing. results$students[1] [1] Anne Levels: Anne Bret Carl Daron Emily Reading Files Most of the times the data is given as a comma-separated values (csv) file or a tab-separated values (tsv) file. We will see how to read a csv/tsv file in R and create a data frame from it. (Aside: The datasets in most Kaggle competitions are given as csv files and we are required to do machine learning on them. In Python, one creates a pandas data frame or a numpy array from this csv file.) In R, we use a read.csv or read.table command to load a csv file into memory, for example, for the Titanic competition on Kaggle: training_data <- read.csv("train.csv", header=TRUE) train <- data.frame(survived=train_all$Survived, age=train_all$Age, fare=train_all$Fare, pclass=train_all$Pclass) Similarly, a tsv file can be loaded as: data <- read.csv("file.tsv";, header=TRUE, delimiter="t") Thus given a csv/tsv file with or without headers, we can read it using the read.csv function and create a data frame using: data.frame(vector_1, vector_2, ... vector_n). This should be enough to start exploring R packages. Another command that is very useful in R is head(), which is similar to the less command on Unix. rpy2 First things first, we need to have both Python and R installed. Then install rpy2 from the Python package index (Pypi). To do this, simply type the following on the command line: pip install rpy2 We will use the high-level interface to R, the robjects subpackage of rpy2. import rpy2.robjects as ro We can pass commands to the R session by putting the R commands in the ro.r() method as strings. Recall that everything in R is a vector. Let's create a vector using robjects: ro.r('x=c(2,4,6,8)') print(ro.r('x')) [1] 2 4 6 8 Keep in mind that though x is an R object (vector), ro.r('x') is a Python object (rpy2 object). This can be checked as follows: type(ro.r('x')) <class 'rpy2.robjects.vectors.FloatVector'> The most important data types in R are data frames, which are essentially matrices. We can create a data frame using rpy2: ro.r('x=c(2,4,6,8)') ro.r('y=c(4,8,12,16)') ro.r('rdf=data.frame(x,y)') This created an R data frame, rdf. If we want to manipulate this data frame using Python, we need to convert it to a python object. We will convert the R data frame to a pandas data frame. The Python package pandas contains efficient implementations of data frame objects in python. import pandas.rpy.common as com df = com.load_data('rdf') print type(df) <class 'pandas.core.frame.DataFrame'> df.x = 2*df.x Here we have doubled each of the elements of the x vector in the data frame df. But df is a Python object, which we can convert back to an R data frame using pandas as: rdf = com.convert_to_r_dataframe(df) print type(rdf) <class 'rpy2.robjects.vectors.DataFrame'> Let's use the plotting machinery of R, which is the main purpose of studying rpy2: ro.r('plot(x,y)') Not only R data types, but rpy2 lets us import R packages as well (given that these packages are installed on R) and use them for analysis. Here we will build a linear model on x and y using the R package stats: from rpy2.robjects.packages import importr stats = importr('stats') base = importr('base') fit = stats.lm('y ~ x', data=rdf) print(base.summary(fit)) We get the following results: Residuals: 1 2 3 4 0 0 0 0 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0 0 NA NA x 2 0 Inf <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0 on 2 degrees of freedom Multiple R-squared: 1, Adjusted R-squared: 1 F-statistic: Inf on 1 and 2 DF, p-value: < 2.2e-16 R programmers will immediately recognize the output as coming from applying linear model function lm() on data. I'll end this discussion with an example using my favorite R package ggplot2. I have written a lot of posts on data visualization using ggplot2. The following example is borrowed from the official documentation of rpy2. import math, datetime import rpy2.robjects.lib.ggplot2 as ggplot2 import rpy2.robjects as ro from rpy2.robjects.packages import importr base = importr('base') datasets = importr('datasets') mtcars = datasets.data.fetch('mtcars')['mtcars'] pp = ggplot2.ggplot(mtcars) + ggplot2.aes_string(x='wt', y='mpg', col='factor(cyl)') + ggplot2.geom_point() + ggplot2.geom_smooth(ggplot2.aes_string(group = 'cyl'), method = 'lm') pp.plot() Author: Janu Verma is a researcher in the IBM T.J. Watson Research Center, New York. His research interests are in mathematics, machine learning, information visualization, computational biology, and healthcare analytics. He has held research positions at Cornell University, Kansas State University, Tata Institute of Fundamental Research, Indian Institute of Science, and the Indian Statistical Institute. He has written papers for IEEE Vis, KDD, International Conference on HealthCare Informatics, Computer Graphics and Applications, Nature Genetics, IEEE Sensors Journals and so on. His current focus is on the development of visual analytics systems for prediction and understanding. He advises start-ups and other companies on data science and machine learning in the Delhi-NCR area. He can be found at Here.
Read more
  • 0
  • 0
  • 16961

article-image-tuning-solr-jvm-and-container
Packt
22 Jul 2014
6 min read
Save for later

Tuning Solr JVM and Container

Packt
22 Jul 2014
6 min read
(For more resources related to this topic, see here.) Some of these JVMs are commercially optimized for production usage; you may find comparison studies at http://dior.ics.muni.cz/~makub/java/speed.html. Some of the JVM implementations provide server versions, which would be more appropriate than normal ones. Since Solr runs in JVM, all the standard optimizations for applications are applicable to it. It starts with choosing the right heap size for your JVM. The heap size depends upon the following aspects: Use of facets and sorting options Size of the Solr index Update frequencies on Solr Solr cache Heap size for JVM can be controlled by the following parameters: Parameter Description -Xms This is the minimum heap size required during JVM initialization, that is, container -Xmx This is the maximum heap size up to which the JVM or J2EE container can consume Deciding heap size Heap in JVM contributes as a major factor while optimizing the performance of any system. JVM uses heap to store its objects, as well as its own content. Poor allocation of JVM heap results in Java heap space OutOfMemoryError thrown at runtime crashing the application. When the heap is allocated with less memory, the application takes a longer time to initialize, as well as slowing the execution speed of the Java process during runtime. Similarly, higher heap size may underutilize expensive memory, which otherwise could have been used by the other application. JVM starts with initial heap size, and as the demand grows, it tries to resize the heap to accommodate new space requirements. If a demand for memory crosses the maximum limit, JVM throws an Out of Memory exception. The objects that expire or are unused, unnecessarily consume memory in JVM. This memory can be taken back by releasing these objects by a process called garbage collection. Although it's tricky to find out whether you should increase or reduce the heap size, there are simple ways that can help you out. In a memory graph, typically, when you start the Solr server and run your first query, the memory usage increases, and based on subsequent queries and memory size, the memory graph may increase or remain constant. When garbage collection is run automatically by the JVM container, it sharply brings down its usage. If it's difficult to trace GC execution from the memory graph, you can run Solr with the following additional parameters: -Xloggc:<some file> -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails If you are monitoring the heap usage continuously, you will find a graph that increases and decreases (sawtooth); the increase is due to the querying that is going on consistently demanding more memory by your Solr cache, and decrease is due to GC execution. In a running environment, the average heap size should not grow over time or the number of GC runs should be less than the number of queries executed on Solr. If that's not the case, you will need more memory. Features such as Solr faceting and sorting requires more memory on top of traditional search. If memory is unavailable, the operating system needs to perform hot swapping with the storage media, thereby increasing the response time; thus, users find huge latency while searching on large indexes. Many of the operating systems allow users to control swapping of programs. How can we optimize JVM? Whenever a facet query is run in Solr, memory is used to store each unique element in the index for each field. So, for example, a search over a small set of facet value (an year from 1980 to 2014) will consume less memory than a search with larger set of facet value, such as people's names (can vary from person to person). To reduce the memory usage, you may set the term index divisor to 2 (default is 4) by setting the following in solrconfig.xml: <indexReaderFactory name="IndexReaderFactory" class="solr.StandardIndexReaderFactory"> <int name="setTermIndexDivisor">2</int> </indexReaderFactory > From Solr 4.x onwards, the ability to set the min, max (term index divisor) block size ability is not available. This will reduce the memory usage for storing all the terms to half; however, it will double the seek time for terms and will impact a little on your search runtime. One of the causes of large heap is the size of index, so one solution is to introduce SolrCloud and the distributed large index into multiple shards. This will not reduce your memory requirement, but will spread it across the cluster. You can look at some of the optimized GC parameters described at http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning page. Similarly, Oracle provides a GC tuning guide for advanced development stages, and it can be seen at http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html. Additionally, you can look at the Solr performance problems at http://wiki.apache.org/solr/SolrPerformanceProblems. Optimizing JVM container JVM containers allow users to have their requests served in threads. This in turn enables JVM to support concurrent sessions created for different users connecting at the same time. The concurrency can, however, be controlled to reduce the load on the search server. If you are using Apache Tomcat, you can modify the following entries in server.xml for changing the number of concurrent connections: Similarly, in Jetty, you can control the number of connections held by modifying jetty.xml: Similarly, for other containers, these files can change appropriately. Many containers provide a cache on top of the application to avoid server hits. This cache can be utilized for static pages such as the search page. Containers such as Weblogic provide a development versus production mode. Typically, a development mode runs with 15 threads and a limited JDBC pool size by default, whereas, for a production mode, this can be increased. For tuning containers, besides standard optimization, specific performance-tuning guidelines should be followed, as shown in the following table: Container Performance tuning guide Jetty http://wiki.eclipse.org/Jetty/Howto/High_Load Tomcat http://www.mulesoft.com/tcat/tomcat-performance and http://javamaster.wordpress.com/2013/03/13/apache-tomcat-tuning-guide/ JBoss https://access.redhat.com/site/documentation/en-US/JBoss_Enterprise_Application_Platform/5/pdf/Performance_Tuning_Guide/JBoss_Enterprise_Application_Platform-5-Performance_Tuning_Guide-en-US.pdf Weblogic http://docs.oracle.com/cd/E13222_01/wls/docs92/perform/WLSTuning.html Websphere http://www.ibm.com/developerworks/websphere/techjournal/0909_blythe/0909_blythe.html Apache Solr works better with the default container it ships with, Jetty, since it offers a small footprint compared to other containers such as JBoss and Tomcat for which the memory required is a little higher. Summary In this article, we have learned about about Apache Solr which runs on the underlying JVM in the J2EE container and tuning containers. Resources for Article: Further resources on this subject: Apache Solr: Spellchecker, Statistics, and Grouping Mechanism [Article] Getting Started with Apache Solr [Article] Apache Solr PHP Integration [Article]
Read more
  • 0
  • 0
  • 16951

article-image-bias-variance-tradeoff-choose-bias-and-variance-machine-learning-model-tutorial
Savia Lobo
17 Sep 2018
15 min read
Save for later

Bias-Variance tradeoff: How to choose between bias and variance for your machine learning model [Tutorial]

Savia Lobo
17 Sep 2018
15 min read
This article is an excerpt taken from the book Hands-On Data Science and Python Machine Learning authored by Frank Kane. In this article, we're going to start by talking about the bias-variance trade-off, which is kind of a more principled way of talking about the different ways you might overfit and underfit data, and how it all interrelates with each other. We will later talk about the k-fold cross-validation technique, which is an important tool in your chest to combat overfitting and look at how to implement it using Python. Finally, we look at how to detect outliers and deal with them. Bias is just how far off you are from the correct values, that is, how good are your predictions overall in predicting the right overall value. If you take the mean of all your predictions, are they more or less on the right spot? Or are your errors all consistently skewed in one direction or another? If so, then your predictions are biased in a certain direction. Variance is just a measure of how spread out, or how scattered your predictions are. So, if your predictions are all over the place, then that's high variance. But, if they're very tightly focused on what the correct values are, or even an incorrect value in the case of high bias, then your variance is small. In reality, you often need to choose between bias and variance. It comes down to overfitting Vs underfitting your data. Let's take a look at the following example: It's a little bit of a different way of thinking of bias and variance. So, in the left graph, we have a straight line, and you can think of that as having very low variance, relative to these observations. So, there's not a lot of variance in this line, that is, there is low variance. But the bias, the error from each individual point, is actually high. Now, contrast that to the overfitted data in the graph at the right, where we've kind of gone out of our way to fit the observations. The line has high variance, but low bias, because each individual point is pretty close to where it should be. So, this is an example of where we traded off variance for bias. At the end of the day, you're not out to just reduce bias or just reduce variance, you want to reduce error. That's what really matters, and it turns out you can express error as a function of bias and variance: Looking at this, error is equal to bias squared plus variance. So, these things both contribute to the overall error, with bias actually contributing more. But keep in mind, it's error you really want to minimize, not the bias or the variance specifically, and that an overly complex model will probably end up having a high variance and low bias, whereas a too simple model will have low variance and high bias. However, they could both end up having similar error terms at the end of the day. You just have to find the right happy medium of these two things when you're trying to fit your data. This is bias-variance trade-off. You know the decision you have to make between how overall accurate your values are, and how spread out they are or how tightly clustered they are. That's the bias-variance trade-off and they both contribute to the overall error, which is the thing you really care about minimizing. So, keep those terms in mind! K-fold cross-validation to avoid overfitting Train and test as a good way of preventing overfitting and actually measuring how well your model can perform on data it's never seen before. We can take that to the next level with a technique called k-fold cross-validation. So, let's talk about this powerful tool in your arsenal for fighting overfitting; k-fold cross-validation and learn how that works. The idea, although it sounds complicated, is fairly simple: Instead of dividing our data into two buckets, one for training and one for testing, we divide it into K buckets. We reserve one of those buckets for testing purposes, for evaluating the results of our model. We train our model against the remaining buckets that we have, K-1, and then we take our test dataset and use that to evaluate how well our model did amongst all of those different training datasets. We average those resulting error metrics, that is, those r-squared values, together to get a final error metric from k-fold cross-validation. Example of k-fold cross-validation using scikit-learn Fortunately, scikit-learn makes this really easy to do, and it's even easier than doing normal train/test! It's extremely simple to do k-fold cross-validation, so you may as well just do it. Now, the way this all works in practice is you will have a model that you're trying to tune, and you will have different variations of that model, different parameters you might want to tweak on it, right? Like, for example, the degree of polynomial for a polynomial fit. So, the idea is to try different values of your model, different variations, measure them all using k-fold cross-validation, and find the one that minimizes error against your test dataset. That's kind of your sweet spot there. In practice, you want to use k-fold cross-validation to measure the accuracy of your model against a test dataset, and just keep refining that model, keep trying different values within it, keep trying different variations of that model or maybe even different models entirely, until you find the technique that reduces error the most, using k-fold cross validation. Please go ahead and open up the KFoldCrossValidation.ipynb and follow along if you will. We're going to look at the Iris dataset again; remember we introduced this when we talk about dimensionality reduction? We're going to use the SVC model. If you remember back again, that's just a way of classifying data that's pretty robust. There's a section on that if you need to go and refresh your memory: import numpy as np from sklearn import cross_validation from sklearn import datasets from sklearn import svm iris = datasets.load_iris() # Split the iris data into train/test data sets with #40% reserved for testing X_train, X_test, y_train, y_test = cross_validation.train_test_split(iris.data, iris.target, test_size=0.4, random_state=0) # Build an SVC model for predicting iris classifications #using training data clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train) # Now measure its performance with the test data clf.score(X_test, y_test) What we do is use the cross_validation library from scikit-learn, and we start by just doing a conventional train test split, just a single train/test split, and see how that will work. To do that we have a train_test_split() function that makes it pretty easy. So, the way this works is we feed into train_test_split() a set of feature data. iris.data just contains all the actual measurements of each flower. iris.target is basically the thing we're trying to predict. In this case, it contains all the species for each flower. test_size says what percentage do we want to train versus test. So, 0.4 means we're going to extract 40% of that data randomly for testing purposes, and use 60% for training purposes. What this gives us back is 4 datasets, basically, a training dataset and a test dataset for both the feature data and the target data. So, X_train ends up containing 60% of our Iris measurements, and X_test contains 40% of the measurements used for testing the results of our model. y_train and y_test contain the actual species for each one of those segments. Then after that we go ahead and build an SVC model for predicting Iris species given their measurements, and we build that only using the training data. We fit this SVC model, using a linear kernel, using only the training feature data, and the training species data, that is, target data. We call that model clf. Then, we call the score() function on clf to just measure its performance against our test dataset. So, we score this model against the test data we reserved for the Iris measurements, and the test Iris species, and see how well it does: It turns out it does really well! Over 96% of the time, our model is able to correctly predict the species of an Iris that it had never seen before, just based on the measurements of that Iris. So that's pretty cool! But, this is a fairly small dataset, about 150 flowers if I remember right. So, we're only using 60% of 150 flowers for training and only 40% of 150 flowers for testing. These are still fairly small numbers, so we could still be overfitting to our specific train/test split that we made. So, let's use k-fold cross-validation to protect against that. It turns out that using k-fold cross-validation, even though it's a more robust technique, is actually even easier to use than train/test. So, that's pretty cool! So, let's see how that works: # We give cross_val_score a model, the entire data set and its "real" values, and the number of folds: scores = cross_validation.cross_val_score(clf, iris.data, iris.target, cv=5) # Print the accuracy for each fold: print scores # And the mean accuracy of all 5 folds: print scores.mean() We have a model already, the SVC model that we defined for this prediction, and all you need to do is call cross_val_score() on the cross_validation package. So, you pass in this function a model of a given type (clf), the entire dataset that you have of all of the measurements, that is, all of my feature data (iris.data) and all of my target data (all of the species), iris.target. I want cv=5 which means it's actually going to use 5 different training datasets while reserving 1 for testing. Basically, it's going to run it 5 times, and that's all we need to do. That will automatically evaluate our model against the entire dataset, split up five different ways, and give us back the individual results. If we print back the output of that, it gives us back a list of the actual error metric from each one of those iterations, that is, each one of those folds. We can average those together to get an overall error metric based on k-fold cross-validation: When we do this over 5 folds, we can see that our results are even better than we thought! 98% accuracy. So that's pretty cool! In fact, in a couple of the runs we had perfect accuracy. So that's pretty amazing stuff. Now let's see if we can do even better. We used a linear kernel before, what if we used a polynomial kernel and got even fancier? Will that be overfitting or will it actually better fit the data that we have? That kind of depends on whether there's actually a linear relationship or polynomial relationship between these petal measurements and the actual species or not. So, let's try that out: clf = svm.SVC(kernel='poly', C=1).fit(X_train, y_train) scores = cross_validation.cross_val_score(clf, iris.data, iris.target, cv=5) print scores print scores.mean() We'll just run this all again, using the same technique. But this time, we're using a polynomial kernel. We'll fit that to our training dataset, and it doesn't really matter where you fit to in this case, because cross_val_score() will just keep re-running it for you: It turns out that when we use a polynomial fit, we end up with an overall score that's even lower than our original run. So, this tells us that the polynomial kernel is probably overfitting. When we use k-fold cross-validation it reveals an actual lower score than with our linear kernel. The important point here is that if we had just used a single train/test split, we wouldn't have realized that we were overfitting. We would have actually gotten the same result if we just did a single train/test split here as we did on the linear kernel. So, we might inadvertently be overfitting our data there, and not have even known it had we not use k-fold cross-validation. So, this is a good example of where k-fold comes to the rescue, and warns you of overfitting, where a single train/test split might not have caught that. So, keep that in your tool chest. If you want to play around with this some more, go ahead and try different degrees. So, you can actually specify a different number of degrees. The default is 3 degrees for the polynomial kernel, but you can try a different one, you can try two. Detecting outliers A common problem with real-world data is outliers. You'll always have some strange users or some strange agents that are polluting your data, that act abnormally and atypically from the typical user. They might be legitimate outliers; they might be caused by real people and not by some sort of malicious traffic, or fake data. So sometimes, it's appropriate to remove them, sometimes it isn't. Dealing with outliers So, let's take some example code, and see how you might handle outliers in practice. Let's mess around with some outliers. It's a pretty simple section. A little bit of review actually. If you want to follow along, we're in Outliers.ipynb. So, go ahead and open that up if you'd like: import numpy as np incomes = np.random.normal(27000, 15000, 10000) incomes = np.append(incomes, [1000000000]) import matplotlib.pyplot as plt plt.hist(incomes, 50) plt.show() What we're going to do is start off with a normal distribution of incomes here that are have a mean of $27,000 per year, with a standard deviation of 15,000. I'm going to create 10,000 fake Americans that have an income in that distribution. This is totally made-up data, by the way, although it's not that far off from reality. Then, I'm going to stick in an outlier - call it Donald Trump, who has a billion dollars. We're going to stick this guy in at the end of our dataset. So, we have a normally distributed dataset around $27,000, and then we're going to stick in Donald Trump at the end. We'll go ahead and plot that as a histogram: We have the entire normal distribution of everyone else in the country squeezed into one bucket of the histogram. On the other hand, we have Donald Trump out at the right side screwing up the whole thing at a billion dollars. The other problem too is that if I'm trying to answer the question how much money does the typical American make. If I take the mean to try and figure that out, it's not going to be a very good, useful number: incomes.mean () The output of the preceding code is as follows: 126892.66469341301 Donald Trump has pushed that number up all by himself to $126,000 and some odd of change, when I know that the real mean of my normally distributed data that excludes Donald Trump is only $27,000. So, the right thing to do there would be to use the median instead of the mean. A better thing to do would be to actually measure the standard deviation of your dataset, and identify outliers as being some multiple of a standard deviation away from the mean. So, following is a little function that I wrote that does just that. It's called reject_outliers(): def reject_outliers(data): u = np.median(data) s = np.std(data) filtered = [e for e in data if (u - 2 * s < e < u + 2 * s)] return filtered filtered = reject_outliers(incomes) plt.hist(filtered, 50) plt.show() It takes in a list of data and finds the median. It also finds the standard deviation of that dataset. So, I filter that out, so I only preserve data points that are within two standard deviations of the median for my data. So, I can use this handy dandy reject_outliers() function on my income data, to actually strip out weird outliers automatically: Sure enough, it works! I get a much prettier graph now that excludes Donald Trump and focuses in on the more typical dataset here in the center. So, pretty cool stuff! So, that's one example of identifying outliers, and automatically removing them, or dealing with them however you see fit. Remember, always do this in a principled manner. Don't just throw out outliers because they're inconvenient. Understand where they're coming from, and how they actually affect the thing you're trying to measure in spirit. By the way, our mean is also much more meaningful now; much closer to 27,000 that it should be, now that we've gotten rid of that outlier. In this article we came across the Bias-variance tradeoff and how to minimize the error. We also saw the concept of k-fold cross-validation and how to implement it in Python to prevent overfitting. If you've enjoyed this excerpt, head over to the book Hands-On Data Science and Python Machine Learning to prepare your data for analysis, training machine learning models, and visualizing the final data analysis, and much more. 20 lessons on bias in machine learning systems by Kate Crawford at NIPS 2017 Here’s how you can handle the bias variance trade-off in your ML models
Read more
  • 0
  • 0
  • 16949

article-image-using-firebase-real-time-database
Oliver Blumanski
18 Jan 2017
5 min read
Save for later

Using the Firebase Real-Time Database

Oliver Blumanski
18 Jan 2017
5 min read
In this post, we are going to look at how to use the Firebase real-time database, along with an example. Here we are writing and reading data from the database using multiple platforms. To do this, we first need a server script that is adding data, and secondly we need a component that pulls the data from the Firebase database. Step 1 - Server Script to collect data Digest an XML feed and transfer the data into the Firebase real-time database. The script runs as cronjob frequently to refresh the data. Step 2 - App Component Subscribe to the data from a JavaScript component, in this case, React-Native. About Firebase Now that those two steps are complete, let's take a step back and talk about Google Firebase. Firebase offers a range of services such as a real-time database, authentication, cloud notifications, storage, and much more. You can find the full feature list here. Firebase covers three platforms: iOS, Android, and Web. The server script uses the Firebases JavaScript Web API. Having data in this real-time database allows us to query the data from all three platforms (iOS, Android, Web), and in addition, the real-time database allows us to subscribe (listen) to a database path (query), or to query a path once. Step 1 - Digest XML feed and transfer into Firebase Firebase Set UpThe first thing you need to do is to set up a Google Firebase project here In the app, click on "Add another App" and choose Web, a pop-up will show you the configuration. You can copy paste your config into the example script. Now you need to set the rules for your Firebase database. You should make yourself familiar with the database access rules. In my example, the path latestMarkets/ is open for write and read. In a real-world production app, you would have to secure this, having authentication for the write permissions. Here are the database rules to get started: { "rules": { "users": { "$uid": { ".read": "$uid === auth.uid", ".write": "$uid === auth.uid" } }, "latestMarkets": { ".read": true, ".write": true } } } The Server Script Code The XML feed contains stock market data and is frequently changing, except on the weekend. To build the server script, some NPM packages are needed: Firebase Request xml2json babel-preset-es2015 Require modules and configure Firebase web api: const Firebase = require('firebase'); const request = require('request'); const parser = require('xml2json'); // firebase access config const config = { apiKey: "apikey", authDomain: "authdomain", databaseURL: "dburl", storageBucket: "optional", messagingSenderId: "optional" } // init firebase Firebase.initializeApp(config) [/Code] I write JavaScript code in ES6. It is much more fun. It is a simple script, so let's have a look at the code that is relevant to Firebase. The code below is inserting or overwriting data in the database. For this script, I am happy to overwrite data: Firebase.database().ref('latestMarkets/'+value.Symbol).set({ Symbol: value.Symbol, Bid: value.Bid, Ask: value.Ask, High: value.High, Low: value.Low, Direction: value.Direction, Last: value.Last }) .then((response) => { // callback callback(true) }) .catch((error) => { // callback callback(error) }) Firebase Db first references the path: Firebase.database().ref('latestMarkets/'+value.Symbol) And then the action you want to do: // insert/overwrite (promise) Firebase.database().ref('latestMarkets/'+value.Symbol).set({}).then((result)) // get data once (promise) Firebase.database().ref('latestMarkets/'+value.Symbol).once('value').then((snapshot)) // listen to db path, get data on change (callback) Firebase.database().ref('latestMarkets/'+value.Symbol).on('value', ((snapshot) => {}) // ...... Here is the Github repository: Displaying the data in a React-Native app This code below will listen to a database path, on data change, all connected devices will synchronise the data: Firebase.database().ref('latestMarkets/').on('value', snapshot => { // do something with snapshot.val() }) To close the listener, or unsubscribe the path, one can use "off": Firebase.database().ref('latestMarkets/').off() I’ve created an example react-native app to display the data: The Github repository Conclusion In mobile app development, one big question is: "What database and cache solution can I use to provide online and offline capabilities?" One way to look at this question is like you are starting a project from scratch. If so, you can fit your data into Firebase, and then this would be a great solution for you. Additionally, you can use it for both web and mobile apps. The great thing is that you don't need to write a particular API, and you can access data straight from JavaScript. On the other hand, if you have a project that uses MySQL for example, the Firebase real-time database won't help you much. You would need to have a remote API to connect to your database in this case. But even if using the Firebase database isn't a good fit for your project, there are still other features, such as Firebase Storage or Cloud Messaging, which are very easy to use, and even though they are beyond the scope of this post, they are worth checking out. About the author Oliver Blumanski is a developer based out of Townsville, Australia. He has been a software developer since 2000, and can be found on GitHub at @blumanski.
Read more
  • 0
  • 0
  • 16942
Unlock access to the largest independent learning library in Tech for FREE!
Get unlimited access to 7500+ expert-authored eBooks and video courses covering every tech area you can think of.
Renews at €18.99/month. Cancel anytime
article-image-sequence-generator-transformation-informatica-powercenter
Savia Lobo
12 Dec 2017
7 min read
Save for later

How to integrate Sequence Generator transformation in Informatica PowerCenter 10.x

Savia Lobo
12 Dec 2017
7 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book by Rahul Malewar titled Learning Informatica PowerCenter 10.x. This book will guide you through core functionalities offered by Informatica PowerCenter version 10.x. [/box] Our article covers how sequence generators are used to generate primary key values or a sequence of unique numbers. It also showcases the ports associated and the properties of sequence generators. A sequence Generator transformation is used to generate a sequence of unique numbers. Based on the property defined in the Sequence Generator transformation, the unique values are generated. A sample mapping showing the Sequence Generator transformation in shown in the following screenshot: As you can notice in the mapping, the Sequence Generator transformation does not have any input port. You need to define the Start value, Increment by value, and End value in the properties. Based on properties, the sequence generator generates the value. In the preceding mapping, as soon as the first record enters the target from the Source Qualifier transformation, NEXTVAL generates its first value and so on for other records. The sequence Generator is built for generating numbers. Ports of Sequence Generator transformation Sequence Generator transformation has only two ports, namely NEXTVAL and CURRVAL. Both the ports are output ports. You cannot add or delete any port in the Sequence Generator. It is recommended that you always use the NEXTVAL port first. If the NEXTVAL port is utilized, then use the CURRVAL port. You can define the value of CURRVAL in the properties of the Sequence Generator transformation. Consider a scenario where we are passing two records to transformation; the following events occur inside the Sequence Generator transformation. Also, note that in our case, we have defined the Start value as 0, the Increment by value as 1, and the End value is default in the property. Also, the current value defined in properties is 1. The following is the sequence of events: When the first record enters the target from the Filter transformation, the current value, which is set to 1 in the properties of the Sequence Generator, is assigned to the NEXTVAL port. This gets loaded into the target by the connected link. So, for the first record, SEQUENCE_NO in the target gets the value as 1. The Sequence Generator increments CURRVAL internally and assigns that value to the current value, which is 2 in this case. When the second record enters the target, the current value, which is set as 2, now gets assigned to NEXTVAL. The Sequence Generator increments CURRVAL internally to make it 3. So at the end of the processing of record 2, the NEXTVAL port will have value 2 and the CURRVAL port will have value set as 3. This is how the cycle keeps on running till you reach the end of the records from the source. It is a little confusing to understand how the NEXTVAL and CURRVAL ports are behaving, but after reading the preceding example, you will have a proper understanding of the process. Properties of Sequence Generator transformation There are multiple values that you need to define inside the Sequence Generator transformation. Double-click on the Sequence Generator, and click on the Properties tab as shown in the following screenshot: Let's discuss the properties in detail: Start Value:This comes into the picture only if you select the Cycle option in the properties. It indicates the integration service to start over from this value as soon as the end value is reached when you have checked the cycle option. The default value is 0, and the maximum value is 9223372036854775806. Increment By: This is the value by which you wish to increment the consecutive numbers from the NEXTVAL port. The default value is 1, and the maximum value is 2147483647. End Value: This is the maximum value Integration service can generate. If the Sequence Generator reaches the end value and if it is not configured for the cycle, the session will fail, giving the error as data overflow. The maximum value is 9223372036854775807. Current Value: This indicates the value assigned to the CURRVAL port. Specify the current value that you wish to have for the first record. As mentioned in the preceding section, the CURRVAL port gets assigned to NEXTVAL, and the CURRVAL port is incremented. The CURRVAL port stores the value after the session is over, and when you run the session the next time, it starts incrementing the value from the stored value if you have not checked the reset option. If you check the reset option, Integration service resets the value to 1. Suppose you have not checked the Reset option and you have passed 17 records at the end of the session, the current value will be set to 18, which will be stored internally. When you run the session the next time, it starts generating the value from 18. The maximum value is 9223372036854775807. Cycle:If you check this option, Integration services cycles through the sequence defined. If you do not check this option, the process stops at the defined End Value. If your source records are more than the End value defined, the session will fail with the overflow error. Number of Cached Values: This option indicates how many sequential values Integration service can cache at a time. This option is useful only when you are using reusable sequence generator transformation. The default value for non-reusable transformation is 0. The default value for reusable transformation is 1,000. The maximum value is 9223372036854775807. Reset:If you do not check this option, the Integration services store the value of the previous run and generate the value from the stored value. Else, the Integration will reset to the defined current value and generate values from the initial value defined. This property is disabled for reusable Sequence Generator transformation. Tracing Level:This indicates the level of detail you wish to write into the session log. We will discuss this option in detail later in the chapter. With this, we have seen all the properties of the Sequence Generator transformation. Let's talk about the usage of Sequence Generator transformation: Generating Primary/Foreign Key:Sequence Generator can be used to generate primary key and foreign key. Primary key and foreign key should be unique and not null; the Sequence Generator transformation can easily do this as seen in the preceding section. Connect the NEXTVAL port to both the targets for which you wish to generate the primary and foreign keys as shown in the following screenshot: Replace missing values: You can use the Sequence Generator transformation to replace missing values by using the IIF and ISNULL functions. Consider you have some data with JOB_ID of Employee. Some records do not have JOB_ID in the table. Use the following function to replace those missing values: IIF( ISNULL (JOB_ID), NEXTVAL, JOB_ID) The preceding function interprets if the JOB_ID is null, then assign NEXTVAL, else keep JOB_ID as it is. The following screenshot indicates the preceding requirement: With this, you have learned all the options present in the Sequence Generator transformation. If you liked the above article, checkout our book Learning Informatica PowerCenter 10.x to explore Informatica PowerCenter’s other powerful features such as working on sources, targets, performance optimization, scheduling, deploying for processing, and managing your data at speed.  
Read more
  • 0
  • 0
  • 16929

article-image-building-recommendation-system-with-scala-and-apache-spark-tutorial
Savia Lobo
08 Sep 2018
12 min read
Save for later

Building Recommendation System with Scala and Apache Spark [Tutorial]

Savia Lobo
08 Sep 2018
12 min read
Recommendation systems can be defined as software applications that draw out and learn from data such as preferences, their actions (clicks, for example), browsing history, and generated recommendations, which are products that the system determines are appealing to the user in the immediate future. In this tutorial, we will learn to build a recommendation system with Scala and Apache Spark. This article is an excerpt taken from Modern Scala Projects written Ilango Gurusamy. What does a recommendation system look like The following diagram is representative of a typical recommendation system: Recommendation system In the preceding diagram, can be thought of as a recommendation ecosystem, where the recommendation system is at the heart of it. This system needs three entities: Users Products Transactions between users and products where transactions contain feedback from users about products Implementation and deployment Implementation is documented in the following subsections. All code is developed in an Intellij code editor. The very first step is to create an empty Scala project called Chapter7. Step 1 – creating the Scala project Let's create a Scala project called Chapter7 with the following artifacts: RecommendationSystem.scala RecommendationWrapper.scala Let's break down the project's structure: .idea: Generated IntelliJ configuration files. project: Contains build.properties and plugins.sbt. project/assembly.sbt: This file specifies the sbt-assembly plugin needed to build a fat JAR for deployment. src/main/scala: This is a folder that houses Scala source files in the com.packt.modern.chapter7 package. target: This is where artifacts of the compile process are stored. The generated assembly JAR file goes here. build.sbt: This is the main SBT configuration file. Spark and its dependencies are specified here. At this point, we will start developing code in the IntelliJ code editor. We will start with the AirlineWrapper Scala file and end with the deployment of the final application JAR into Spark with spark-submit. Step 2 – creating the AirlineWrapper definition Let's create the trait definition. The trait will hold the SparkSession variable, schema definitions for the datasets, and methods to build a dataframe: trait RecWrapper { } Next, let's create a schema for past weapon sales orders. Step 3 – creating a weapon sales orders schema Let's create a schema for the past sales order dataset: val salesOrderSchema: StructType = StructType(Array( StructField("sCustomerId", IntegerType,false), StructField("sCustomerName", StringType,false), StructField("sItemId", IntegerType,true), StructField("sItemName", StringType,true), StructField("sItemUnitPrice",DoubleType,true), StructField("sOrderSize", DoubleType,true), StructField("sAmountPaid", DoubleType,true) )) Next, let's create a schema for weapon sales leads. Step 4 – creating a weapon sales leads schema Here is a schema definition for the weapon sales lead dataset: val salesLeadSchema: StructType = StructType(Array( StructField("sCustomerId", IntegerType,false), StructField("sCustomerName", StringType,false), StructField("sItemId", IntegerType,true), StructField("sItemName", StringType,true) )) Next, let's build a weapon sales order dataframe. Step 5 – building a weapon sales order dataframe Let's invoke the read method on our SparkSession instance and cache it. We will call this method later from the RecSystem object: def buildSalesOrders(dataSet: String): DataFrame = { session.read .format("com.databricks.spark.csv") .option("header", true).schema(salesOrderSchema).option("nullValue", "") .option("treatEmptyValuesAsNulls", "true") .load(dataSet).cache() } Next up, let's build a sales leads dataframe: def buildSalesLeads(dataSet: String): DataFrame = { session.read .format("com.databricks.spark.csv") .option("header", true).schema(salesLeadSchema).option("nullValue", "") .option("treatEmptyValuesAsNulls", "true") .load(dataSet).cache() } This completes the trait. Overall, it looks like this: trait RecWrapper { 1) Create a lazy SparkSession instance and call it session. 2) Create a schema for the past sales orders dataset 3) Create a schema for sales lead dataset 4) Write a method to create a dataframe that holds past sales order data. This method takes in sales order dataset and returns a dataframe 5) Write a method to create a dataframe that holds lead sales data } Bring in the following imports: import org.apache.spark.mllib.recommendation.{ALS, Rating} import org.apache.spark.rdd.RDD import org.apache.spark.sql.{DataFrame, Dataset, SparkSession} Create a Scala object called RecSystem: object RecSystem extends App with RecWrapper { } Before going any further, bring in the following imports: import org.apache.spark.rdd.RDD import org.apache.spark.sql.DataFrame Inside this object, start by loading the past sales order data. This will be our training data. Load the sales order dataset, as follows: val salesOrdersDf = buildSalesOrders("sales\\PastWeaponSalesOrders.csv") Verify the schema. This is what the schema looks like: salesOrdersDf.printSchema() root |-- sCustomerId: integer (nullable = true) |-- sCustomerName: string (nullable = true) |-- sItemId: integer (nullable = true) |-- sItemName: string (nullable = true) |-- sItemUnitPrice: double (nullable = true) |-- sOrderSize: double (nullable = true) |-- sAmountPaid: double (nullable = true) Here is a partial view of a dataframe displaying past weapon sales order data: Partial view of dataframe displaying past weapon sales order data Now, we have what we need to create a dataframe of ratings: val ratingsDf: DataFrame = salesOrdersDf.map( salesOrder => Rating( salesOrder.getInt(0), salesOrder.getInt(2), salesOrder.getDouble(6) ) ).toDF("user", "item", "rating") Save all and compile the project at the command line: C:\Path\To\Your\Project\Chapter7>sbt compile You are likely to run into the following error: [error] C:\Path\To\Your\Project\Chapter7\src\main\scala\com\packt\modern\chapter7\RecSystem.scala:50:50: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases. [error] val ratingsDf: DataFrame = salesOrdersDf.map( salesOrder => [error] ^ [error] two errors found [error] (compile:compileIncremental) Compilation failed To fix this, place the following statement at the top of the declarations of the rating dataframe. It should look like this: import session.implicits._ val ratingsDf: DataFrame = salesOrdersDf.map( salesOrder => UserRating( salesOrder.getInt(0), salesOrder.getInt(2), salesOrder.getDouble(6) ) ).toDF("user", "item", "rating") Save and recompile the project. This time, it compiles just fine. Next, import the Rating class from the org.apache.spark.mllib.recommendation package. This transforms the rating dataframe that we obtained previously to its RDD equivalent: val ratings: RDD[Rating] = ratingsDf.rdd.map( row => Rating( row.getInt(0), row.getInt(1), row.getDouble(2) ) ) println("Ratings RDD is: " + ratings.take(10).mkString(" ") ) The following few lines of code are very important. We will be using the ALS algorithm from Spark MLlib to create and train a MatrixFactorizationModel, which takes an RDD[Rating] object as input. The ALS train method may require a combination of the following training hyperparameters: numBlocks: Preset to -1 in an auto-configuration setting. This parameter is meant to parallelize computation. custRank: The number of features, otherwise known as latent factors. iterations: This parameter represents the number of iterations for ALS to execute. For a reasonable solution to converge on, this algorithm needs roughly 20 iterations or less. regParam: The regularization parameter. implicitPrefs: This hyperparameter is a specifier. It lets us use either of the following: Explicit feedback Implicit feedback alpha: This is a hyperparameter connected to an implicit feedback variant of the ALS algorithm. Its role is to govern the baseline confidence in preference observations. We just explained the role played by each parameter needed by the ALS algorithm's train method. Let's get started by bringing in the following imports: import org.apache.spark.mllib.recommendation.MatrixFactorizationModel Now, let's get down to training the matrix factorization model using the ALS algorithm. Let's train a matrix factorization model given an RDD of ratings by customers (users) for certain items (products). Our train method on the ALS algorithm will take the following four parameters: Ratings. A rank. A number of iterations. A Lambda value or regularization parameter: val ratingsModel: MatrixFactorizationModel = ALS.train(ratings, 6, /* THE RANK */ 10, /* Number of iterations */ 15.0 /* Lambda, or regularization parameter */ ) Next, we load the sales lead file and convert it into a tuple format: val weaponSalesLeadDf = buildSalesLeads("sales\\ItemSalesLeads.csv") In the next section, we will display the new weapon sales lead dataframe. Step 6 – displaying the weapons sales dataframe First, we must invoke the show method: println("Weapons Sales Lead dataframe is: ") weaponSalesLeadDf.show   Here is a view of the weapon sales lead dataframe: View of weapon sales lead dataframe Next, create a version of the sales lead dataframe structured as (customer, item) tuples: val customerWeaponsSystemPairDf: DataFrame = weaponSalesLeadDf.map(salesLead => ( salesLead.getInt(0), salesLead.getInt(2) )).toDF("user","item") In the next section, let's display the dataframe that we just created. Step 7 – displaying the customer-weapons-system dataframe Let's the show method, as follows: println("The Customer-Weapons System dataframe as tuple pairs looks like: ") customerWeaponsSystemPairDf.show   Here is a screenshot of the new customer-weapons-system dataframe as tuple pairs: New customer-weapons-system dataframe as tuple pairs Next, we will convert the preceding dataframe into an RDD: val customerWeaponsSystemPairRDD: RDD[(Int, Int)] = customerWeaponsSystemDf.rdd.map(row => (row.getInt(0), row.getInt(1)) ) /* Notes: As far as the algorithm is concerned, customer corresponds to "user" and "product" or item corresponds to a "weapons system" */ We previously created a MatrixFactorization model that we trained with the weapons system sales orders dataset. We are in a position to predict how each customer country may rate a weapon system in the future. In the next section, we will generate predictions.   Step 8 – generating predictions Here is how we will generate predictions. The predict method of our model is designed to do just that. It will generate a predictions RDD that we call weaponRecs. It represents the ratings of weapons systems that were not rated by customer nations (listed in the past sales order data) previously: val weaponRecs: RDD[Rating] = ratingsModel.predict(customerWeaponsSystemPairRDD).distinct() Next up, we will display the final predictions. Step 9 – displaying predictions Here is how to display the predictions, lined up in tabular format: println("Future ratings are: " + weaponRecs.foreach(rating => { println( "Customer: " + rating.user + " Product: " + rating.product + " Rating: " + rating.rating ) } ) ) The following table displays how each nation is expected to rate a certain system in the future, that is, a weapon system that they did not rate earlier: System rating by each nation Our recommendation system proved itself capable of generating future predictions. Up until now, we did not say how all of the preceding code is compiled and deployed. We will look at this in the next section. Compilation and deployment Compiling the project Invoke the sbt compile project at the root folder of your Chapter7 project. You should get the following output: Output on compiling the project Besides loading build.sbt, the compile task is also loading settings from assembly.sbt which we will create below. What is an assembly.sbt file? We have not yet talked about the assembly.sbt file. Our scala-based Spark application is a Spark job that will be submitted to a (local) Spark cluster as a JAR file. This file, apart from Spark libraries, also needs other dependencies in it for our recommendation system job to successfully complete. The name fat JAR is from all dependencies bundled in one JAR. To build such a fat JAR, we need an sbt-assembly plugin. This explains the need for creating a new assembly.sbt and the assembly plugin. Creating assembly.sbt Create a new assembly.sbt in your IntelliJ project view and save it under your project folder, as follows: Creating assembly.sbt Contents of assembly.sbt Paste the following contents into the newly created assembly.sbt (under the project folder). The output should look like this: Output on placing contents of assembly.sbt The sbt-assembly plugin, version 0.14.7, gives us the ability to run an sbt-assembly task. With that, we are one step closer to building a fat or Uber JAR. This action is documented in the next step. Running the sbt assembly task Issue the sbt assembly command, as follows: Running the sbt assembly command This time, the assembly task loads the assembly-plugin in assembly.sbt. However, further assembly halts because of a common duplicate error. This error arises due to several duplicates, multiple copies of dependency files that need removal before the assembly task can successfully complete. To address this situation, build.sbt needs an upgrade. Upgrading the build.sbt file The following lines of code need to be added in, as follows: Code lines for upgrading the build.sbt file To test the effect of your changes, save this and go to the command line to reissue the sbt assembly task. Rerunning the assembly command Run the assembly task, as follows: Rerunning the assembly task This time, the settings in the assembly.sbt file are loaded. The task completes successfully. To verify, drill down to the target folder. If everything went well, you should see a fat JAR, as follows: Output as a JAR file Our JAR file under the target folder is the recommendation system application's JAR file that needs to be deployed into Spark. This is documented in the next step. Deploying the recommendation application The spark-submit command is how we will deploy the application into Spark. Here are two formats for the spark-submit command. The first one is a long one which sets more parameters than the second one: spark-submit --class "com.packt.modern.chapter7.RecSystem" --master local[2] --deploy-mode client --driver-memory 16g -num-executors 2 --executor-memory 2g --executor-cores 2 <path-to-jar> Leaning on the preceding format, let's submit our Spark job, supplying various parameters to it: Parameters for Spark The different parameters are explained as follows:    Tabular explanation of parameters for Spark Job We used Spark's support for recommendations to build a prediction model that generated recommendations and leveraged Spark's alternating least squares algorithm to implement our collaborative filtering recommendation system. If you've enjoyed reading this post, do check out the book  Modern Scala Projects to gain insights into data that will help organizations have a strategic and competitive advantage. How to Build a music recommendation system with PageRank Algorithm Recommendation Systems Building A Recommendation System with Azure
Read more
  • 0
  • 0
  • 16905

article-image-data-bindings-with-knockout-js
Vijin Boricha
23 Apr 2018
7 min read
Save for later

Data bindings with Knockout.js

Vijin Boricha
23 Apr 2018
7 min read
Today, we will learn about three data binding abilities of Knockout.js. Data bindings are attributes added by the framework for the purpose of data access between elements and view scope. While Observable arrays are efficient in accessing the list of objects with the number of operations on top of the display of the list using the foreach function, Knockout.js has provided three additional data binding abilities: Control-flow bindings Appearance bindings Interactive bindings Let us review these data bindings in detail in the following sections. Control-flow bindings As the name suggests, control-flow bindings help us access the data elements based on a certain condition. The if, if-not, and with are the control-flow bindings available from the Knockout.js. In the following example, we will be using if and with control-flow bindings. We have added a new attribute to the Employee object called age; we are displaying the age value in green only if it is greater than 20. Similarly, we have added another markedEmployee. With control-flow binding, we can limit the scope of access to that specific employee object in the following paragraph. Add the following code snippet to index.html and run the program to see the if and with control-flow bindings working: <!DOCTYPE html> <html> <head> <title>Knockout JS</title> </head> <body> <h1>Welcome to Knockout JS programming</h1> <table border="1" > <tr > <th colspan="2" style="padding:10px;"> <b>Employee Data - Organization : <span style="color:red" data-bind='text: organizationName'> </span> </b> </th> </tr> <tr> <td style="padding:10px;">Employee First Name:</td> <td style="padding:10px;"> <span data-bind='text: empFirstName'></span> </td> </tr> <tr> <td style="padding:10px;">Employee Last Name:</td> <td style="padding:10px;"> <span data-bind='text: empLastName'></span> </td> </tr> </table> <p>Organization Full Name : <span style="color:red" data-bind='text: orgFullName'></span> </p> <!-- Observable Arrays--> <h2>Observable Array Example : </h2> <table border="1"> <thead><tr> <th style="padding:10px;">First Name</th> <th style="padding:10px;">Last Name</th> <th style="padding:10px;">Age</th> </tr></thead> <tbody data-bind='foreach: organization'> <tr> <td style="padding:10px;" data-bind='text: firstName'></td> <td style="padding:10px;" data-bind='text: lastName'></td> <td data-bind="if: age() > 20" style="color: green;padding:10px;"> <span data-bind='text:age'></span> </td> </tr> </tbody> </table> <!-- with control flow bindings --> <p data-bind='with: markedEmployee'> Employee <strong data-bind="text: firstName() + ', ' + lastName()"> </strong> is marked with the age <strong data-bind='text: age'> </strong> </p> <h2>Add New Employee to Observable Array</h2> First Name : <input data-bind="value: newFirstName" /> Last Name : <input data-bind="value: newLastName" /> Age : <input data-bind="value: newEmpAge" /> <button data-bind='click: addEmployee'>Add Employee</button> <!-- JavaScript resources --> <script type='text/javascript' src='js/knockout-3.4.2.js'></script> <script type='text/javascript'> function Employee (firstName, lastName,age) { this.firstName = ko.observable(firstName); this.lastName = ko.observable(lastName); this.age = ko.observable(age); }; this.addEmployee = function() { this.organization.push(new Employee (employeeViewModel.newFirstName(), employeeViewModel.newLastName(), employeeViewModel.newEmpAge())); }; var employeeViewModel = { empFirstName: "Tony", empLastName: "Henry", //Observable organizationName: ko.observable("Sun"), newFirstName: ko.observable(""), newLastName: ko.observable(""), newEmpAge: ko.observable(""), //With control flow object markedEmployee: ko.observable(new Employee("Garry", "Parks", "65")), //Observable Arrays organization : ko.observableArray([ new Employee("John", "Kennedy", "24"), new Employee("Peter", "Hennes","18"), new Employee("Richmond", "Smith","54") ]) }; //Computed Observable employeeViewModel.orgFullName = ko.computed(function() { return employeeViewModel.organizationName() + " Limited"; }); ko.applyBindings(employeeViewModel); employeeViewModel.organizationName("Oracle"); </script> </body> </html> Run the preceding program to see the if control-flow acting on the Age field, and the with control-flow showing a marked employee record with age 65: Appearance bindings Appearance bindings deal with displaying the data from binding elements on view components in formats such as text and HTML, and applying styles with the help of a set of six bindings, as follows: Text: <value>—Sets the value to an element. Example: <td data-bind='text: name'></td> HTML: <value>—Sets the HTML value to an element. Example: //JavaScript: function Employee(firstname, lastname, age) { ... this.formattedName = ko.computed(function() { return "<strong>" + this.firstname() + "</strong>"; }, this); } //Html: <span data-bind='html: markedEmployee().formattedName'></span> Visible: <condition>—An element can be shown or hidden based on the condition. Example: <td data-bind='visible: age() > 20' style='color: green'> span data-bind='text:age'> CSS: <object>—An element can be associated with a CSS class. Example: //CSS: .strongEmployee { font-weight: bold; } //HTML: <span data-bind='text: formattedName, css: {strongEmployee}'> </span> Style: <object>—Associates an inline style to the element. Example: <span data-bind='text: age, style: {color: age() > 20 ? "green" :"red"}'> </span> Attr: <object>—Defines an attribute for the element. Example: <p><a data-bind='attr: {href: featuredEmployee().populatelink}'> View Employee</a></p> Interactive bindings Interactive bindings help the user interact with the form elements to be associated with corresponding viewmodel methods or events to be triggered in the pages. Knockout JS supports the following interactive bindings: Click: <method>—An element click invokes a ViewModel method. Example: <button data-bind='click: addEmployee'>Submit</button> Value:<property>—Associates the form element value to the ViewModel attribute. Example: <td>Age: <input data-bind='value: age' /></td> Event: <object>—With an user-initiated event, it invokes a method. Example: <p data-bind='event: {mouseover: showEmployee, mouseout: hideEmployee}'> Age: <input data-bind='value: Age' /> </p> Submit: <method>—With a form submit event, it can invoke a method. Example: <form data-bind="submit: addEmployee"> <!—Employee form fields --> <button type="submit">Submit</button> </form> Enable: <property>—Conditionally enables the form elements. Example: last name field is enabled only after adding first name field. Disable: <property>—Conditionally disables the form elements. Example: last name field is disabled after adding first name: <p>Last Name: <input data-bind='value: lastName, disable: firstName' /> </p> Checked: <property>—Associates a checkbox or radio element to the ViewModel attribute. Example: <p>Gender: <input data-bind='checked:gender' type='checkbox' /></p> Options: <array>—Defines a ViewModel array for the<select> element. Example: //Javascript: this.designations = ko.observableArray(['manager', 'administrator']); //Html: Designation: <select data-bind='options: designations'></select> selectedOptions: <array>—Defines the active/selected element from the <select> element. Example: Designation: <select data-bind='options: designations, optionsText:"Select", selectedOptions:defaultDesignation'> </select> hasfocus: <property>—Associates the focus attribute to the element. Example: First Name: <input data-bind='value: firstName, hasfocus: firstNameHasFocus' /> We learned about data binding abilities of Knockout.js. You can know more about external data access and Hybrid Mobile Application Development from the book Oracle JET for Developers. Read More Text and appearance bindings and form field bindings Getting to know KnockoutJS Templates    
Read more
  • 0
  • 0
  • 16888

article-image-installing-and-configuring-network-monitoring-software
Packt
02 Jun 2015
9 min read
Save for later

Installing and Configuring Network Monitoring Software

Packt
02 Jun 2015
9 min read
This article written by Bill Pretty, Glenn Vander Veer, authors of the book Building Networks and Servers Using BeagleBone will serve as an installation guide for the software that will be used to monitor the traffic on your local network. These utilities can help determine which devices on your network are hogging the bandwidth, which slows down the network for other devices on your network. Here are the topics that we are going to cover: Installing traceroute and My Trace Route (MTR or Matt's Traceroute): These utilities will give you a real-time view of the connection between one node and another Installing Nmap: This utility is a network scanner that can list all the hosts on your network and all the services available on those hosts Installing iptraf-ng: This utility gathers various network traffic information and statistics (For more resources related to this topic, see here.) Installing Traceroute Traceroute is a tool that can show the path from one node on a network to another. This can help determine the ideal placement of a router to maximize wireless bandwidth in order to stream music and videos from the BeagleBone server to remote devices. Traceroute can be installed with the following command: apt-get install traceroute   Once Traceroute is installed, it can be run to find the path from the BeagleBone to any server anywhere in the world. For example, here's the route from my BeagelBone to the Canadian Google servers: Now, it is time to decipher all the information that is presented. This first command line tells traceroute the parameters that it must use: traceroute to google.ca (74.125.225.23), 30 hops max, 60 byte packets This gives the hostname, the IP address returned by the DNS server, the maximum number of hops to be taken, and the size of the data packet to be sent. The maximum number of hops can be changed with the –m flag and can be up to 255. In the context of this book, this will not have to be changed. After the first line, the next few lines show the trip from the BeagleBone, through the intermediate hosts (or hops), to the Google.ca server. Each line follows the following format: hop_number host_name (host IP_address) packet_round_trip_times From the command that was run previously (specifically hop number 4): 2 10.149.206.1 (10.149.206.1) 15.335 ms 17.319 ms 17.232 ms Here's a breakdown of the output: The hop number 2: This is a count of the number of hosts between this host and the originating host. The higher the number, the greater is the number of computers that the traffic has to go through to reach its destination. 10.149.206.1: This denotes the hostname. This is the result of a reverse DNS lookup on the IP address. If no information is returned from the DNS query (as in this case), the IP address of the host is given instead. (10.149.206.1): This is the actual host IP address. Various numbers: This is the round-trip time for a packet to go from the BeagleBone to the server and back again. These numbers will vary depending on network traffic, and lower is better. Sometimes, the traceroute will return some asterisks (*). This indicates that the packet has not been acknowledged by the host. If there are consecutive asterisks and the final destination is not reached, then there may be a routing problem. In a local network trace, it most likely is a firewall that is blocking the data packet. Installing My Traceroute My Traceroute (MTR) is an extension of traceroute, which probes the routers on the path from the packet source and destination, and keeps track of the response times of the hops. It does this repeatedly so that the response times can be averaged. Now, install mtr with the following command: sudo apt-get install mtr After it is run, mtr will provide quite a bit more information to look at, which would look like the following: While the output may look similar, the big advantage over traceroute is that the output is constantly updated. This allows you to accumulate trends and averages and also see how network performance varies over time. When using traceroute, there is a possibility that the packets that were sent to each hop happened to make the trip without incident, even in a situation where the route is suffering from intermittent packet loss. The mtr utility allows you to monitor this by gathering data over a wider range of time. Here's an mtr trace from my Beaglebone to my Android smartphone: Here's another trace, after I changed the orientation of the antennae of my router: As you can see, the original orientation was almost 100 milliseconds faster for ping traffic. Installing Nmap Nmap is designed to allow the scanning of networks in order to determine which hosts are up and what services are they offering. Nmap supports a large number of scanning options, which are overkill for what will be done in this book. Nmap is installed with the following command: sudo apt-get install nmap Answer Yes to install nmap and its dependent packages. Using Nmap After it is installed, run the following command to see all the hosts that are currently on the network: nmap –T4 –F <your_local_ip_range> The option -T4 sets the timing template to be used, and the -F option is for fast scanning. There are other options that can be used and found via the nmap manpage. Here, your_local_ip_range is within the range of addresses assigned by your router. Here's a node scan of my local network. If you have a lot of devices on your local network, this command may take a long time to complete. Now, I know that I have more nodes on my network, but they don't show up. This is because the command we ran didn't tell nmap to explicitly query each IP address to see whether the host responds but to query common ports that may be open to traffic. Instead, only use the -Pn option in the command to tell nmap to scan all the ports for every address in the range. This will scan more ports on each address to determine whether the host is active or not. Here, we can see that there are definitely more hosts registered in the router device table. This scan will attempt to scan a host IP address even if the device is powered off. Resetting the router and running the same scan will scan the same address range, but it will not return any device names for devices that are not powered at the time of the scan. You will notice that after scanning, nmap reports that some IP addresses' ports are closed and some are filtered. Closed ports are usually maintained on the addresses of devices that are locked down by their firewall. Filtered ports are on the addresses that will be handled by the router because there actually isn't a node assigned to these addresses. Here's a part of the output from an nmap scan of my Windows machine: Here's a part of the output of a scan of the BeagleBone: Installing iptraf-ng Iptraf-ng is a utility that monitors traffic on any of the interfaces or IP addresses on your network via custom filters. Because iptraf-ng is based on the ncurses libraries, we will have to install them first before downloading and compiling the actual iptraf-ng package. To install ncurses, run the following command: sudo apt-get install libncurses5-dev Here's how you will install ncurses and its dependent packages: Once ncurses is installed, download and extract the iptraf-ng tarball so that it can be built. At the time of writing this book, iptrf-ng's version 1.1.4 was available. This will change over time, and a quick search on Google will give you the latest and greatest version to download. You can download this version with the following command: wget https://fedorahosted.org/releases/i/p/iptraf-ng/iptraf-ng- <current_version_number>.tar.gz The following screenshot shows how to download the iptraf-ng tarball: After we have completed the downloading, extract the tarball using the following command: tar –xzf iptraf-ng-<current_version_number>.tar.gz Navigate to the iptraf-ng directory created by the tar command and issue the following commands: ./configure make sudo make install After these commands are complete, iptraf-ng is ready to run, using the following command: sudo iptraf-ng When the program starts, you will be presented with the following screen: Configuring iptraf-ng As an example, we are going to monitor all incoming traffic to the BeagleBone. In order to do this, iptraf-ng should be configured. Selecting the Configure... menu item will show you the following screen: Here, settings can be changed by highlighting an option in the left-hand side window and pressing Enter to select a new value, which will be shown in the Current Settings window. In this case, I have enabled all the options except Logging. Exit the configuration screen and enter the Filter Status screen. This is where we will set up the filter to only monitor traffic coming to the BeagleBone and from it. Then, the following screen will be presented: Selecting IP... will create an IP filter, and the following subscreen will pop up: Selecting Define new filter... will allow the creation and saving of a filter that will only display traffic for the IP address and the IP protocols that are selected, as shown in the following screenshot: Here, I have put in the BeagleBone's IP address, and to match all IP protocols. Once saved, return to the main menu and select IP traffic monitor. Here, you will be able to select the network interfaces to be monitored. Because my BeagleBone is connected to my wired network, I have selected eth0. The following screenshot should shows us the options: If all went well with your filter, you should see traffic to your BeagleBone and from it. Here are the entries for my PuTTy session; 192.168.17.2 is my Windows 8 machine, and 192.168.17.15 is my BeagleBone: Here's an image of the traffic generated by browsing the DLNA server from the Windows Explorer: Moreover, here's the traffic from my Android smartphone running a DLNA player, browsing the shared directories that were set up: Summary In this article, you saw how to install and configure the software that will be used to monitor the traffic on your local network. With these programs and a bit of experience, you can determine which devices on your network are hogging the bandwidth and find out whether you have any unauthorized users. Resources for Article: Further resources on this subject: Learning BeagleBone [article] Protecting GPG Keys in BeagleBone [article] Home Security by BeagleBone [article]
Read more
  • 0
  • 0
  • 16881
article-image-facebook-fails-to-block-ecj-data-security-case-from-proceeding
Guest Contributor
19 Jun 2019
6 min read
Save for later

Facebook fails to block ECJ data security case from proceeding

Guest Contributor
19 Jun 2019
6 min read
This July, the European Court of Justice (ECJ) in Luxembourg will now hear a case to answer questions on whether the American government's surveillance, Privacy Shield and Standard Contract Clauses, during EU-US data transfers, provides adequate protection of EU citizen's personal information. The ECJ set the case hearing after the supreme court of Ireland — where Facebook's international headquarters is located — decided, on Friday, May 31st, 2019, to dismiss an appeal by Facebook to block the data security case from progressing to the ECJ. The Austrian Supreme Court has also recently rejected Facebook’s bid to stop a similar case. If Europe's Court of Justice makes a ruling against the current legal arrangements, this would majorly impact thousands of companies, which make millions of data transfers every day. Companies potentially affected, include human resources databases, storage of internet browsing histories and credit card companies. Background on this case The case started with the Austrian privacy lawyer and campaigner, Max Schrems. In 2013, Schrems made a complaint regarding concerns that US surveillance programs like the PRISM system were accessing the data of European Facebook users, as whistleblower Edward Snowden described. His concerns also dealt with Facebook’s use of a separate data transfer mechanism — Standard Contractual Clauses (SCCs). Around the time Snowden disclosed about the US government's mass surveillance programs, Schrems also challenged the legality of the prior EU-US data transfer arrangement, Safe Harbor, eventually bringing it down. After Schrems stated that the transfer of his data by Facebook to the US infringed upon his rights as an EU citizen, Ireland's High Court ruled, in 2017, that the US government partook in "mass indiscriminate processing of data" and deferred concerns to the European Court of Justice. Then, in October of last year, the High Court referred this case to the ECJ based on the Data Protection Commissioner's "well-founded" concerns about whether or not US law provided adequate protection for EU citizens' data privacy rights. Within all of this, there also exist people questioning the compatibility between US law which focuses on national security and EU law which aims for personal privacy. Whistleblowers like Edward Snowden played a role in what has lead up to this case, and whistleblower attorneys and paraprofessionals continue working to expose fraud against the government through means of the False Claims Acts (FCA). Why Facebook appealed the case Although Irish law doesn't require an appeal against CJEU referrals, Facebook chose to stay and appeal the decision anyway, aiming to keep it from progressing to court. The court denied them the stay but granted them leave to appeal last year. Keep in mind that Facebook was already under a lot of scrutiny after playing a part in the Cambridge Analytica data scandal, which showed that up to 87 million users faced having their data compromised by Cambridge Analytica. One of the reasons Facebook said it wanted to block this case from progressing was that the High Court failed to regard the 'Privacy Shield' decision. Under the Privacy Shield decision, the European Commission had approved the use of certain EU-US data transfer channels. Another main issue here was whether Facebook actually had the legal rights to appeal a referral to the ECJ. Privacy Shield is also in question by French digital rights groups who claim it disrupts fundamental EU rights and will be heard by the General Court of the EU in July. Why the appeal was dismissed The five-judge High Court, headed by the Chief Justice Frank Clarke, decided they cannot entertain an appeal over the referral decision itself. In addition, he said Facebook’s criticisms related to the “proper characterization” of underlying facts rather than the facts themselves. If there had been any actual finding of facts not sustainable on the evidence before the High Court per Irish procedural law, he would have overturned it, but no such matter had been established on this appeal, he ruled. "Joint Control" and its possible impact on the case In June 2018, after a Facebook fan page was found to have been allowing visitor data to be collected by Facebook via a cookie on the fan page, without informing visitors, The Federal Administrative Court of Germany referred the case to ECJ. This resulted in the ECJ deciding to deem joint responsibility between social media networks and administrators in the processing of visitor data. The ECJ´s ruling, in this case, has consequences not only for Facebook sites but for other situations where more than one company or administrator plays an active role in the data processing. The concept of “joint control” is now on the table, and further decisions of authorities and courts in this area are likely. What's next for data security Currently, Facebook also faces questioning by Ireland's Data Protection Commission over numerous potential infringements of strict European privacy laws that the new General Data Protection Regulation (GDPR) outlines. Facebook, however, already stated it will take the necessary steps to ensure the site operators can comply with the GDPR. There have even been pleas for Global Data Laws. A common misconception exists that only big organizations, governments and businesses are at risk for data security breaches, but this is simply not true. Data security is important for everyone — now more than ever. Your computer, tablet and mobile devices could be affected by attackers for their sensitive information, such as credit card details, banking details and passwords, by way of phishing attacks, malware attacks, ransomware attacks, man-in-the-middle attacks and more. Therefore, bringing continual awareness to these US and global data security issues will enable stricter laws to be put in place. Kayla Matthews writes about big data, cybersecurity and technology. You can find her work on The Week, Information Age, KDnuggets and CloudTweaks, or over at ProductivityBytes.com. Facebook releases Pythia, a deep learning framework for vision and language multimodal research Zuckberg just became the target of the world’s first high profile white hat deepfake op. Can Facebook come out unscathed? US regulators plan to probe Google on anti-trust issues; Facebook, Amazon & Apple also under legal scrutiny
Read more
  • 0
  • 0
  • 16864

article-image-25-startups-machine-learning-differently-2018
Fatema Patrawala
29 Dec 2017
14 min read
Save for later

25 Startups using machine learning differently in 2018: From farming to brewing beer to elder care

Fatema Patrawala
29 Dec 2017
14 min read
What really excites me about data science and by extension machine learning is the sheer number of possibilities! You can think of so many applications off the top of your head: robo-advisors, computerized lawyers, digital medicine, even automating VC decisions when they invest in startups. You can even venture into automation of art and music, algorithms writing papers which are indistinguishable from human-written papers. It's like solving a puzzle, but a puzzle that's meaningful and that has real world implications. The things that we can do today weren’t possible 5 years ago, and this is largely thanks to growth in computational power, data availability, and the adoption of the cloud that made accessing these resources economical for everyone, all key enabling factors for the advancement of Machine learning and AI. Having witnessed the growth of data science as discipline, industries like finance, health-care, education, media & entertainment, insurance, retail as well as energy has left no stone unturned to harness this opportunity. Data science has the capability to offer even more; and we will see the wide range of applications in the future in places haven’t even been explored. In the years to come, we will increasingly see data powered/AI enabled products and services take on roles traditionally handled by humans as they required innately human qualities to successfully perform. In this article we have covered some use cases of Data Science being used differently and start-ups who have practically implemented it: The Nurturer: For elder care The world is aging rather rapidly. According to the World Health Organization, nearly two billion people across the world are expected to be over 60 years old by 2050, a figure that’s more than triple what it was in 2000. In order to adapt to their increasingly aging population, many countries have raised the retirement age, reducing pension benefits, and have started spending more on elderly care. Research institutions in countries like Japan, home to a large elderly population, are focusing their R&D efforts on robots that can perform tasks like lifting and moving chronically ill patients, many startups are working on automating hospital logistics and bringing in virtual assistance. They also offer AI-based virtual assistants to serve as middlemen between nurses and patients, reducing the need for frequent in-hospital visits. Dr Ben Maruthappu, a practising doctor, has brought a change to the world of geriatric care with an AI based app Cera. It is an on-demand platform to aid the elderly in need. The Cera app firmly puts itself in the category of Uber & Amazon, whereby it connects elderly people in need of care with a caregiver in a matter of few hours. The team behind this innovation also plans to use AI to track patients’ health conditions and reduce the number of emergency patients admitted in hospitals. A social companion technology - Elliq created by Intuition Robotics helps older adults stay active and engaged with a proactive social robot that overcomes the digital divide. AliveCor, a leading FDA-cleared mobile heart solution helps save lives, money, and has brought modern healthcare alive into the 21st century. The Teacher: Personalized education platform for lifelong learning With children increasingly using smartphones and tablets and coding becoming a part of national curricula around the world, technology has become an integral part of classrooms. We have already witnessed the rise and impact of education technology especially through a multitude of adaptive learning platforms that allow learners to strengthen their skills and knowledge - CBTs, LMSes, MOOCs and more. And now virtual reality (VR) and artificial intelligence (AI) are gaining traction to provide us with lifelong learning companion that can accompany and support individuals throughout their studies - in and beyond school . An AI based educational platform learns the amount of potential held by each particular student. Based on this data, tailored guidance is provided to fix mistakes and improvise on the weaker areas. A detailed report can be generated by the teachers to help them customise lesson plans to best suit the needs of the student. Take Gruff Davies’ Kwiziq for example. Gruff with his team leverage AI to provide a personalised learning experience for students based on their individual needs. Students registered on the platform get an advantage of an AI based language coach which asks them to solve various micro quizzes. Quiz solutions provided by students are then turned into detailed “brain maps”.  These brain maps are further used to provide tailored instructions and feedback for improvement. Other startup firms like Blippar specialize in Augmented reality for visual and experiential learning. Unelma Platforms, a software platform development company provides state-of-the-art software for higher-education, healthcare and business markets. The Provider: Farming to be more productive, sustainable and advanced Though farming is considered the backbone of many national economies especially in the developing world, there is often an outdated view of it involving a small, family-owned lands where crops are hand harvested. The reality of modern-day farms have had to overhaul operations to meet demand and remain competitively priced while adapting to the ever-changing ways technology is infiltrating all parts of life. Climate change is a serious environmental threat farmers must deal with every season: Strong storms and severe droughts have made farming even more challenging. Additionally lack of agricultural input, water scarcity, over-chemicalization in fertilizers, water & soil pollution or shortage of storage systems has made survival for farmers all the more difficult.   To overcome these challenges, smart farming techniques are the need of an hour for farmers in order to manage resources and sustain in the market. For instance, in a paper published by arXiv, the team explains how they used a technique known as transfer learning to teach the AI how to recognize crop diseases and pest damage.They utilized TensorFlow, to build and train a neural network of their own, which involved showing the AI 2,756 images of cassava leaves from plants in Tanzania. Their efforts were a success, as the AI was able to correctly identify brown leaf spot disease with 98 percent accuracy. WeFarm, SaaS based agritech firm, headquartered in London, aims to bridge the connectivity gap amongst the farmer community. It allows them to send queries related to farming via text message which is then shared online into several languages. The farmer then receives a crowdsourced response from other farmers around the world. In this way, a particular farmer in Kenya can get a solution from someone sitting in Uganda, without having to leave his farm, spend additional money or without accessing the internet. Benson Hill Bio-systems, by Matthew B. Crisp, former President of Agricultural Biotechnology Division, has differentiated itself by bringing the power of Cloud Biology™ to agriculture. It combines cloud computing, big data analytics, and plant biology to inspire innovation in agriculture. At the heart of Benson Hill is CropOS™, a cognitive engine that integrates crop data and analytics with the biological expertise and experience of the Benson Hill scientists. CropOS™ continuously advances and improves with every new dataset, resulting in the strengthening of the system’s predictive power. Firms like Plenty Inc and Bowery Farming Inc are nowhere behind in offering smart farming solutions. Plenty Inc is an agriculture technology company that develops plant sciences for crops to flourish in a pesticide and GMO-free environment. While Bowery Farming uses high-tech approaches such as robotics, LED lighting and data analytics to grow leafy greens indoors. The Saviour: For sustainability and waste management The global energy landscape continues to evolve, sometimes by the nanosecond, sometimes by the day. The sector finds itself pulled to economize and pushed to innovate due to a surge in demand for new power and utilities offerings. Innovations in power-sector technology, such as new storage battery options and smartphone-based thermostat apps, AI enabled sensors etc; are advancing at a pace that has surprised developers and adopters alike. Consumer’s demands for such products have increased. To meet this, industry leaders are integrating those innovations into their operations and infrastructure as rapidly as they can. On the other hand, companies pursuing energy efficiency have two long-standing goals — gaining a competitive advantage and boosting the bottom line — and a relatively new one: environmental sustainability. Realising the importance of such impending situations in the industry, we have startups like SmartTrace offering an innovative cloud-based platform to quickly manage waste at multiple levels. This includes bridging rough data from waste contractors, extrapolating to volume, EWC, finance and Co2 statistics. Data extracted acts as a guide to improve methodology, educate, strengthen oversight and direct improvements to the bottom line, as well as environmental outcomes. One Concern provides damage estimates using artificial intelligence on natural phenomena sciences. Autogrid organizes energy data and employs big data analytics to generate real-time predictions to create actionable data. The Dreamer: For lifestyle and creative product development and design Consumers in our modern world continually make multiple decisions with regard to product choice due to many competing products in the market.Often those choices boil down to whether it provides better value than others either in terms of product quality, price or by aligning with their personal beliefs and values.Lifestyle products and brands operate off ideologies, hoping to attract a relatively high number of people and ultimately becoming a recognized social phenomenon. While ecommerce has leveraged data science to master the price dimension, here are some examples of startups trying to deconstruct the other two dimensions: product development and branding. I wonder if you have ever imagined your beer to be brewed by AI? Well, now you can with IntelligentX. The Intelligent X team claim to have invented the world's first beer brewed by Artificial intelligence. They also plan to craft a premium beer using complex machine learning algorithms which can improve itself from the feedback given by its customers. Customers are given to try one of their four bottled conditioned beers, after the trial they are asked by their AI what they think of the beer, via an online feedback messenger. The data then collected is used by an algorithm to brew the next batch. Because their AI is constantly reacting to user feedback, they can brew beer that matches what customers want, more quickly than anyone else can. What this actually means that the company gets more data and customers get a customized fresh beer! In the lifestyle domain, we have Stitch Fix which has brought a personal touch to the online shopping journey. They are no regular other apparel e-commerce company. They have created a perfect formula for blending human expertise with the right amount of Data Science to serve their customers. According to Katrina Lake, Founder, and CEO, "You can look at every product on the planet, but trying to figure out which one is best for you is really the challenge” and that’s where Stitch Fix has come into the picture. The company is disrupting traditional retail by bridging the gap of personalized shopping, that the former could not achieve. To know how StitchFix uses Full Stack Data Science read our detailed article. The Writer: From content creation to curation to promotion In the publishing industry, we have seen a digital revolution coming in too. Echobox are one of the pioneers in building AI for the publishing industry. Antoine Amann, founder of Echobox, wrote in a blog post that they have "developed an AI platform that takes large quantity of variables into account and analyses them in real time to determine optimum post performance". Echobox pride itself to currently work with Facebook and Twitter for optimizing social media content, perform advanced analytics with A/B testing and also curate content for desired CTRs. With global client base like The Le Monde, The Telegraph, The Guardian etc. they have conveniently ripped social media editors. New York-based startup Agolo uses AI to create real-time summaries of information. It initially use to curate Twitter feeds in order to focus on conversations, tweets and hashtags that were most relevant to its user's preferences. Using natural language processing, Agolo scans content, identifies relationships among the information sources, and picks out the most relevant information, all leading to a comprehensive summary of the original piece of information. Other websites like Grammarly, offers AI-powered solutions to help people write, edit and formulate mistake-free content. Textio came up with augmented writing which means every time you wrote something and you would come to know ahead of time exactly who is going to respond. It basically means writing which is supported by outcomes in real time. Automated Insights, Creator of Wordsmith, the natural language generation platform enables you to produce human-sounding narratives from data. The Matchmaker: Connecting people, skills and other entities AI will make networking at B2B events more fun and highly productive for business professionals. Grip, a London based startup, formerly known as Network, rebranded itself in the month of April, 2016. Grip is using AI as a platform to make networking at events more constructive and fruitful. It acts as a B2B matchmaking engine that accumulates data from social accounts (LinkedIn, Twitter) and smartly matches the event registration data. Synonymous to Tinder for networking, Grip uses advanced algorithms to recommend the right people and presents them with an easy to use swiping interface feature. It also delivers a detailed report to the event organizer on the success of the event for every user or a social Segment. We are well aware of the data scientist being the sexiest job of the 21st century. JamieAi harnessing this fact connects technical talent with data-oriented jobs organizations of all types and sizes. The start-up firm has combined AI insights and human oversight to reduce hiring costs and eliminate bias.  Also, third party recruitment agencies are removed from the process to boost transparency and efficiency in the path to employment. Another example is Woo.io, a marketplace for matching tech professionals and companies. The Manager: Virtual assistants of a different kind Artificial Intelligence can also predict how much your household appliance will cost on your electricity bill. Verv, a producer of clever home energy assistance provides intelligent information on your household appliances. It helps its customers with a significant reduction on their electricity bills and carbon footprints. The technology uses machine learning algorithms to provide real-time information by learning how much power and money each device is using. Not only this, it can also suggest eco-friendly alternatives, alert homeowners of appliances in use for a longer duration and warn them of any dangerous activity when they aren’t present at home. Other examples include firms like Maana which manages machines and improves operational efficiencies in order to make fast data driven decisions. Gong.io, acts as a sales representative’s assistant to understand sales conversations resulting into actionable insights. ObEN, creates complete virtual identities for consumers and celebrities in the emerging digital world. The Motivator: For personal and business productivity and growth A super cross-functional company Perkbox, came up with an employee engagement platform. Saurav Chopra founder of Perkbox believes teams perform their best when they are happy and engaged! Hence, Perkbox helps companies boost employee motivation and create a more inspirational atmosphere to work. The platform offers gym services, dental discounts and rewards for top achievers in the team to firms in UK. Perkbox offers a wide range of perks, discounts and tools to help organizations retain and motivate their employees. Technologies like AWS and Kubernetes allow to closely knit themselves with their development team. In order to build, scale and support Perkbox application for the growing number of user base. So, these are some use cases where we found startups using data science and machine learning differently. Do you know of others? Please share them in the comments below.
Read more
  • 0
  • 0
  • 16848

article-image-raspberry-pi-led-blueprints
Packt
16 Sep 2015
5 min read
Save for later

Raspberry Pi LED Blueprints

Packt
16 Sep 2015
5 min read
Blinking LEDs is a popular application in the field of embedded development. In Raspberry Pi LED Blueprints by Agus Kurniawan, we are going to design, build, and test LED-based projects using the Raspberry Pi. To Implement real LED-based projects for Raspberry Pi, we need to learn how to interface various LED modules, such as LEDs, 7-segment, 4-digit 7-segment, and dot matrix to Raspberry Pi. We will get hands-on experience by exploring real-time LEDs with this project-based book. (For more resources related to this topic, see here.) Why Raspberry Pi? The Raspberry Pi was designed by the Raspberry Pi Foundation in the UK initially to help schoolkids learn basic computer science knowledge. The Raspberry Pi uses Linux as a basic programming language, and they attempt to come up with their own language that fits this technology better sometime in the future. Although Raspberry Pi is as small as the size of a credit card, it works like a normal computer at a relatively low price. A Raspberry Pi can easily control an LED, which is a simple actuator device that displays lighting. This book will provide you with the ability to control LEDs using Raspberry Pi. What this article covers? This article covers introduction of Raspberry Pi GPIO. In this, we will learn how to use different libraries to access Raspberry Pi GPIO. The step-by-step procedure to install it is also provided along with the Python command. Introducing Raspberry Pi GPIO General-purpose input/output (GPIO) is a generic pin on the Raspberry Pi, which can be used to interact with external devices, for instance, sensor and actuator devices. In general, you can see Raspberry Pi GPIO pinouts in the following figure: To access Raspberry Pi GPIO, we can use several GPIO libraries. If you are working with Python, Raspbian has already installed the RPi.GPIO library to access Raspberry Pi GPIO. You can read more about RPi.GPIO at https://pypi.python.org/pypi/RPi.GPIO. You can verify the RPi.GPIO library from a Python terminal by importing the RPi.GPIO module. If you don’t find this library on Python at runtime or get the error message ImportError: No module named RPi.GPIO, you can install it by compiling from the source code. For instance, if we want to install RPi.GPIO 0.5.11, type the following commands: wget https://pypi.python.org/packages/source/R/RPi.GPIO/RPi.GPIO-0.5.11.tar.gz tar -xvzf RPi.GPIO-0.5.11.tar.gz cd RPi.GPIO-0.5.11/ sudo python setup.py install To install and update through the apt command, your Raspberry Pi must be connected to the Internet. Another way to access Raspberry Pi GPIO is to use WiringPi. It is a library written in C for Raspberry Pi to access GPIO pins. You can read more about WiringPi at http://wiringpi.com/. To install WiringPi, you can type the following commands: sudo apt-get update sudo apt-get install git-core git clone git://git.drogon.net/wiringPi cd wiringPi sudo ./build Please make sure that your Pi network does not block the git protocol for git://git.dragon.net/wiringPi. You can browsed https://git.drogon.net/?p=wiringPi;a=summary for this code. The next step is to install the WiringPi interface for Python, so you can access Raspberry Pi GPIO from the Python program. Type the following commands: sudo apt-get install python-dev python-setuptools git clone https://github.com/Gadgetoid/WiringPi2-Python.git cd WiringPi2-Python sudo python setup.py install When finished, you can verify it by showing GPIO map from the Raspberry Pi board using the following gpio tool: gpio readall You should see the GPIO map from the Raspberry Pi board on the terminal. You can also see values in the wPi column, which will be used in the WirinPi program as GPIO value parameters. In this book, you can find more information about how to use it on the WiringPi library. What you need for this book? We are going to use Raspberry Pi 2 board Model B. To make Raspberry Pi work, we need OS that acts as a bridge between the hardware and the user. There are many OS options that you can use for Raspberry Pi. This book uses Raspbian for the OS platform for Raspberry Pi. To deploy Raspbian on Raspberry Pi 2 Model B, we need microSD card of at least 4 GB size. Who this book is written for? This book is for those who want to learn how to build Raspberry Pi projects using LEDs, 7-segment, 4-digit 7-segment, and dot matrix modules. You will also learn to implement those modules in real applications, including interfacing with wireless modules and the Android mobile app. However, you don't need to have any previous experience with the Raspberry Pi or Android platforms. Summary In this article, we learned different techniques to install Raspberry Pi GPIO. Read Raspberry Pi LED Blueprints to start designing and implementing several projects based on LEDs, such as 7-segments, 4-digit 7-segment, and dot matrix displays. Other related titles are: Raspberry Pi Blueprints Raspberry Pi Super Cluster Learning Raspberry Pi Raspberry Pi Robotic Projects Resources for Article: Further resources on this subject: Color and motion finding [article] Basic Image Processing [article] Develop a Digital Clock [article]
Read more
  • 0
  • 0
  • 16847
article-image-write-first-blockchain-learning-solidity-programming-15-minutes
Aaron Lazar
03 Jan 2018
15 min read
Save for later

Write your first Blockchain: Learning Solidity Programming in 15 minutes

Aaron Lazar
03 Jan 2018
15 min read
[box type="note" align="" class="" width=""]This post is a book extract from the title Mastering Blockchain, authored by Imran Bashir. The book begins with the technical foundations of blockchain, teaching you the fundamentals of cryptography and how it keeps data secure.[/box] Our article aims to quickly get you up to speed with Blockchain development using the Solidity Programming language. Introducing solidity Solidity is a domain-specific language of choice for programming contracts in Ethereum. There are, however, other languages, such as serpent, Mutan, and LLL but solidity is the most popular at the time of writing this. Its syntax is closer to JavaScript and C. Solidity has evolved into a mature language over the last few years and is quite easy to use, but it still has a long way to go before it can become advanced and feature-rich like other well established languages. Nevertheless, this is the most widely used language available for programming contracts currently. It is a statically typed language, which means that variable type checking in solidity is carried out at compile time. Each variable, either state or local, must be specified with a type at compile time. This is beneficial in the sense that any validation and checking is completed at compile time and certain types of bugs, such as interpretation of data types, can be caught earlier in the development cycle instead of at run time, which could be costly, especially in the case of the blockchain/smart contracts paradigm. Other features of the language include inheritance, libraries, and the ability to define composite data types. Solidity is also a called contract-oriented language. In solidity, contracts are equivalent to the concept of classes in other object-oriented programming languages. Types Solidity has two categories of data types: value types and reference types. Value types These are explained in detail here. Boolean This data type has two possible values, true or false, for example: bool v = true; This statement assigns the value true to v. Integers This data type represents integers. A table is shown here, which shows various keywords used to declare integer data types. For example, in this code, note that uint is an alias for uint256: uint256 x; uint y; int256 z; These types can also be declared with the constant keyword, which means that no storage slot will be reserved by the compiler for these variables. In this case, each occurrence will be replaced with the actual value: uint constant z=10+10; State variables are declared outside the body of a function, and they remain available throughout the contract depending on the accessibility assigned to them and as long as the contract persists. Address This data type holds a 160-bit long (20 byte) value. This type has several members that can be used to interact with and query the contracts. These members are described here: Balance The balance member returns the balance of the address in Wei. Send This member is used to send an amount of ether to an address (Ethereum's 160-bit address) and returns true or false depending on the result of the transaction, for example, the following: address to = 0x6414cc08d148dce9ebf5a2d0b7c220ed2d3203da; address from = this; if (to.balance < 10 && from.balance > 50) to.send(20); Call functions The call, callcode, and delegatecall are provided in order to interact with functions that do not have Application Binary Interface (ABI). These functions should be used with caution as they are not safe to use due to the impact on type safety and security of the contracts. Array value types (fixed size and dynamically sized byte arrays) Solidity has fixed size and dynamically sized byte arrays. Fixed size keywords range from bytes1 to bytes32, whereas dynamically sized keywords include bytes and strings. bytes are used for raw byte data and string is used for strings encoded in UTF-8. As these arrays are returned by the value, calling them will incur gas cost. length is a member of array value types and returns the length of the byte array. An example of a static (fixed size) array is as follows: bytes32[10] bankAccounts; An example of a dynamically sized array is as follows: bytes32[] trades; Get length of trades: trades.length; Literals These are used to represent a fixed value. Integer literals Integer literals are a sequence of decimal numbers in the range of 0-9. An example is shown as follows: uint8 x = 2; String literals String literals specify a set of characters written with double or single quotes. An example is shown as follows: 'packt' "packt” Hexadecimal literals Hexadecimal literals are prefixed with the keyword hex and specified within double or single quotation marks. An example is shown as follows: (hex'AABBCC'); Enums This allows the creation of user-defined types. An example is shown as follows: enum Order{Filled, Placed, Expired }; Order private ord; ord=Order.Filled; Explicit conversion to and from all integer types is allowed with enums. Function types There are two function types: internal and external functions. Internal functions These can be used only within the context of the current contract. External functions External functions can be called via external function calls. A function in solidity can be marked as a constant. Constant functions cannot change anything in the contract; they only return values when they are invoked and do not cost any gas. This is the practical implementation of the concept of call as discussed in the previous chapter. The syntax to declare a function is shown as follows: function <nameofthefunction> (<parameter types> <name of the variable>) {internal|external} [constant] [payable] [returns (<return types> <name of the variable>)] Reference types As the name suggests, these types are passed by reference and are discussed in the following section. Arrays Arrays represent a contiguous set of elements of the same size and type laid out at a memory location. The concept is the same as any other programming language. Arrays have two members named length and push: uint[] OrderIds; Structs These constructs can be used to group a set of dissimilar data types under a logical group. These can be used to define new types, as shown in the following example: Struct Trade { uint tradeid; uint quantity; uint price; string trader; } Data location Data location specifies where a particular complex data type will be stored. Depending on the default or annotation specified, the location can be storage or memory. This is applicable to arrays and structs and can be specified using the storage or memory keywords. As copying between memory and storage can be quite expensive, specifying a location can be helpful to control the gas expenditure at times. Calldata is another memory location that is used to store function arguments. Parameters of external functions use calldata memory. By default, parameters of functions are stored in memory, whereas all other local variables make use of storage. State variables, on the other hand, are required to use storage. Mappings Mappings are used for a key to value mapping. This is a way to associate a value with a key. All values in this map are already initialized with all zeroes, for example, the following: mapping (address => uint) offers; This example shows that offers is declared as a mapping. Another example makes this clearer: mapping (string => uint) bids; bids["packt"] = 10; This is basically a dictionary or a hash table where string values are mapped to integer values. The mapping named bids has a packt string value mapped to value 10. Global variables Solidity provides a number of global variables that are always available in the global namespace. These variables provide information about blocks and transactions. Additionally, cryptographic functions and address-related variables are available as well. A subset of available functions and variables is shown as follows: keccak256(...) returns (bytes32) This function is used to compute the keccak256 hash of the argument provided to the Function: ecrecover(bytes32 hash, uint8 v, bytes32 r, bytes32 s) returns (address) This function returns the associated address of the public key from the elliptic curve signature: block.number This returns the current block number. Control structures Control structures available in solidity are if - else, do, while, for, break, continue, return. They work in a manner similar to how they work in C-language or JavaScript. Events Events in solidity can be used to log certain events in EVM logs. These are quite useful when external interfaces are required to be notified of any change or event in the contract. These logs are stored on the blockchain in transaction logs. Logs cannot be accessed from the contracts but are used as a mechanism to notify change of state or the occurrence of an event (meeting a condition) in the contract. In a simple example here, the valueEvent event will return true if the x parameter passed to function Matcher is equal to or greater than 10: contract valueChecker { uint8 price=10; event valueEvent(bool returnValue); function Matcher(uint8 x) returns (bool) { if (x>=price) { valueEvent(true); return true; } } } Inheritance Inheritance is supported in solidity. The is keyword is used to derive a contract from another contract. In the following example, valueChecker2 is derived from the valueChecker contract. The derived contract has access to all nonprivate members of the parent contract: contract valueChecker { uint8 price=10; event valueEvent(bool returnValue); function Matcher(uint8 x) returns (bool) {  if (x>=price)  {   valueEvent(true);   return true;   }  } } contract valueChecker2 is valueChecker { function Matcher2() returns (uint) { return price + 10; }     } In the preceding example, if uint8 price = 10 is changed to uint8 private price = 10, then it will not be accessible by the valuechecker2 contract. This is because now the member is declared as private, it is not allowed to be accessed by any other contract. Libraries Libraries are deployed only once at a specific address and their code is called via CALLCODE/DELEGATECALL Opcode of the EVM. The key idea behind libraries is code reusability. They are similar to contracts and act as base contracts to the calling contracts. A library can be declared as shown in the following example: library Addition { function Add(uint x,uint y) returns (uint z)  {    return x + y;  } } This library can then be called in the contract, as shown here. First, it needs to be imported and it can be used anywhere in the code. A simple example is shown as follows: Import "Addition.sol" function Addtwovalues() returns(uint) { return Addition.Add(100,100); } There are a few limitations with libraries; for example, they cannot have state variables and cannot inherit or be inherited. Moreover, they cannot receive Ether either; this is in contrast to contracts that can receive Ether. Functions Functions in solidity are modules of code that are associated with a contract. Functions are declared with a name, optional parameters, access modifier, optional constant keyword, and optional return type. This is shown in the following example: function orderMatcher(uint x) private constant returns(bool returnvalue) In the preceding example, function is the keyword used to declare the function. orderMatcher is the function name, uint x is an optional parameter, private is the access modifier/specifier that controls access to the function from external contracts, constant is an optional keyword used to specify that this function does not change anything in the contract but is used only to retrieve values from the contract instead, and returns (bool returnvalue) is the optional return type of the function. How to define a function: The syntax of defining a function is shown as follows: function <name of the function>(<parameters>) <visibility specifier> returns (<return data type> <name of the variable>) {  <function body> } Function signature: Functions in solidity are identified by its signature, which is the first four bytes of the keccak-256 hash of its full signature string. This is also visible in browser solidity, as shown in the following screenshot. D99c89cb is the first four bytes of 32 byte keccak-256 hash of the function named Matcher. In this example function, Matcher has the signature hash of d99c89cb. This information is useful in order to build interfaces. Input parameters of a function: Input parameters of a function are declared in the form of <data type> <parameter name>. This example clarifies the concept where uint x and uint y are input parameters of the checkValues function: contract myContract { function checkValues(uint x, uint y) { } } Output parameters of a function: Output parameters of a function are declared in the form of <data type> <parameter name>. This example shows a simple function returning a uint value: contract myContract { Function getValue() returns (uint z) {  z=x+y; } } A function can return multiple values. In the preceding example function, getValue only returns one value, but a function can return up to 14 values of different data types. The names of the unused return parameters can be omitted optionally. Internal function calls: Functions within the context of the current contract can be called internally in a direct manner. These calls are made to call the functions that exist within the same contract. These calls result in simple JUMP calls at the EVM byte code level. External function calls: External function calls are made via message calls from a contract to another contract. In this case, all function parameters are copied to the memory. If a call to an internal function is made using the this keyword, it is also considered an external call. The this variable is a pointer that refers to the current contract. It is explicitly convertible to an address and all members for a contract are inherited from the address. Fall back functions: This is an unnamed function in a contract with no arguments and return data. This function executes every time ether is received. It is required to be implemented within a contract if the contract is intended to receive ether; otherwise, an exception will be thrown and ether will be returned. This function also executes if no other function signatures match in the contract. If the contract is expected to receive ether, then the fall back function should be declared with the payable modifier. The payable is required; otherwise, this function will not be able to receive any ether. This function can be called using the address.call() method as, for example, in the following: function () { throw; } In this case, if the fallback function is called according to the conditions described earlier; it will call throw, which will roll back the state to what it was before making the call. It can also be some other construct than throw; for example, it can log an event that can be used as an alert to feed back the outcome of the call to the calling application. Modifier functions: These functions are used to change the behavior of a function and can be called before other functions. Usually, they are used to check some conditions or verification before executing the function. _(underscore) is used in the modifier functions that will be replaced with the actual body of the function when the modifier is called. Basically, it symbolizes the function that needs to be guarded. This concept is similar to guard functions in other languages. Constructor function: This is an optional function that has the same name as the contract and is executed once a contract is created. Constructor functions cannot be called later on by users, and there is only one constructor allowed in a contract. This implies that no overloading functionality is available. Function visibility specifiers (access modifiers): Functions can be defined with four access specifiers as follows: External: These functions are accessible from other contracts and transactions. They cannot be called internally unless the this keyword is used. Public: By default, functions are public. They can be called either internally or using messages. Internal: Internal functions are visible to other derived contracts from the parent contract. Private: Private functions are only visible to the same contract they are declared in. Other important keywords/functions throw: throw is used to stop execution. As a result, all state changes are reverted. In this case, no gas is returned to the transaction originator because all the remaining gas is consumed. Layout of a solidity source code file Version pragma In order to address compatibility issues that may arise from future versions of the solidity compiler version, pragma can be used to specify the version of the compatible compiler as, for example, in the following: pragma solidity ^0.5.0 This will ensure that the source file does not compile with versions smaller than 0.5.0 and versions starting from 0.6.0. Import Import in solidity allows the importing of symbols from the existing solidity files into the current global scope. This is similar to import statements available in JavaScript, as for example, in the following: Import "module-name"; Comments Comments can be added in the solidity source code file in a manner similar to C-language. Multiple line comments are enclosed in /* and */, whereas single line comments start with //. An example solidity program is as follows, showing the use of pragma, import, and comments: To summarize, we went through a brief introduction to the solidity language. Detailed documentation and coding guidelines are available online. If you found this article useful, and would like to learn more about building blockchains, go ahead and grab the book Mastering Blockchain, authored by Imran Bashir.  
Read more
  • 0
  • 0
  • 16842

article-image-open-source-intelligence
Packt
30 Dec 2014
30 min read
Save for later

Open Source Intelligence

Packt
30 Dec 2014
30 min read
This article is written by Douglas Berdeaux, the author of Penetration Testing with Perl. Open source intelligence (OSINT) refers to intelligence gathering from open and public sources. These sources include search engines, the client target's web-accessible software or sites, social media sites and forums, Internet routing and naming authorities, public information sites, and more. If done properly and thoroughly, the practice of OSINT can prove to be useful to strengthen social engineering and remote exploitation attacks on our client target as we search for ways to gain access to their systems and buildings during a penetration test. What's covered In this article, we will cover how to gather the information listed using Perl: E-mail addresses from our client target using search engines and social media sites Networking, hosting, routing, and system data of our client target using online resources and simple networking utilities To gather this data, we rely heavily on the LWP::UserAgent Perl module. We will also discover how to use this module with a secured socket layer SSL/TLS (HTTPS) connection. In addition to this, we will learn about a few new Perl modules that are listed here: Net::Whois::Raw Net::DNS::Dig Net::DNS Net::Traceroute XML::LibXML Google dorks Before we use Google for intelligence gathering, we should briefly touch upon using Google dorks, which we can use to refine and filter our Google searches. A Google dork is a string of special syntax that we pass to Google's request handler using the q= option. The dork can comprise operators and keywords separated by a colon and concatenated strings using a plus symbol + as a delimiter. Here is a list of simple Google dorks that we can use to narrow our Google searches: intitle:<string> searches for pages whose HTML title tags contain the string <string> filetype:<ext> searches for files that have the extension <ext> site:<domain> narrows the search to only results that are located on the <domain> target servers inurl:<string> returns results that contain <string> in their URL -<word> negates the word following the minus symbol - in a search filter link:<page> searches for pages that contain the HTML HREF links to the page This is just a small list and a complete guide of Google search operators that can be found on their support servers. A list of well-known exploited Google dorks for information gathering can be found in a Google hacker's database at http://www.exploit-db.com/google-dorks/. E-mail address gathering Getting e-mail addresses from our target can be a rather hard task and can also mean gathering usernames used within the target's domain, remote management systems, databases, workstations, web applications, and much more. As we can imagine, gathering a username is 50 percent of the intrusion for target credential harvesting; the other 50 percent being the password information. So how do we gather e-mail addresses from a target? Well, there are several methods; the first we will look at will be simply using search engines to crawl the web for anything useful, including forum posts, social media, e-mail lists for support, web pages and mailto links, and anything else that was cached or found from ever-spidering search engines. Using Google for e-mail address gathering Automating queries to search engines is usually always best left to application programming interfaces (APIs). We might be able to query the search engine via a simple GET request, but this leaves enough room for error, and the search engine can potentially temporarily block our IP address or force us to validate our humanness using an image of words as it might assume that we are using a bot. Unfortunately, Google only offers a paid version of their general search API. They do, however, offer an API for a custom search, but this is restricted to specified domains. We want to be as thorough as possible and search as much of the web as we can, time permitting, when intelligence gathering. Let's go back to our LWP::UserAgent Perl module and make a simple request to Google, searching for any e-mail addresses and URLs from a given domain. The URLs are useful as they can be spidered to within our application if we feel inclined to extend the reach of our automated OSINT. In the following examples, we want to impersonate a browser as much as possible to not raise flags at Google by using automation. We accomplish this by using the LWP::UserAgent Perl module and spoofing a valid Firefox user agent: #!/usr/bin/perl -w use strict; use LWP::UserAgent; use LWP::Protocol::https; my $usage = "Usage ./email_google.pl <domain>"; my $target = shift or die $usage; my $ua = LWP::UserAgent->new; my %emails = (); # unique my $url = 'https://www.google.com/search?num=100&start=0&hl=en&meta=&q=%40%22'.$target.'%22'; $ua->agent("Mozilla/5.0 (Windows; U; Windows NT 6.1 en-US; rv:1.9.2.18) Gecko/20110614 Firefox/3.6.18"); $ua->timeout(10); # setup a timeout $ua->show_progress(1); # display progress bar my $res = $ua->get($url); if($res->is_success){ my @urls = split(/url?q=/,$res->as_string); foreach my $gUrl (@urls){ # Google URLs next if($gUrl =~ m/(webcache.googleusercontent)/i or not $gUrl =~ m/^http/); $gUrl =~ s/&amp;sa=U.*//; print $gUrl,"n"; } my @emails = $res->as_string =~ m/[a-z0-9_.-]+@/ig; foreach my $email (@emails){ if(not exists $emails{$email}){    print "Possible Email Match: ",$email,$target,"n";    $emails{$email} = 1; # hashes are faster } } } else{ die $res->status_line; } The LWP::UserAgent module used in the previous code is not new to us. We did, however, add SSL support using the LWP::Protocol::https module. Our URL $url object is a simple Google search URL that anyone would browse to with a normal browser. The num= value pertains the returned results from Google in a single page, which we have set to 100. To also act as a browser, we needed to set the user agent with the agent() method, which we did as a Mozilla browser. After this, we set a timeout and Boolean to show a simple progress bar. The rest is just simple Perl string manipulation and pattern matching. We use the regular expression url?q= to split the string returned by the as_string method from the $res object. Then, for each URL string, we use another regular expression, &amp;sa=U.*, to remove excess analytic garbage that Google adds. Then, we simply parse out all e-mail addresses found using the same method but different regexp. We stuff all matches into the @emails array and loop over them, displaying them to our screen if they don't exist in the $emails{} Perl hash. Let's run this program against the weaknetlabs.com domain and analyze the output: root@wnld960:~# perl email_google.pl weaknetlabs.com ** GET https://www.google.com/search?num=100&start=0&hl=en&meta=&q=%40%22weaknetlabs.com%22 ==> 200 OK (1s) http://weaknetlabs.com/ http://weaknetlabs.com/main/%3Fpage_id%3D479 … http://www.securitytube.net/video/2039 Possible Email Match: Douglas@weaknetlabs.com Possible Email Match: weaknetlabs@weaknetlabs.com root@wnld960:~# This is the (trimmed) output when we run an automated Google search for an e-mail address from weaknetlabs.com. Using social media for e-mail address gathering Now, let's turn our attention to using social media sites such as Google+, LinkedIn, and Facebook to try to gather e-mail addresses using Perl. Social media sites can sometimes reflect information about an employee's attitude towards their employer, their status within the company, position, e-mail addresses, and more. All of this information is considered OSINT and can be useful when advancing our attacks. Google+ We can also search plus.google.com for contact information from users belonging to our target. The following is the URL-encoded Google dork we will use to search the Google+ profiles for an employee of our target: intitle%3A"About+-+Google%2B"+"Works+at+'.$target.'"+site%3Aplus.google.com The URL-encoded symbols are as follows: %3A: This is a colon, that is, : %2B: This is a plus symbol, that is, + The plus symbol + is a special component of Google dork, as we mentioned in the previous section. The intitle keyword tells Google to display results whose HTML <title> tag contains the About – Google+ text. Then, we add the string (in quotations) "Works at " (notice the space at the end), and then the target name as the string object $target. The site keyword tells the Google search engine to only display results from the plus.google.com site. Let's implement this in our Perl program and see what results are returned for Google employees: #!/usr/bin/perl -w use strict; use LWP::UserAgent; use LWP::Protocol::https; my $ua = LWP::UserAgent->new; my $usage = "Usage ./googleplus.pl <target name>"; my $target = shift or die $usage; $target =~ s/s/+/g; my $gUrl = 'https://www.google.com/search?safe=off&noj=1&sclient=psy-ab&q=intitle%3A"About+-+Google%2B"+"Works+at+' .$target.'"+site%3Aplus.google.com&oq=intitle%3A"About+-+Google%2B"+"Works+at+'.$target.'"+site%3Aplus.google.com'; $ua->agent("Mozilla/5.0 (Windows; U; Windows NT 6.1 en-US; rv:1.9.2.18) Gecko/20110614 Firefox/3.6.18"); $ua->timeout(10); # setup a timeout my $res = $ua->get($gUrl); if($res->is_success){ foreach my $string (split(/url?q=/,$res->as_string)){ next if($string =~ m/(webcache.googleusercontent)/i or not $string =~ m/^http/); $string =~ s/&amp;sa=U.*//; print $string,"n"; } } else{ die $res->status_line; } This Perl program is quite similar to our last search program. Now, let's run this to find possible Google employees. Since a target client company can have spaces in its name, we accommodate them by encoding them for Google as plus symbols: root@wnld960:~# perl googleplus.pl google https://plus.google.com/%2BPaulWilcox/about https://plus.google.com/%2BNatalieVillalobos/about ... https://plus.google.com/%2BAndrewGerrand/about root@wnld960:~# The preceding (trimmed) output proves that our Perl script works as we browse to the returned results. These two Google search scripts provided us with some great information quickly. Let's move on to another example, not using Google but LinkedIn, a social media site for professionals. LinkedIn LinkedIn can provide us with the contact information and IT skill levels of our client target during a penetration test. Here, we will focus on the contact information. By now, we should feel very comfortable making any type of web request using LWP::UserAgent and parsing its output for intelligence data. In fact, this LinkedIn example should be a breeze. The trick is fine-tuning our filters and regular expressions to get only relevant data. Let's just dive right into the code and then analyze some sample output: #!/usr/bin/perl -w use strict; use LWP::UserAgent; use LWP::Protocol::https; my $ua = LWP::UserAgent->new; my $usage = "Usage ./googlepluslinkedin.pl <target name>"; my $target = shift or die $usage; my $gUrl = 'https://www.google.com/search?q=site:linkedin.com+%22at+'.$target.'%22'; my %lTargets = (); # unique $ua->agent("Mozilla/5.0 (Windows; U; Windows NT 6.1 en-US; rv:1.9.2.18) Gecko/20110614 Firefox/3.6.18"); $ua->timeout(10); # setup a timeout my $google = getUrl($gUrl); # one and ONLY call to Google foreach my $title ($google =~ m/shref="/url?.*">[a-z0-9_. -]+s?.b.at $target..b.s-slinked/ig){ my $lRurl = $title; $title =~ s/.*">([^<]+).*/$1/; $lRurl =~ s/.*url?.*q=(.*)&amp;sa.*/$1/; print $title,"-> ".$lRurl."n"; my @ln = split(/15?12/,getUrl($lRurl)); foreach(@ln){ if(m/title="/i){    my $link = $_;    $link =~ s/.*href="([^"]+)".*/$1/;    next if exists $lTargets{$link};    $lTargets{$link} = 1;    my $name = $_;    $name =~ s/.*title="([^"]+)".*/$1/;    print "t",$name," : ",$link,"n"; } } } sub getUrl{ sleep 1; # pause... my $res = $ua->get(shift); if($res->is_success){ return $res->as_string; }else{ die $res->status_line; } } The preceding Perl program makes one query to Google to find all possible positions from the target; for each position found, it queries LinkedIn to find employees of the target. The regular expressions used were finely crafted after inspection of the returned HTML object from a simple query to both Google and LinkedIn. This is a great example of how we can spider off from our initial Google results to gather even more intelligence using Perl automation. Let's take a look at some sample outputs from this program when used against Walmart.com: root@wnld960:~# perl linkedIn.pl Walmart Buyer : http://www.linkedin.com/title/buyer/at-walmart/        Jason Kloster : http://www.linkedin.com/in/jasonkloster       Rajiv Ahirwal : http://www.linkedin.com/in/rajivahirwal ... Store manager : http://www.linkedin.com/title/store%2Bmanager/at-walmart/        Benjamin Hunt 13k+ (LION) #1 Connected Leader at Walmart : http://www.linkedin.com/in/benjaminhunt01 ... Shift manager : http://www.linkedin.com/title/shift%2Bmanager/at-walmart/        Frank Burns : http://www.linkedin.com/pub/frank-burns/24/83b/285 ... Assistant store manager : http://www.linkedin.com/title/assistant%2Bstore%2Bmanager/at-walmart/        John Cole : http://www.linkedin.com/pub/john-cole/67/392/b39        Crystal Herrera : http://www.linkedin.com/pub/crystal-herrera/92/74a/97b root@wnld960:~# The preceding (trimmed) output provided some great insight into employee positions, and even real employees in those positions of the target, with a simple call to one script. All of this information is publicly available information and we are not directly attacking Walmart or its employees; we are just using this as an example of intelligence-gathering techniques during a penetration test using Perl programming. This information can further be used for reporting, and we can even extend this data into other areas of research. For instance, we can easily follow the LinkedIn links with LWP::UserAgent and pull even more data from the publicly available LinkedIn profiles. This data, when compared to Google+ profile data and simple Google searches, should help in providing a background to create a more believable pretext for social engineering. Now, let's see if we can use Google to search more social media websites for information on our client target. Facebook We can easily argue that Facebook is one of the largest social networking sites around during the writing of this book. Facebook can easily return a large amount of data about a person, and we don't even have to go to the site to get it! We can easily extend our reach into the Web with the gathered employee names, from our previous code, by searching Google using the site:faceboook.com parameter and the exact same syntax as from the first example in the Google section of the E-mail address gathering section. The following are a few simple Google dorks that can possibly reveal information about our client target: site:facebook.com "manager at target" site:facebook.com "ceo at target" site:facebook.com "owner of target" site:facebook.com "experience at target" This information can return customer and employee criticism that can be used for a wide array of penetration-testing purposes, including social engineering pretexting. We can narrow our focus even further by adding other keywords and strings from our previously gathered intelligence, such as city names, company names, and more. Just about anything returned can be compiled into a unique wordlist for password cracking, and contrasted with the known data with Digital Credential Analysis (DCA). Domain Name Services Domain Name Services (DNS) are used to translate IP addresses into hostnames so that we can use alphanumeric addresses instead of IP addresses for websites or services. It makes our lives a lot easier when typing in a URL with a name rather than a 4-byte numerical value. Any client target can potentially have full control over their naming services. DNS A records can be assigned to any IP address. We can easily write our own record with domain control for an IPv4 class A address, such as 10.0.0.1, which is commonly done for an internal network to allow its users to easily connect to different internal services. The Whois query Sometimes, when we can get an IP address for a client target, we can pass this IP address to the Whois database, and in return, we can get a range of IP addresses in which our IP lies and the organization that owns the range. If the organization is our target, then we now know a range of IP addresses pointing directly to their resources. Usually, this information is given during a penetration test, and the limitations on the lengths that we are allowed to go to for IP ranges are set so that we can be limited simply to reporting. Let's use Perl and the Net::Whois::Raw module to interact with the American Registry for Internet Numbers (ARIN) database for an IP address: #!/usr/bin/perl -w use strict; use Net::Whois::Raw; die "Usage: perl netRange.pl <IP Address>" unless $ARGV[0]; foreach(split(/n/,whois(shift))){ print $_,"n" if(m/^(netrange|orgname)/i); } The preceding code, when run, should produce information about the network range and organization name that owns the range. It is very simple, and it can be compared to calling the whois program form the Linux command line. If we were to script this to run through a number of different IP addresses and run the Whois query against each one, we could be violating the terms of service set by ARIN. Let's test it and see what we get with a random IP address: root@wnld960:~# perl whois.pl 198.123.2.22 NetRange:       198.116.0.0 - 198.123.255.255 OrgName:       National Aeronautics and Space Administration root@wnld960:~# This is the output from our Perl program, which reveals an IP range that can belong to the organization listed. If this fails, and we need to find more than one hostname owned by our client target, we can try a brute force method that simply checks our name servers; we will do just that in the next section. The DIG query DIG stands for domain information groper and is a utility to do just that using DNS queries. The DIG Linux utility has actually replaced the older host and nslookup. In making these queries, one thing to note is that when we don't specify a name server to use, the DIG utility will simply use the Linux OS default resolver. We can, however, pass a name server to DIG; we will cover this in the upcoming section, Zone transfers. There is a nice object-oriented Perl module for DIG that we will examine, which is called Net::DNS::Dig. Let's quickly look at an example to query our DNS with this module: #!/usr/bin/perl -w use Net::DNS::Dig; use strict; my $dig = new Net::DNS::Dig(); my $dom = shift or die "Usage: perl dig.pl <domain>"; my $dobj = $dig->for($dom, 'A'); # print $dobj->sprintf; # print entire dig query response print "CODE: ",$dobj->rcode(1),"n"; # Dig Response Code my %mx = Net::DNS::Dig->new()->for($dom,'MX')->rdata(); while(my($val,$server) = each(%mx)){ print "MX: ",$server," - ",$val,"n"; } The preceding code is simple. We create a DIG object $dig and call the for() method, passing the domain name we pulled by shifting the command-line arguments and types for A records. We print the returned response with sprintf(), and then the response code alone with the rcode() method. Finally, we create a hash object %mx from the rdata() method. We pass the rdata() object returned from making a new Net::DNS::Dig object, and call the for() method on it with a type of MX for the mail server. Let's try this against a domain and see what is returned: root@wnld960:~# perl dig.pl weaknetlabs.com ; <<>> Net::DNS::Dig 0.12 <<>> -t a weaknetlabs.com. ;; ;; Got answer. ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 34071 ;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0 ;; QUESTION SECTION: ;weaknetlabs.com.               IN     A ;; ANSWER SECTION: weaknetlabs.com.       300    IN     A       198.144.36.192 ;; Query time: 118 ms ;; SERVER: 75.75.76.76# 53(75.75.76.76) ;; WHEN: Mon May 19 18:26:31 2014 ;; MSG SIZE rcvd: 49 -- XFR size: 2 records CODE: NOERROR MX: mailstore1.secureserver.net - 10 MX: smtp.secureserver.net – 0 The output is just as expected. Everything above the line starting with CODE is the response from making the DIG query. CODE is returned from the rcode() method. Since we passed a true value to rcode(), we got a string type, NOERROR, returned. Next, we printed the key and value pairs of the %mx Perl hash, which displayed our target's e-mail server names. Brute force enumeration Keeping the previous lesson in mind, and knowing that Linux offers a great wealth of networking utilities, we might be inclined to write our own DNS brute force tool to enumerate any possible A records that our client target could have made prior to our penetration test. Let's take a quick look at the nslookup utility we can use to check if a record exists: trevelyn@wnld960:~$ nslookup admin.warcarrier.org Server:         75.75.76.76 Address:       75.75.76.76#53 Non-authoritative answer: Name:   admin.warcarrier.org Address: 10.0.0.1 trevelyn@wnld960:~$ nslookup admindoesntexist.warcarrier.org Server:         75.75.76.76 Address:        75.75.76.76#53 ** server can't find admindoesntexist.warcarrier.org: NXDOMAIN trevelyn@wnld960:~$ This is the output of two calls to nslookup, the networking utility used for returning IP addresses of hostnames, and vice versa. The first A record check was successful, and the second, the admindoesntexist subdomain, was not. We can easily see from the output of this program how we can parse it to check whether the subdomain exists. We can also see from the two subdomains that we can use a simple word list of commonly used subdomains for efficiency, before trying many possible combinations. A lot of intelligence gathering might have already been done for you by search engines such as Google. In fact, the keyword search site: can return more than just the www subdomains. If we broaden our num= URL GET parameter and loop through all possible results by incrementing the start= parameter, we can potentially get results from other subdomains of our target. Now that we have seen the basic query for a subdomain, let's turn our focus to use Perl and a new Perl module, Net::DNS, to enumerate a few subdomains: #!/usr/bin/perl -w use strict; use Net::DNS; my $dns = Net::DNS::Resolver->new; my @subDomains = ("admin","admindoesntexist","www","mail","download","gateway"); my $usage = "perl domainbf.pl <domain name>"; my $domain = shift or die $usage; my $total = 0; dns($_) foreach(@subDomains); print $total," records testedn"; sub dns{ # search sub domains: $total++; # record count my $hn = shift.".".$domain; # construct hostname my $dnsLookup = $dns->search($hn); if($dnsLookup){ # successful lookup my $t=0; foreach my $ip ($dnsLookup->answer){    return unless $ip->type eq "A" and $t<1; # A records    print $hn,": ",$ip->address,"n"; # just the IP    $t++; } } return; } The preceding Perl program loops through the @domains array and calls the dns() subroutine on each, which returns or prints a successful query. The $t integer token is used for subdomains, which has several identical records to avoid repetition in the program's output. After this, we simply print the total of the records tested. This program can be easily modified to open a word list file, and we can loop through each by passing them to the dns() subroutine, with something similar to the following: open(FLE,"file.txt"); while(<FLE>){ dns($_); } Zone transfers As we have seen with an A record, the admin.warcarrier.org entry provided us with some insight as to the IP range of the internal network, or the class A address 10.0.0.1. Sometimes, when a client target is controlling and hosting their own name servers, they accidentally allow DNS zone transfers from their name servers into public name servers, providing the attacker with information where the target's resources are. Let's use the Linux host utility to check for a DNS zone transfer: [trevelyn@shell ~]$ host -la warcarrier.org beth.ns.cloudflare.com Trying "warcarrier.org" Using domain server: Name: beth.ns.cloudflare.com Address: 2400:cb00:2049:1::adf5:3a67#53 Aliases: ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 20461 ;; flags: qr aa; QUERY: 1, ANSWER: 13, AUTHORITY: 0, ADDITIONAL: 0 ;; QUESTION SECTION: ; warcarrier.org.           IN     AXFR ;; ANSWER SECTION: warcarrier.org.     300     IN     SOA     beth.ns.cloudflare.com. warcarrier.org. beth.ns.cloudflare.com. 2014011513 18000 3600 86400 1800 warcarrier.org.     300     IN     NS     beth.ns.cloudflare.com. warcarrier.org.     300     IN     NS     hank.ns.cloudflare.com. warcarrier.org.     300     IN     A      50.97.177.66 admin.warcarrier.org. 300     IN     A       10.0.0.1 gateway.warcarrier.org. 300 IN   A       10.0.0.124 remote.warcarrier.org. 300     IN     A       10.0.0.15 partner.warcarrier.org. 300 IN   CNAME   warcarrier.weaknetlabs.com. calendar.warcarrier.org. 300 IN   CNAME   login.secureserver.net. direct.warcarrier.org. 300 IN   CNAME   warcarrier.org. warcarrier.org.     300     IN     SOA     beth.ns.cloudflare.com. warcarrier.org. beth.ns.cloudflare.com. 2014011513 18000 3600 86400 1800 Received 401 bytes from 2400:cb00:2049:1::adf5:3a67#53 in 56 ms [trevelyn@shell ~]$ As we see from the output of the host command, we have found a successful DNS zone transfer, which provided us with even more hostnames used by our client target. This attack has provided us with a few CNAME records, which are used as aliases to other servers owned or used by our target, the subnet (class A) IP addresses used by the target, and even the name servers used. We can also see that the default name, direct, used by CloudFlare.com is still set for the cloud service to allow connections directly to the IP of warcarrier.org, which we can use to bypass the cloud service. The host command requires the name server, in our case beth.ns.cloudflare.com, before performing the transfer. What this means for us is that we will need the name server information before querying for a potential DNS zone transfer in our Perl programs. Let's see how we can use Net::DNS for the entire process: #!/usr/bin/perl -w use strict; use Net::DNS; my $usage = "perl dnsZt.pl <domain name>"; die $usage unless my $dom = shift; my $res = Net::DNS::Resolver->new; # dns object my $query = $res->query($dom,"NS"); # query method call for nameservers if($query){ # query of NS was successful foreach my $rr (grep{$_->type eq 'NS'} $query->answer){ $res->nameservers($rr->nsdname); # set the name server print "[>] Testing NS Server: ".$rr->nsdname."n"; my @subdomains = $res->axfr($dom); if ($#subdomains > 0){    print "[!] Successful zone transfer:n";    foreach (@subdomains){    print $_->name."n"; # returns a Net::DNS::RR object    } }else{ # 0 returned domains    print "[>] Transfer failed on " . $rr->nsdname . "n"; } } }else{ # Something went wrong: warn "query failed: ", $res->errorstring,"n"; } The preceding program that uses the Net::DNS Perl module will first query for the name servers used by our target and then test the DNS zone transfer for each target. The grep() function returns a list to the foreach() loop of all name servers (NS) found. The foreach() loop then simply attempts the DNS zone transfer (AXFR) and returns the results if the array is larger than zero elements. Let's test the output on our client target: [trevelyn@shell ~]$ perl dnsZt.pl warcarrier.org [>] Testing NS Server: hank.ns.cloudflare.com [!] Successful zone transfer: warcarrier.org warcarrier.org admin.warcarrier.org gateway.warcarrier.org remote.warcarrier.org partner.warcarrier.org calendar.warcarrier.org direct.warcarrier.org [>] Testing NS Server: beth.ns.cloudflare.com [>] Transfer failed on beth.ns.cloudflare.com [trevelyn@shell ~]$ The preceding (trimmed) output is a successful DNS zone transfer on one of the name servers used by our client target. Traceroute With knowledge of how to glean hostnames and IP addresses from simple queries using Perl, we can take the OSINT a step further and trace our route to the hosts to see what potential target-owned hardware can intercept or relay traffic. For this task, we will use the Net::Traceroute Perl module. Let's take a look at how we can get the IP host information from relaying hosts between us and our target, using this Perl module and the following code: #!/usr/bin/perl -w use strict; use Net::Traceroute; my $dom = shift or die "Usage: perl tracert.pl <domain>"; print "Tracing route to ",$dom,"n"; my $tr = Net::Traceroute->new(host=>$dom,use_tcp=>1); for(my$i=1;$i<=$tr->hops;$i++){        my $hop = $tr->hop_query_host($i,0);        print "IP: ",$hop," hop time: ",$tr->hop_query_time($i,0),               "ms hop status: ",$tr->hop_query_stat($i,0),                " query count: ",$tr->hop_queries($i),"n" if($hop); } In the preceding Perl program, we used the Net::Traceroute Perl module to perform a trace route to the domain given by a command-line argument. The module must be used by first calling the new() method, which we do when defining $tr as a query object. We tell the trace route object $tr that we want to use TCP and also pass the host, which we shift from the command-line arguments. We can pass a lot more parameters to the new() method, one of which is debug=>9 to debug our trace route. A full list can be obtained from the CPAN Search page of the Perl module that can be accessed at http://search.cpan.org/~hag/Net-Traceroute/Traceroute.pm. The hops method is used when constructing the for() loop, which returns an integer value of the hop count. We then assign this to $i and loop through all hop and print statistics, using the methods hop_query_host for the IP address of the host, hop_query_time for the time taken to reach the host, and hop_query_stat that returns the status of the query as an integer value (on our lab machines, it is returned in milliseconds), which can be mapped to the export list of Net::Traceroute according to the module's documentation. Now, let's test this trace route program with a domain and check the output: root@wnld960:~# sudo perl tracert.pl weaknetlabs.com Tracing route to weaknetlabs.com IP: 10.0.0.1 hop time: 0.724ms hop status: 0 query count: 3 IP: 68.85.73.29 hop time: 14.096ms hop status: 0 query count: 3 IP: 69.139.195.37 hop time: 19.173ms hop status: 0 query count: 3 IP: 68.86.94.189 hop time: 31.102ms hop status: 0 query count: 3 IP: 68.86.87.170 hop time: 27.42ms hop status: 0 query count: 3 IP: 50.242.150.186 hop time: 27.808ms hop status: 0 query count: 3 IP: 144.232.20.144 hop time: 33.688ms hop status: 0 query count: 3 IP: 144.232.25.30 hop time: 38.718ms hop status: 0 query count: 3 IP: 144.232.229.46 hop time: 31.242ms hop status: 0 query count: 3 IP: 144.232.9.82 hop time: 99.124ms hop status: 0 query count: 3 IP: 198.144.36.192 hop time: 30.964ms hop status: 0 query count: 3 root@wnld960:~# The output from tracert.pl is just as we expected using the traceroute program of the Linux shell. This functionality can be easily built right into our port scanner application. Shodan Shodan is an online resource that can be used for hardware searching within a specific domain. For instance, a search for hostname:<domain> will provide all the hardware entities found within this specific domain. Shodan is both a public and open source resource for intelligence. Harnessing the full power of Shodan and returning a multipage query is not free. For the examples in this article, the first page of the query results, which are free, were sufficient to provide a suitable amount of information. The returned output is XML, and Perl has some great utilities to parse XML. Luckily, for the purpose of our example, Shodan offers an example query for us to use as export_sample.xml. This XML file contains only one node per host, labeled host. This node contains attributes for the corresponding host and we will use the XML::LibXML::Node class from the XML::LibXML::Node Perl module. First, we will download the XML file and use XML::LibXML to open the local file with the parse_file() method, as shown in the following code: #!/usr/bin/perl -w use strict; use XML::LibXML; my $parser = XML::LibXML->new(); my $doc = $parser->parse_file("export_sample.xml"); foreach my $host ($doc->findnodes('/shodan/host')) { print "Host Found:n"; my @attribs = $host->attributes('/shodan/host'); foreach my $host (@attribs){ # get host attributes print $host =~ m/([^=]+)=.*/," => "; print $host =~ m/.*"([^"]+)"/,"n"; } # next print "nn"; } The preceding Perl program will open the export_sample.xml file and navigate through the host nodes using the simple xpath of /shodan/host. For each <host> node, we call the attribute's method from the XML::LibXML::Node class, which returns an array of all attributes with information such as the IP address, hostname, and more. We then run a regular expression pattern on the $host string to parse out the key, and again with another regexp to get the value. Let's see how this returns data from our sample XML file from ShodanHQ.com: root@wnld960:~#perl shodan.pl Host Found: hostnames => internetdevelopment.ro ip => 109.206.71.21 os => Linux recent 2.4 port => 80 updated => 16.03.2010 Host Found: ip => 113.203.71.21 os => Linux recent 2.4 port => 80 updated => 16.03.2010 Host Found: hostnames => ip-173-201-71-21.ip.secureserver.net ip => 173.201.71.21 os => Linux recent 2.4 port => 80 updated => 16.03.2010 The preceding output is from our shodan.pl Perl program. It loops through all host nodes and prints the attributes. As we can see, Shodan can provide us with some very useful information that we can possibly use to exploit later in our penetration testing. It's also easy to see, without going into elementary Perl coding examples, that we can find exactly what we are looking for from an XML object's attributes using this simple method. We can use this code for other resources as well. More intelligence Gaining information about the actual physical address is also important during a penetration test. Sure, this is public information, but where do we find it? Well, the PTES describes how most states require a legal entity of a company to register with the State Division, which can provide us with a one-stop go-to place for the physical address information, entity ID, service of process agent information, and more. This can be very useful information on our client target. If obtained, we can extend this intelligence by finding out more about the property owners for physical penetration testing and social engineering by checking the city/county's department of land records, real estate, deeds, or even mortgages. All of this data, if hosted on the Web, can be gathered by automated Perl programs, as we did in the example sections of this article using LWP::UserAgent. Summary As we have seen, being creative with our information-gathering techniques can really shine with the power of regular expressions and the ability to spider links. As we learned in the introduction, it's best to do an automated OSINT gathering process along with a manual process because both processes can reveal information that one might have missed. Resources for Article: Further resources on this subject: Ruby and Metasploit Modules [article] Linux Shell Scripting – various recipes to help you [article] Linux Shell Script: Logging Tasks [article]
Read more
  • 0
  • 0
  • 16836
Modal Close icon
Modal Close icon