How-To Tutorials

article-image-gnu-octave-data-analysis-examples

28 Jun 2011

7 min read

GNU Octave: data analysis examples

28 Jun 2011

Loading data files When performing a statistical analysis of a particular problem, you often have some data stored in a file. You can save your variables (or the entire workspace) using different file formats and then load them back in again. Octave can, of course, also load data from files generated by other programs. There are certain restrictions when you do this which we will discuss here. In the following matter, we will only consider ASCII files, that is, readable text files. When you load data from an ASCII file using the load command, the data is treated as a two-dimensional array. We can then think of the data as a matrix where lines represent the matrix rows and columns the matrix columns. For this matrix to be well defined, the data must be organized such that all the rows have the same number of columns (and therefore the columns the same number of rows). For example, the content of a file called series.dat can be: Next we to load this into Octave's workspace: octave:1> load -ascii series.dat; whereby the data is stored in the variable named series. In fact, Octave is capable of loading the data even if you do not specify the ASCII format. The number of rows and columns are then: octave:2> size(series) ans = 4 3 I prefer the file extension .dat, but again this is optional and can be anything you wish, say .txt, .ascii, .data, or nothing at all. In the data files you can have: Octave comments Data blocks separated by blank lines (or equivalent empty rows) Tabs or single and multi-space for number separation Thus, the following data file will successfully load into Octave: # First block 1 232 334 2 245 334 3 456 342 4 555 321 # Second block 1 231 334 2 244 334 3 450 341 4 557 327 The resulting variable is a matrix with 8 rows and 3 columns. If you know the number of blocks or the block sizes, you can then separate the blocked-data. Now, the following data stored in the file bad.dat will not load into Octave's workspace: 1 232.1 334 2 245.2 3 456.23 4 555.6 because line 1 has three columns whereas lines 2-4 have two columns. If you try to load this file, Octave will complain: octave:3> load -ascii bad.dat error: load: bad.dat: inconsisitent number of columns near line 2 error:load: unable to extract matrix size from file 'bad.dat' Simple descriptive statistics Consider an Octave function mcintgr and its vectorized version mcintgrv. This function can evaluate the integral for a mathematical function f in some interval [a; b] where the function is positive. The Octave function is based on the Monte Carlo method and the return value, that is, the integral, is therefore a stochastic variable. When we calculate a given integral, we should as a minimum present the result as a mean or another appropriate measure of a central value together with an associated statistical uncertainty. This is true for any other stochastic variable, whether it is the height of the pupils in class, length of a plant's leaves, and so on. In this section, we will use Octave for the most simple statistical description of stochastic variables. Histogram and moments Let us calculate the integral given in Equation (5.9) one thousand times using the vectorized version of the Monte Carlo integrator: octave:4> for i=1:1000 > s(i) = mcintgrv("sin", 0, pi, 1000); > endfor The array s now contains a sequence of numbers which we know are approximately 2. Before we make any quantitative statistical description, it is always a good idea to first plot a histogram of the data as this gives an approximation to the true underlying probability distribution of the variable s. The easiest way to do this is by using Octave's hist function which can be called using: octave:5> hist(s, 30, 1) The first argument, s, to hist is the stochastic variable, the second is the number of bins that s should be grouped into (here we have used 30), and the third argument gives the sum of the heights of the histogram (here we set it to 1). The histogram is shown in the figure below. If hist is called via the command hist(s), s is grouped into ten bins and the sum of the heights of the histogram is equal to sum(s). From the figure, we see that mcintgrv produces a sequence of random numbers that appear to be normal (or Gaussian) distributed with a mean of 2. This is what we expected. It then makes good sense to describe the variable via the sample mean defined as: where N is the number of samples (here 1000) and si the i'th data point, as well as the sample variance given by: The variance is a measure of the distribution width and therefore an estimate of the statistical uncertainty of the mean value. Sometimes, one uses the standard deviation instead of the variance. The standard deviation is simply the square root of the variance To calculate the sample mean, sample variance, and the standard deviation in Octave, you use: octave:6> mean(s) ans = 1.9999 octave:7> var(s) ans = 0.002028 octave:8> std(s) ans = 0.044976 In the statistical description of the data, we can also include the skewness which measures the symmetry of the underlying distribution around the mean. If it is positive, it is an indication that the distribution has a long tail stretching towards positive values with respect to the mean. If it is negative, it has a long negative tail. The skewness is often defined as: We can calculate this in Octave via: octave:9> skewness(s) ans = -0.15495 This result is a bit surprising because we would assume from the histogram that the data set represents numbers picked from a normal distribution which is symmetric around the mean and therefore has zero skewness. It illustrates an important point—be careful to use the skewness as a direct measure of the distributions symmetry—you need a very large data set to get a good estimate. You can also calculate the kurtosis which measures the flatness of the sample distribution compared to a normal distribution. Negative kurtosis indicates a relative flatter distribution around the mean and a positive kurtosis that the sample distribution has a sharp peak around the mean. The kurtosis is defined by the following: It can be calculated by the kurtosis function. octave:10> kurtosis(s) ans = -0.02310 The kurtosis has the same problem as the skewness—you need a very large sample size to obtain a good estimate. Sample moments As you may know, the sample mean, variance, skewness, and kurtosis are examples of sample moments. The mean is related to the first moment, the variance the second moment, and so forth. Now, the moments are not uniquely defined. One can, for example, define the k'th absolute sample moment pka and k'th central sample moment pkc as: Notice that the first absolute moment is simply the sample mean, but the first central sample moment is zero. In Octave, you can easily retrieve the sample moments using the moment function, for example, to calculate the second central sample moment you use: octave:11> moment(s, 2, 'c') ans = 0.002022 Here the first input argument is the sample data, the second defines the order of the moment, and the third argument specifies whether we want the central moment 'c' or absolute moment 'a' which is the default. Compare the output with the output from Command 7—why is it not the same?

0
0
16254

article-image-ms-access-queries-oracle-sql-developer-12-tool

Packt

17 Mar 2010

12 min read

MS Access Queries with Oracle SQL Developer 1.2 Tool

Packt

17 Mar 2010

12 min read

In my previous article with the Oracle SQL Developer 1.1, I discussed the installation and features of this stand-alone GUI product which can be used to query several database products. Connecting to an Oracle Xe 10G was also described. The many things you do in Oracle 10G XE can also be carried out with the Oracle SQL Developer. It is expected to enhance productivity in your Oracle applications. You can use Oracle SQL Developer to connect, run, and debug SQL, SQL*Plus and PL/SQL. It can run on at least three different operating systems. Using this tool you can connect to Oracle, SQL Server, MySql and MS Access databases. In this article, you will learn how to install the Oracle SQL Developer 1.2 and connect to an MS Access database. The 1.2 version has several features that were not present in version 1.1 especially regarding Migration from other products. Downloading and installing the Oracle SQL Developer Go to the Oracle site (you need to be registered to download) and after accepting the license agreement you will be able to download sqldeveloper-1.2.2998.zip, a 77MB download if you do not have JDK1.5 already installed. You may place this in any directory. From the unzipped contents, double-click on the SQLDeveloper.exe. The User Interface On a Windows machine, you may get a security warning which you may safely override and click on Run. This opens up the splash window shown in the next picture followed by the Oracle SQL Developer interface shown in the picture that follows. Figure 1 The main window The main window of this tool is shown in the next picture. Figure 2 It has a main menu at the very top where you can access File, Edit, View, Navigate, Run, Debug, Source, Migration, Tools and Help menus. The menu item Migration has been added in this new version. Immediately below the main menu on the left, you have a tabbed window with two tabs, Connections, and Reports. This will be the item you have to contend with since most things start only after establishing a connection. The connection brings with it the various related objects in the databases. View Menu The next picture shows the drop-down of the View main menu, where you can see other details such as links to the debugger, reports, connections, and snippets. In this new version, many more items have been added such as Captured Objects, Converted Objects, and Find DB Object. Figure 3 Snippets are often-used SQL statements or clauses that you may want to insert. You may also save your snippets by clicking on the bright green plus sign in the window shown, which opens up the superposed Save Snippet window. Figure 4 In the Run menu item, you can run files as well as look at the Execution Profile. Debug Menu The debug menu item has all the necessary hooks to toggle break points: step into, step over, step out and step to End of Method, etc., including garbage collection and clean up as shown in the next picture. Figure 5 Tools Menu Tools give access to External Tools that can be launched, Exports both DDL and data, schema diff, etc. as shown in the next picture. Figure 6 Help gives you both full-text search and indexed search. This is an important area which you must visit; you can also update the help. Figure 7 About Menu The About drop-down menu item in the above figure opens up the following window where you have complete information about this product that includes version information, properties, and extensions. Figure 8 Migration Menu As mentioned earlier the Migration is a new item in the main menu and its drop-down menu elements are shown in the next picture. It even has a menu item to make a quick migration of all recent versions of MS Access (97, 2000, 2002, and 2003). The Repository Management item is another very useful feature. The MySQL and SQL Server Offline Capture menu item can capture database create scripts from several versions of MySQL and MS SQL Server by locating them on the machine. Figure 9 Connecting to a Microsoft Access Database If you are interested in Oracle 10G XE it will be helpful if you refresh your Oracle 10G XE knowledge or read the several Oracle 10G XE articles whose links are shown on the author’s blog. This is a good place for you to look at new developments, scripts, UI description, etc. This section, however, deals with connecting to an MS Access database. Click on the "Connections" icon with the bright green plus sign as shown in the next figure. Figure 10 This opens up the next window, New/Select Database Connection. This is where you must supply all the information. As you can see it has identified a resident (that is a local Oracle 10G XE server) Oracle 10G XE on the machine. Of course, you need to configure it further. In addition to Oracle, it can connect to MySQL, MS Access, and SQL Server as well. This interface has not changed much from version 1.1; you have the same control elements. Figure 11 On the left-hand side of this window, you will generate the Connection Name and Connection Details once you fill in the appropriate information on the right. Connection name is what you supply; to get connected you need to have a username and password as well. If you want, you can save the password to avoid providing it again and again. At the bottom of the screen, you can save the connection, test it and connect to it. There is also access to online help. In the above window, click on the tab, in the middle of the page, Access. The following window opens in which all you need to do is to use the Browse button to locate the Microsoft Access Database on your machine (windows default for mdb files is My Documents). Figure 12 Hitting the Browse button opens the window, Open with the default location, My Documents—the default directory for MDB files. Figure 13 Choosing a database Charts.mdb and clicking the Open button brings the file pointer to the New / Select Database Connection in the box to the left of the Browse button. When you click on the Test button if the connection is OK you should get an appropriate message. However for the Charts.mdb file you get the following error. Figure 14 The software is complaining about the lack of read access to the system tables. Providing read access to System tables. There are a couple of System tables in MS Access which are usually hidden but can be displayed using Tools option in MS Access. Figure 15 In the View tab if you place a check mark for System objects then you will see the following tables. The System tables are as shown in the red rectangle. Figure 16 If you need to modify the security settings for these tables you can do so as shown in the next figure by following the trail, Tools  Security User and Group permissions. Figure 17 Click on the User and Group Permissions menu item which opens the next window Users and Group Permissions shown here, Figure 18 For the user who is Admin, scroll through each of the system tables and place a check mark for Read Design and Read Data check boxes. Click on the OK button and close the application. Now you again use the Browse button to locate the Charts.mdb file after providing a name for the connection at the top of the New / Select Database Connection page. For this tutorial MyCharts was chosen as the name for the connection. Once this file name appears in the box to the left of the Browse button, click on the Test button. This message screen is very fast (appears and disappears). If there is a problem, it will bring up the message as before. Now click on the Connect button at the bottom of the screen in the New / Select Database Connection page window. This immediately adds the connection MyCharts to the Connections folder shown in the left. The + sign can be clicked to expand all the objects in the Charts.mdb database as shown in the next figure. Figure 19 You can further expand the Table nodes to show the data that the table contains as shown in the next figure for the Portfolio table. Figure 20 The Relationships tab in the above figure shows related and referenced objects as shown. This is just a lone table with no relationships established and therefore none showing. Figure 21 It may be noted that the Oracle SQL Developer can only connect to MDB files. It cannot connect to Microsoft Access projects (ADP files), or the new MS Access 2007 file types. Using the SQL Interface SQL Statements are run from the SQL Worksheet which can be displayed by right clicking the connection and choosing Open SQL Worksheet item from the drop-down list as shown. Figure 22 You will type in the SQL queries in area below Enter SQL Statement label in the above figure (now hidden behind the drop-down menu). Making a new connection with more tables and a relationship In order to run a few simple queries on the connected database, three more tables were imported into Charts.mdb after establishing a relationship between the new tables in the access database as shown in the following figure. Figure 23 Another connection named, NewCharts was created in Oracle SQL Developer. The connection string that SQL Developer will take for NewCharts is of the following format (some white spaces were introduced into the connection string shown to get rid of MS Word warnings). @jdbc:odbc: Driver= {Microsoft Access Driver (*.mdb)}; DBQ=C:Documents and SettingsJayMy DocumentsCharts.mdb; DriverID=22;READONLY=false} This string can be reviewed after a connection is established in the New / Select Database Connection window as shown in the next figure. Figure 24 A simple Query Let us look at a very simple query using the PrincetonTemp table. After entering the query you can execute the statement by clicking on the right pointing green arrowhead as shown. The result of running this query will appear directly below the Enter SQL Statement window as shown. Figure 25 Just above the Enter SQL Statement label is the SQL Toolbar displaying several icons (left to right) which are as follows with the associated key board access: Execute SQL Statement(F9) ->Green arrow head Run Script(F5) Commit(F11) Rollback(F12) Cancel(CTRL+Q) SQL History(F8) Execute Explain Plan(F6) Autotrace(F10) Clear(CTRL+D) It also displays time taken to execute ->0.04255061 seconds. The bottom pane is showing the result of the query in a table format. If you would run the same query with the Run Script (F5) button you would see the result in the Script Output tab of the bottom pane. In addition to SQL you can also use the SQL Worksheet to run SQL *PLUS and PL/SQL statements with some exceptions as long as they are supported by the provider used in making the connection. Viewing relationship between tables in the SQL Developer Three tables Orders, Order Details, and Products were imported into the Charts.mdb after enforcing referential integrity relationships in the Access database as seen earlier. Will the new connection NewCharts be able to see these relationships? This question is answered in the following figure. Click on any one of these tables and you will see the Data as well as Relationships tabs in the bottom pane as shown. Figure 26 Now if you click on the Relationships tab for Products you will see the display just showing three empty columns as seen in an earlier figure. However if you click on the Order Details which really links the three tables you will see the following displayed. Figure 27 Query joining three tables The Orders, Order Details, and Products tables are related by relational integrity as seen above. The following query which chooses one or two columns from each table can be run in a new SQL worksheet. Select Products.ProductName,Orders.ShipName,Orders.OrderDate,[Order Details].Quantityfrom Products, Orders, [Order Details]where Orders.OrderID=[Order Details].OrderIDandProducts.ProductID=[Order Details] and Orders.OrdersDate >'12-31-1997' The result of running this query (only four rows of data shown) can be seen in the next figure. Figure 28 Note that the syntax must match the syntax required by the Provider ODBC, date has to be #12-31-1997# instead of ’12-21-1997’. Summary The article described the new version of the stand alone Oracle’s GUI SQL Developer tool. It can connect to couple of databases such MS Access, SQL Server, Oracle and MySQL. Its utility could have been far greater had it provided connectivity to ODBC and OLE DB. I am disappointed it did not, in this version as well. The connection to MS Access seems to bring in not only tables but the other objects except Data Access Pages, but the external applications that you can use are limited to Word, Notepad, IE etc but not a Report Viewer. These objects and their utility remains to be explored. Only a limited number of features were explored in this article and it excluded new features like Migration and Translation Scratch Editor which translates MS ACCESS, SQL Server and My SQL syntaxes to PL / SQL. These will be considered in a future article.

0
0
16249

article-image-microsoft-cognitive-services

Packt

04 Jan 2017

16 min read

Microsoft Cognitive Services

Packt

04 Jan 2017

16 min read

In this article by Leif Henning Larsen, author of the book Learning Microsoft Cognitive Services, we will look into what Microsoft Cognitive Services offer. You will then learn how to utilize one of the APIs by recognizing faces in images. Microsoft Cognitive Services give developers the possibilities of adding AI-like capabilities to their applications. Using a few lines of code, we can take advantage of powerful algorithms that would usually take a lot of time, effort, and hardware to do yourself. (For more resources related to this topic, see here.) Overview of Microsoft Cognitive Services Using Cognitive Services means you have 21 different APIs at your hand. These are in turn separated into 5 top-level domains according to what they do. They are vision, speech, language, knowledge, and search. Let's see more about them in the following sections. Vision APIs under the vision flags allows your apps to understand images and video content. It allows you to retrieve information about faces, feelings, and other visual content. You can stabilize videos and recognize celebrities. You can read text in images and generate thumbnails from videos and images. There are four APIs contained in the vision area, which we will see now. Computer Vision Using the Computer Vision API, you can retrieve actionable information from images. This means you can identify content (such as image format, image size, colors, faces, and more). You can detect whether an image is adult/racy. This API can recognize text in images and extract it to machine-readable words. It can detect celebrities from a variety of areas. Lastly, it can generate storage-efficient thumbnails with smart cropping functionality. Emotion The Emotion API allows you to recognize emotions, both in images and videos. This can allow for more personalized experiences in applications. The emotions that are detected are cross-cultural emotions: anger, contempt, disgust, fear, happiness, neutral, sadness, and surprise. Face We have already seen the very basic example of what the Face API can do. The rest of the API revolves around the same—to detect, identify, organize, and tag faces in photos. Apart from face detection, you can see how likely it is that two faces belong to the same person. You can identify faces and also find similar-looking faces. Video The Video API is about analyzing, editing, and processing videos in your app. If you have a video that is shaky, the API allows you to stabilize it. You can detect and track faces in videos. If a video contains a stationary background, you can detect motion. The API lets you to generate thumbnail summaries for videos, which allows users to see previews or snapshots quickly. Speech Adding one of the Speech APIs allows your application to hear and speak to your users. The APIs can filter noise and identify speakers. They can drive further actions in your application based on the recognized intent. Speech contains three APIs, which we will discuss now. Bing Speech Adding the Bing Speech API to your application allows you to convert speech to text and vice versa. You can convert spoken audio to text either by utilizing a microphone or other sources in real time or by converting audio from files. The API also offer speech intent recognition, which is trained by Language Understanding Intelligent Service to understand the intent. Speaker Recognition The Speaker Recognition API gives your application the ability to know who is talking. Using this API, you can use verify that someone speaking is who they claim to be. You can also determine who an unknown speaker is, based on a group of selected speakers. Custom Recognition To improve speech recognition, you can use the Custom Recognition API. This allows you to fine-tune speech recognition operations for anyone, anywhere. Using this API, the speech recognition model can be tailored to the vocabulary and speaking style of the user. In addition to this, the model can be customized to match the expected environment of the application. Language APIs related to language allow your application to process natural language and learn how to recognize what users want. You can add textual and linguistic analysis to your application as well as natural language understanding. The following five APIs can be found in the Language area. Bing Spell Check Bing Spell Check API allows you to add advanced spell checking to your application. Language Understanding Intelligent Service (LUIS) Language Understanding Intelligent Service, or LUIS, is an API that can help your application understand commands from your users. Using this API, you can create language models that understand intents. Using models from Bing and Cortana, you can make these models recognize common requests and entities (such as places, time, and numbers). You can add conversational intelligence to your applications. Linguistic Analysis Linguistic Analysis API lets you parse complex text to explore the structure of text. Using this API, you can find nouns, verbs, and more from text, which allows your application to understand who is doing what to whom. Text Analysis Text Analysis API will help you in extracting information from text. You can find the sentiment of a text (whether the text is positive or negative). You will be able to detect language, topic, and key phrases used throughout the text. Web Language Model Using the Web Language Model (WebLM) API, you are able to leverage the power of language models trained on web-scale data. You can use this API to predict which words or sequences follow a given sequence or word. Knowledge When talking about Knowledge APIs, we are talking about APIs that allow you to tap into rich knowledge. This may be knowledge from the Web, it may be academia, or it may be your own data. Using these APIs, you will be able to explore different nuances of knowledge. The following four APIs are contained in the Knowledge area. Academic Using the Academic API, you can explore relationships among academic papers, journals, and authors. This API allows you to interpret natural language user query strings, which allows your application to anticipate what the user is typing. It will evaluate the said expression and return academic knowledge entities. Entity Linking Entity Linking is the API you would use to extend knowledge of people, places, and events based on the context. As you may know, a single word may be used differently based on the context. Using this API allows you to recognize and identify each separate entity within a paragraph based on the context. Knowledge Exploration The Knowledge Exploration API will let you add the ability to use interactive search for structured data in your projects. It interprets natural language queries and offers auto-completions to minimize user effort. Based on the query expression received, it will retrieve detailed information about matching objects. Recommendations The Recommendations API allows you to provide personalized product recommendations for your customers. You can use this API to add frequently bought together functionality to your application. Another feature you can add is item-to-item recommendations, which allow customers to see what other customers who likes this also like. This API will also allow you to add recommendations based on the prior activity of the customer. Search Search APIs give you the ability to make your applications more intelligent with the power of Bing. Using these APIs, you can use a single call to access data from billions of web pages, images videos, and news. The following five APIs are in the search domain. Bing Web Search With Bing Web Search, you can search for details in billions of web documents indexed by Bing. All the results can be arranged and ordered according to the layout you specify, and the results are customized to the location of the end user. Bing Image Search Using Bing Image Search API, you can add advanced image and metadata search to your application. Results include URLs to images, thumbnails, and metadata. You will also be able to get machine-generated captions and similar images and more. This API allows you to filter the results based on image type, layout, freshness (how new is the image), and license. Bing Video Search Bing Video Search will allow you to search for videos and returns rich results. The results contain metadata from the videos, static- or motion- based thumbnails, and the video itself. You can add filters to the result based on freshness, video length, resolution, and price. Bing News Search If you add Bing News Search to your application, you can search for news articles. Results can include authoritative image, related news and categories, information on the provider, URL, and more. To be more specific, you can filter news based on topics. Bing Autosuggest Bing Autosuggest API is a small, but powerful one. It will allow your users to search faster with search suggestions, allowing you to connect powerful search to your apps. Detecting faces with the Face API We have seen what the different APIs can do. Now we will test the Face API. We will not be doing a whole lot, but we will see how simple it is to detect faces in images. The steps we need to cover to do this are as follows: Register for a free Face API preview subscription. Add necessary NuGet packages to our project. Add some UI to the test application. Detect faces on command. Head over to https://www.microsoft.com/cognitive-services/en-us/face-api to start the process of registering for a free subscription to the Face API. By clicking on the yellow button, stating Get started for free,you will be taken to a login page. Log in with your Microsoft account, or if you do not have one, register for one. Once logged in, you will need to verify that the Face API Preview has been selected in the list and accept the terms and conditions. With that out of the way, you will be presented with the following: You will need one of the two keys later, when we are accessing the API. In Visual Studio, create a new WPF application. Following the instructions at https://www.codeproject.com/articles/100175/model-view-viewmodel-mvvm-explained, create a base class that implements the INotifyPropertyChanged interface and a class implementing the ICommand interface. The first should be inherited by the ViewModel, the MainViewModel.cs file, while the latter should be used when creating properties to handle button commands. The Face API has a NuGet package, so we need to add that to our project. Head over to NuGet Package Manager for the project we created earlier. In the Browse tab, search for the Microsoft.ProjectOxford.Face package and install the it from Microsoft: As you will notice, another package will also be installed. This is the Newtonsoft.Json package, which is required by the Face API. The next step is to add some UI to our application. We will be adding this in the MainView.xaml file. First, we add a grid and define some rows for the grid: <Grid> <Grid.RowDefinitions> <RowDefinition Height="*" /> <RowDefinition Height="20" /> <RowDefinition Height="30" /> </Grid.RowDefinitions> Three rows are defined. The first is a row where we will have an image. The second is a line for status message, and the last is where we will place some buttons: Next, we add our image element: <Image x_Name="FaceImage" Stretch="Uniform" Source="{Binding ImageSource}" Grid.Row="0" /> We have given it a unique name. By setting the Stretch parameter to Uniform, we ensure that the image keeps its aspect ratio. Further on, we place this element in the first row. Last, we bind the image source to a BitmapImage interface in the ViewModel, which we will look at in a bit. The next row will contain a text block with some status text. The text property will be bound to a string property in the ViewModel: <TextBlock x_Name="StatusTextBlock" Text="{Binding StatusText}" Grid.Row="1" /> The last row will contain one button to browse for an image and one button to be able to detect faces. The command properties of both buttons will be bound to the DelegateCommand properties in the ViewModel: <Button x_Name="BrowseButton" Content="Browse" Height="20" Width="140" HorizontalAlignment="Left" Command="{Binding BrowseButtonCommand}" Margin="5, 0, 0, 5" Grid.Row="2" /> <Button x_Name="DetectFaceButton" Content="Detect face" Height="20" Width="140" HorizontalAlignment="Right" Command="{Binding DetectFaceCommand}" Margin="0, 0, 5, 5" Grid.Row="2"/> With the View in place, make sure that the code compiles and run it. This should present you with the following UI: The last part is to create the binding properties in our ViewModel and make the buttons execute something. Open the MainViewModel.cs file. First, we define two variables: private string _filePath; private IFaceServiceClient _faceServiceClient; The string variable will hold the path to our image, while the IFaceServiceClient variable is to interface the Face API. Next we define two properties: private BitmapImage _imageSource; public BitmapImage ImageSource { get { return _imageSource; } set { _imageSource = value; RaisePropertyChangedEvent("ImageSource"); } } private string _statusText; public string StatusText { get { return _statusText; } set { _statusText = value; RaisePropertyChangedEvent("StatusText"); } } What we have here is a property for the BitmapImage mapped to the Image element in the view. We also have a string property for the status text, mapped to the text block element in the view. As you also may notice, when either of the properties is set, we call the RaisePropertyChangedEvent method. This will ensure that the UI is updated when either of the properties has new values. Next, we define our two DelegateCommand objects and do some initialization through the constructor: public ICommand BrowseButtonCommand { get; private set; } public ICommand DetectFaceCommand { get; private set; } public MainViewModel() { StatusText = "Status: Waiting for image..."; _faceServiceClient = new FaceServiceClient("YOUR_API_KEY_HERE"); BrowseButtonCommand = new DelegateCommand(Browse); DetectFaceCommand = new DelegateCommand(DetectFace, CanDetectFace); } In our constructor, we start off by setting the status text. Next, we create an object of the Face API, which needs to be created with the API key we got earlier. At last, we create the DelegateCommand object for our command properties. Note how the browse command does not specify a predicate. This means it will always be possible to click on the corresponding button. To make this compile, we need to create the functions specified in the DelegateCommand constructors—the Browse, DetectFace, and CanDetectFace functions: private void Browse(object obj) { var openDialog = new Microsoft.Win32.OpenFileDialog(); openDialog.Filter = "JPEG Image(*.jpg)|*.jpg"; bool? result = openDialog.ShowDialog(); if (!(bool)result) return; We start the Browse function by creating an OpenFileDialog object. This dialog is assigned a filter for JPEG images, and in turn it is opened. When the dialog is closed, we check the result. If the dialog was cancelled, we simply stop further execution: _filePath = openDialog.FileName; Uri fileUri = new Uri(_filePath); With the dialog closed, we grab the filename of the file selected and create a new URI from it: BitmapImage image = new BitmapImage(fileUri); image.CacheOption = BitmapCacheOption.None; image.UriSource = fileUri; With the newly created URI, we want to create a new BitmapImage interface. We specify it to use no cache, and we set the URI source the URI we created: ImageSource = image; StatusText = "Status: Image loaded..."; } The last step we take is to assign the bitmap image to our BitmapImage property, so the image is shown in the UI. We also update the status text to let the user know the image has been loaded. Before we move on, it is time to make sure that the code compiles and that you are able to load an image into the View: private bool CanDetectFace(object obj) { return !string.IsNullOrEmpty(ImageSource?.UriSource.ToString()); } The CanDetectFace function checks whether or not the detect faces button should be enabled. In this case, it checks whether our image property actually has a URI. If it does, by extension that means we have an image, and we should be able to detect faces: private async void DetectFace(object obj) { FaceRectangle[] faceRects = await UploadAndDetectFacesAsync(); string textToSpeak = "No faces detected"; if (faceRects.Length == 1) textToSpeak = "1 face detected"; else if (faceRects.Length > 1) textToSpeak = $"{faceRects.Length} faces detected"; Debug.WriteLine(textToSpeak); } Our DetectFace method calls an async method to upload and detect faces. The return value contains an array of FaceRectangles. This array contains the rectangle area for all face positions in the given image. We will look into the function we call in a bit. After the call has finished executing, we print a line with the number of faces to the debug console window: private async Task<FaceRectangle[]> UploadAndDetectFacesAsync() { StatusText = "Status: Detecting faces..."; try { using (Stream imageFileStream = File.OpenRead(_filePath)) { In the UploadAndDetectFacesAsync function, we create a Stream object from the image. This stream will be used as input for the actual call to the Face API service: Face[] faces = await _faceServiceClient.DetectAsync(imageFileStream, true, true, new List<FaceAttributeType>() { FaceAttributeType.Age }); This line is the actual call to the detection endpoint for the Face API. The first parameter is the file stream we created in the previous step. The rest of the parameters are all optional. The second parameter should be true if you want to get a face ID. The next specifies if you want to receive face landmarks or not. The last parameter takes a list of facial attributes you may want to receive. In our case, we want the age parameter to be returned, so we need to specify that. The return type of this function call is an array of faces with all the parameters you have specified: List<double> ages = faces.Select(face => face.FaceAttributes.Age).ToList(); FaceRectangle[] faceRects = faces.Select(face => face.FaceRectangle).ToArray(); StatusText = "Status: Finished detecting faces..."; foreach(var age in ages) { Console.WriteLine(age); } return faceRects; } } The first line in the previous code iterates over all faces and retrieves the approximate age of all faces. This is later printed to the debug console window, in the following foreach loop. The second line iterates over all faces and retrieves the face rectangle with the rectangular location of all faces. This is the data we return to the calling function. Add a catch clause to finish the method. In case an exception is thrown, in our API call, we catch that. You want to show the error message and return an empty FaceRectangle array. With that code in place, you should now be able to run the full example. The end result will look like the following image: The resulting debug console window will print the following text: 1 face detected 23,7 Summary In this article, we looked at what Microsoft Cognitive Services offer. We got a brief description of all the APIs available. From there, we looked into the Face API, where we saw how to detect faces in images. Resources for Article: Further resources on this subject: Auditing and E-discovery [article] The Sales and Purchase Process [article] Manage Security in Excel [article]

0
0
16238

article-image-tableau-data-extract-best-practices

Packt

12 Dec 2016

11 min read

Tableau Data Extract Best Practices

Packt

12 Dec 2016

11 min read

0
0
16226

Packt

20 Aug 2014

19 min read

Importing Dynamic Data

Packt

20 Aug 2014

19 min read

In this article by Chad Adams, author of the book Learning Python Data Visualization, we will go over the finer points of pulling data from the Web using the Python language and its built-in libraries and cover parsing XML, JSON, and JSONP data. (For more resources related to this topic, see here.) Since we now have an understanding of how to work with the pygal library and building charts and graphics in general, this is the time to start looking at building an application using Python. In this article, we will take a look at the fundamentals of pulling data from the Web, parsing the data, and adding it to our code base and formatting the data into a useable format, and we will look at how to carry those fundamentals over to our Python code. We will also cover parsing XML and JSON data. Pulling data from the Web For many non-developers, it may seem like witchcraft that developers are magically able to pull data from an online resource and integrate that with an iPhone app, or a Windows Store app, or pull data to a cloud resource that is able to generate various versions of the data upon request. To be fair, they do have a general understanding; data is pulled from the Web and formatted to their app of choice. They just may not get the full background of how that process workflow happens. It's the same case with some developers as well—many developers mainly work on a technology that only works on a locked down environment, or generally, don't use the Internet for their applications. Again, they understand the logic behind it; somehow an RSS feed gets pulled into an application. In many languages, the same task is done in various ways, usually depending on which language is used. Let's take a look at a few examples using Packt's own news RSS feed, using an iOS app pulling in data via Objective-C. Now, if you're reading this and not familiar with Objective-C, that's OK, the important thing is that we have the inner XML contents of an XML file showing up in an iPhone application: #import "ViewController.h" @interfaceViewController () @property (weak, nonatomic) IBOutletUITextView *output; @end @implementationViewController - (void)viewDidLoad { [super viewDidLoad]; // Do any additional setup after loading the view, typically from a nib. NSURL *packtURL = [NSURLURLWithString:@"http://www.packtpub.com/rss.xml"]; NSURLRequest *request = [NSURLRequestrequestWithURL:packtURL]; NSURLConnection *connection = [[NSURLConnectionalloc] initWithRequest:requestdelegate:selfstartImmediately:YES]; [connection start]; } - (void)connection:(NSURLConnection *)connection didReceiveData:(NSData *)data { NSString *downloadstring = [[NSStringalloc] initWithData:dataencoding:NSUTF8StringEncoding]; [self.outputsetText:downloadstring]; } - (void)didReceiveMemoryWarning { [superdidReceiveMemoryWarning]; // Dispose of any resources that can be recreated. } @end Here, we can see in iPhone Simulator that our XML output is pulled dynamically through HTTP from the Web to our iPhone simulator. This is what we'll want to get started with doing in Python: The XML refresher Extensible Markup Language (XML) is a data markup language that sets a series of rules and hierarchy to a data group, which is stored as a static file. Typically, servers update these XML files on the Web periodically to be reused as data sources. XML is really simple to pick up as it's similar to HTML. You can start with the document declaration in this case: <?xml version="1.0" encoding="utf-8"?> Next, a root node is set. A node is like an HTML tag (which is also called a node). You can tell it's a node by the brackets around the node's name. For example, here's a node named root: <root></root> Note that we close the node by creating a same-named node with a backslash. We can also add parameters to the node and assign a value, as shown in the following root node: <root parameter="value"></root> Data in XML is set through a hierarchy. To declare that hierarchy, we create another node and place that inside the parent node, as shown in the following code: <root parameter="value"> <subnode>Subnode's value</subnode> </root> In the preceding parent node, we created a subnode. Inside the subnode, we have an inner value called Subnode's value. Now, in programmatical terms, getting data from an XML data file is a process called parsing. With parsing, we specify where in the XML structure we would like to get a specific value; for instance, we can crawl the XML structure and get the inner contents like this: /root/subnode This is commonly referred to as XPath syntax, a cross-language way of going through an XML file. For more on XML and XPath, check out the full spec at: http://www.w3.org/TR/REC-xml/ and here http://www.w3.org/TR/xpath/ respectively. RSS and the ATOM Really simple syndication (RSS) is simply a variation of XML. RSS is a spec that defines specific nodes that are common for sending data. Typically, many blog feeds include an RSS option for users to pull down the latest information from those sites. Some of the nodes used in RSS include rss, channel, item, title, description, pubDate, link, and GUID. Looking at our iPhone example in this article from the Pulling data from the Web section, we can see what a typical RSS structure entails. RSS feeds are usually easy to spot since the spec requires the root node to be named rss for it to be a true RSS file. In some cases, some websites and services will use .rss rather than .xml; this is typically fine since most readers for RSS content will pull in the RSS data like an XML file, just like in the iPhone example. Another form of XML is called ATOM. ATOM was another spec similar to RSS, but developed much later than RSS. Because of this, ATOM has more features than RSS: XML namespacing, specified content formats (video, or audio-specific URLs), support for internationalization, and multilanguage support, just to name a few. ATOM does have a few different nodes compared to RSS; for instance, the root node to an RSS feed would be <rss>. ATOM's root starts with <feed>, so it's pretty easy to spot the difference. Another difference is that ATOM can also end in .atom or .xml. For more on the RSS and ATOM spec, check out the following sites: http://www.rssboard.org/rss-specification http://tools.ietf.org/html/rfc4287 Understanding HTTP All these samples from the RSS feed of the Packt Publishing website show one commonality that's used regardless of the technology coded in, and that is the method used to pull down these static files is through the Hypertext Transfer Protocol (HTTP). HTTP is the foundation of Internet communication. It's a protocol with two distinct types of requests: a request for data or GET and a push of data called a POST. Typically, when we download data using HTTP, we use the GET method of HTTP in order to pull down the data. The GET request will return a string or another data type if we mention a specific type. We can either use this value directly or save to a variable. With a POST request, we are sending values to a service that handles any incoming values; say we created a new blog post's title and needed to add to a list of current titles, a common way of doing that is with URL parameters. A URL parameter is an existing URL but with a suffixed key-value pair. The following is a mock example of a POST request with a URL parameter: http://www.yourwebsite.com/blogtitles/?addtitle=Your%20New%20Title If our service is set up correctly, it will scan the POST request for a key of addtitle and read the value, in this case: Your New Title. We may notice %20 in our title for our request. This is an escape character that allows us to send a value with spaces; in this case, %20 is a placehoder for a space in our value. Using HTTP in Python The RSS samples from the Packt Publishing website show a few commonalities we use in programming when working in HTTP; one is that we always account for the possibility of something potentially going wrong with a connection and we always close our request when finished. Here's an example on how the same RSS feed request is done in Python using a built-in library called urllib2: #!/usr/bin/env python # -*- coding: utf-8 -*- import urllib2 try: #Open the file via HTTP. response = urllib2.urlopen('http://www.packtpub.com/rss.xml') #Read the file to a variable we named 'xml' xml = response.read() #print to the console. print(xml) #Finally, close our open network. response.close() except: #If we have an issue show a message and alert the user. print('Unable to connect to RSS...') If we look in the following console output, we can see the XML contents just as we saw in our iOS code example: In the example, notice that we wrapped our HTTP request around a try except block. For those coming from another language, except can be considered the same as a catch statement. This allows us to detect if an error occurs, which might be an incorrect URL or lack of connectivity, for example, to programmatically set an alternate result to our Python script. Parsing XML in Python with HTTP With these examples including our Python version of the script, it's still returning a string of some sorts, which by itself isn't of much use to grab values from the full string. In order to grab specific strings and values from an XML pulled through HTTP, we need to parse it. Luckily, Python has a built-in object in the Python main library for this, called as ElementTree, which is a part of the XML library in Python. Let's incorporate ElementTree into our example and see how that works: # -*- coding: utf-8 -*- import urllib2 from xml.etree import ElementTree try: #Open the file via HTTP. response = urllib2.urlopen('http://www.packtpub.com/rss.xml') tree = ElementTree.parse(response) root = tree.getroot() #Create an 'Element' group from our XPATH using findall. news_post_title = root.findall("channel//title") #Iterate in all our searched elements and print the inner text for each. for title in news_post_title: print title.text #Finally, close our open network. response.close() except Exception as e: #If we have an issue show a message and alert the user. print(e) The following screenshot shows the results of our script: As we can see, our output shows each title for each blog post. Notice how we used channel//item for our findall() method. This is using XPath, which allows us to write in a shorthand manner on how to iterate an XML structure. It works like this; let's use the http://www.packtpub.com feed as an example. We have a root, followed by channel, then a series of title elements. The findall() method found each element and saved them as an Element type specific to the XML library ElementTree uses in Python, and saved each of those into an array. We can then use a for in loop to iterate each one and print the inner text using the text property specific to the Element type. You may notice in the recent example that I changed except with a bit of extra code and added Exception as e. This allows us to help debug issues and print them to a console or display a more in-depth feedback to the user. An Exception allows for generic alerts that the Python libraries have built-in warnings and errors to be printed back either through a console or handled with the code. They also have more specific exceptions we can use such as IOException, which is specific for working with file reading and writing. About JSON Now, another data type that's becoming more and more common when working with web data is JSON. JSON is an acronym for JavaScript Object Notation, and as the name implies, is indeed true JavaScript. It has become popular in recent years with the rise of mobile apps, and Rich Internet Applications (RIA). JSON is great for JavaScript developers; it's easier to work with when working in HTML content, compared to XML. Because of this, JSON is becoming a more common data type for web and mobile application development. Parsing JSON in Python with HTTP To parse JSON data in Python is a pretty similar process; however, in this case, our ElementTree library isn't needed, since that only works with XML. We need a library designed to parse JSON using Python. Luckily, the Python library creators already have a library for us, simply called json. Let's build an example similar to our XML code using the json import; of course, we need to use a different data source since we won't be working in XML. One thing we may note is that there aren't many public JSON feeds to use, many of which require using a code that gives a developer permission to generate a JSON feed through a developer API, such as Twitter's JSON API. To avoid this, we will use a sample URL from Google's Books API, which will show demo data of Pride and Prejudice, Jane Austen. We can view the JSON (or download the file), by typing in the following URL: https://www.googleapis.com/books/v1/volumes/s1gVAAAAYAAJ Notice the API uses HTTPS, many JSON APIs are moving to secure methods of transmitting data, so be sure to include the secure in HTTP, with HTTPS. Let's take a look at the JSON output: { "kind": "books#volume", "id": "s1gVAAAAYAAJ", "etag": "yMBMZ85ENrc", "selfLink": "https://www.googleapis.com/books/v1/volumes/s1gVAAAAYAAJ", "volumeInfo": { "title": "Pride and Prejudice", "authors": [ "Jane Austen" ], "publisher": "C. Scribner's Sons", "publishedDate": "1918", "description": "Austen's most celebrated novel tells the story of Elizabeth Bennet, a bright, lively young woman with four sisters, and a mother determined to marry them to wealthy men. At a party near the Bennets' home in the English countryside, Elizabeth meets the wealthy, proud Fitzwilliam Darcy. Elizabeth initially finds Darcy haughty and intolerable, but circumstances continue to unite the pair. Mr. Darcy finds himself captivated by Elizabeth's wit and candor, while her reservations about his character slowly vanish. The story is as much a social critique as it is a love story, and the prose crackles with Austen's wry wit.", "readingModes": { "text": true, "image": true }, "pageCount": 401, "printedPageCount": 448, "dimensions": { "height": "18.00 cm" }, "printType": "BOOK", "averageRating": 4.0, "ratingsCount": 433, "contentVersion": "1.1.5.0.full.3", "imageLinks": { "smallThumbnail": "http://bks8.books.google.com/books?id=s1gVAAAAYAAJ&printsec =frontcover&img=1&zoom=5&edge=curl&imgtk=AFLRE73F8btNqKpVjGX6q7V3XS77 QA2PftQUxcEbU3T3njKNxezDql_KgVkofGxCPD3zG1yq39u0XI8s4wjrqFahrWQ- 5Epbwfzfkoahl12bMQih5szbaOw&source=gbs_api", "thumbnail": "http://bks8.books.google.com/books?id=s1gVAAAAYAAJ&printsec= frontcover&img=1&zoom=1&edge=curl&imgtk=AFLRE70tVS8zpcFltWh_ 7K_5Nh8BYugm2RgBSLg4vr9tKRaZAYoAs64RK9aqfLRECSJq7ATs_j38JRI3D4P48-2g_ k4-EY8CRNVReZguZFMk1zaXlzhMNCw&source=gbs_api", "small": "http://bks8.books.google.com/books?id=s1gVAAAAYAAJ&printsec =frontcover&img=1&zoom=2&edge=curl&imgtk=AFLRE71qcidjIs37x0jN2dGPstn 6u2pgeXGWZpS1ajrGgkGCbed356114HPD5DNxcR5XfJtvU5DKy5odwGgkrwYl9gC9fo3y- GM74ZIR2Dc-BqxoDuUANHg&source=gbs_api", "medium": "http://bks8.books.google.com/books?id=s1gVAAAAYAAJ&printsec= frontcover&img=1&zoom=3&edge=curl&imgtk=AFLRE73hIRCiGRbfTb0uNIIXKW 4vjrqAnDBSks_ne7_wHx3STluyMa0fsPVptBRW4yNxNKOJWjA4Od5GIbEKytZAR3Nmw_ XTmaqjA9CazeaRofqFskVjZP0&source=gbs_api", "large": "http://bks8.books.google.com/books?id=s1gVAAAAYAAJ&printsec= frontcover&img=1&zoom=4&edge=curl&imgtk=AFLRE73mlnrDv-rFsL- n2AEKcOODZmtHDHH0QN56oG5wZsy9XdUgXNnJ_SmZ0sHGOxUv4sWK6GnMRjQm2eEwnxIV4dcF9eBhghMcsx -S2DdZoqgopJHk6Ts&source=gbs_api", "extraLarge": "http://bks8.books.google.com/books?id=s1gVAAAAYAAJ&printsec= frontcover&img=1&zoom=6&edge=curl&imgtk=AFLRE73KIXHChszn TbrXnXDGVs3SHtYpl8tGncDPX_7GH0gd7sq7SA03aoBR0mDC4-euzb4UCIDiDNLYZUBJwMJxVX_ cKG5OAraACPLa2QLDcfVkc1pcbC0&source=gbs_api" }, "language": "en", "previewLink": "http://books.google.com/books?id=s1gVAAAAYAAJ&hl=&source=gbs_api", "infoLink": "http://books.google.com/books?id=s1gVAAAAYAAJ&hl=&source=gbs_api", "canonicalVolumeLink": "http://books.google.com/books/about/ Pride_and_Prejudice.html?hl=&id=s1gVAAAAYAAJ" }, "layerInfo": { "layers": [ { "layerId": "geo", "volumeAnnotationsVersion": "6" } ] }, "saleInfo": { "country": "US", "saleability": "FREE", "isEbook": true, "buyLink": "http://books.google.com/books?id=s1gVAAAAYAAJ&hl=&buy=&source=gbs_api" }, "accessInfo": { "country": "US", "viewability": "ALL_PAGES", "embeddable": true, "publicDomain": true, "textToSpeechPermission": "ALLOWED", "epub": { "isAvailable": true, "downloadLink": "http://books.google.com/books/download /Pride_and_Prejudice.epub?id=s1gVAAAAYAAJ&hl=&output=epub &source=gbs_api" }, "pdf": { "isAvailable": true, "downloadLink": "http://books.google.com/books/download/Pride_and_Prejudice.pdf ?id=s1gVAAAAYAAJ&hl=&output=pdf&sig=ACfU3U3dQw5JDWdbVgk2VRHyDjVMT4oIaA &source=gbs_api" }, "webReaderLink": "http://books.google.com/books/reader ?id=s1gVAAAAYAAJ&hl=&printsec=frontcover& output=reader&source=gbs_api", "accessViewStatus": "FULL_PUBLIC_DOMAIN", "quoteSharingAllowed": false } } One downside to JSON is that it can be hard to read complex data. So, if we run across a complex JSON feed, we can visualize it using a JSON Visualizer. Visual Studio includes one with all its editions, and a web search will also show multiple online sites where you can paste JSON and an easy-to-understand data structure will be displayed. Here's an example using http://jsonviewer.stack.hu/ loading our example JSON URL: Next, let's reuse some of our existing Python code using our urllib2 library to request our JSON feed, and then we will parse it with the Python's JSON library. Let's parse the volumeInfo node of the book by starting with the JSON (root) node that is followed by volumeInfo as the subnode. Here's our example from the XML section, reworked using JSON to parse all child elements: # -*- coding: utf-8 -*- import urllib2 import json try: #Set a URL variable. url = 'https://www.googleapis.com/books/v1/volumes/s1gVAAAAYAAJ' #Open the file via HTTP. response = urllib2.urlopen(url) #Read the request as one string. bookdata = response.read() #Convert the string to a JSON object in Python. data = json.loads(bookdata) for r in data ['volumeInfo']: print r #Close our response. response.close() except: #If we have an issue show a message and alert the user. print('Unable to connect to JSON API...') Here's our output. It should match the child nodes of volumeInfo, which it does in the output screen, as shown in the following screenshot: Well done! Now, let's grab the value for title. Look at the following example and notice we have two brackets: one for volumeInfo and another for title. This is similar to navigating our XML hierarchy: # -*- coding: utf-8 -*- import urllib2 import json try: #Set a URL variable. url = 'https://www.googleapis.com/books/v1/volumes/s1gVAAAAYAAJ' #Open the file via HTTP. response = urllib2.urlopen(url) #Read the request as one string. bookdata = response.read() #Convert the string to a JSON object in Python. data = json.loads(bookdata) print data['volumeInfo']['title'] #Close our response. response.close() except Exception as e: #If we have an issue show a message and alert the user. #'Unable to connect to JSON API...' print(e) The following screenshot shows the results of our script: As you can see in the preceding screenshot, we return one line with Pride and Prejudice parsed from our JSON data. About JSONP JSONP, or JSON with Padding, is actually JSON but it is set up differently compared to traditional JSON files. JSONP is a workaround for web cross-browser scripting. Some web services can serve up JSONP rather than pure JSON JavaScript files. The issue with that is JSONP isn't compatible with many JSON Python-based parsers including one covered here, so you will want to avoid JSONP style JSON whenever possible. So how can we spot JSONP files; do they have a different extension? No, it's simply a wrapper for JSON data; here's an example without JSONP: /* *Regular JSON */ { authorname: 'Chad Adams' } The same example with JSONP: /* * JSONP */ callback({ authorname: 'Chad Adams' }); Notice we wrapped our JSON data with a function wrapper, or a callback. Typically, this is what breaks in our parsers and is a giveaway that this is a JSONP-formatted JSON file. In JavaScript, we can even call it in code like this: /* * Using JSONP in JavaScript */ callback = function (data) { alert(data.authorname); }; JSONP with Python We can get around a JSONP data source though, if we need to; it just requires a bit of work. We can use the str.replace() method in Python to strip out the callback before running the string through our JSON parser. If we were parsing our example JSONP file in our JSON parser example, the string would look something like this: #Convert the string to a JSON object in Python. data = json.loads(bookdata.replace('callback(', '').) .replace(')', '')) Summary In this article, we covered HTTP concepts and methodologies for pulling strings and data from the Web. We learned how to do that with Python using the urllib2 library, and parsed XML data and JSON data. We discussed the differences between JSON and JSONP, and how to work around JSONP if needed. Resources for Article: Further resources on this subject: Introspecting Maya, Python, and PyMEL [Article] Exact Inference Using Graphical Models [Article] Automating Your System Administration and Deployment Tasks Over SSH [Article]

0
0
16218

article-image-working-forensic-evidence-container-recipes

Packt

07 Mar 2018

13 min read

Working with Forensic Evidence Container Recipes

Packt

07 Mar 2018

13 min read

In this article by Preston Miller and Chapin Bryce, authors of Learning Python for Forensics, we introduce a recipe from our upcoming book, Python Digital Forensics Cookbook. In Python Digital Forensics Cookbook, each chapter is comprised of many scripts, or recipes, falling under specific themes. The "Iterating Through Files" recipe shown here, is from our chapter that introduces the Sleuth Kit's Python binding's, pystk3, and other libraries, to programmatically interact with forensic evidence containers. Specifically, this recipe shows how to access a forensic evidence container and iterate through all of its files to create an active file listing of its contents. (For more resources related to this topic, see here.) If you are reading this article, it goes without saying that Python is a key tool in DFIR investigations. However, most examiners, are not familiar with or do not take advantage of the Sleuth Kit's Python bindings. Imagine being able to run your existing scripts against forensic containers without needing to mount them or export loose files. This recipe continues to introduce the library, pytsk3, that will allow us to do just that and take our scripting capabilities to the next level. In this recipe, we learn how to recurse through the filesystem and create an active file listing. Oftentimes, one of the first questions we, as the forensic examiner, are asked is "What data is on the device?". An active file listing comes in handy here. Creating a file listing of loose files is a very straightforward task in Python. However, this will be slightly more complicated because we are working with a forensic image rather than loose files. This recipe will be a cornerstone for future scripts as it will allow us to recursively access and process every file in the image. As we continue to introduce new concepts and features from the Sleuth Kit, we will add new functionality to our previous recipes in an iterative process. In a similar way, this recipe will become integral in future recipes to iterate through directories and process files. Getting started Refer to the Getting started section in the Opening Acquisitions recipe for information on the build environment and setup details for pytsk3 and pyewf. All other libraries used in this script are present in Python's standard library. How to do it... We perform the following steps in this recipe: Import argparse, csv, datetime, os, pytsk3, pyewf, and sys; Identify if the evidence container is a raw (DD) image or an EWF (E01) container; Access the forensic image using pytsk3; Recurse through all directories in each partition; Store file metadata in a list; And write the active file list to a CSV. How it works... This recipe's command-line handler takes three positional arguments: EVIDENCE_FILE, TYPE, OUTPUT_CSV which represents the path to the evidence file, the type of evidence file, and the output CSV file, respectively. Similar to the previous recipe, the optional p switch can be supplied to specify a partition type. We use the os.path.dirname() method to extract the desired output directory path for the CSV file and, with the os.makedirs() function, create the necessary output directories if they do not exist. if __name__ == '__main__': # Command-line Argument Parser parser = argparse.ArgumentParser() parser.add_argument("EVIDENCE_FILE", help="Evidence file path") parser.add_argument("TYPE", help="Type of Evidence", choices=("raw", "ewf")) parser.add_argument("OUTPUT_CSV", help="Output CSV with lookup results") parser.add_argument("-p", help="Partition Type", choices=("DOS", "GPT", "MAC", "SUN")) args = parser.parse_args() directory = os.path.dirname(args.OUTPUT_CSV) if not os.path.exists(directory) and directory != "": os.makedirs(directory) Once we have validated the input evidence file by checking that it exists and is a file, the four arguments are passed to the main() function. If there is an issue with initial validation of the input, an error is printed to the console before the script exits. if os.path.exists(args.EVIDENCE_FILE) and os.path.isfile(args.EVIDENCE_FILE): main(args.EVIDENCE_FILE, args.TYPE, args.OUTPUT_CSV, args.p) else: print("[-] Supplied input file {} does not exist or is not a file".format(args.EVIDENCE_FILE)) sys.exit(1) In the main() function, we instantiate the volume variable with None to avoid errors referencing it later in the script. After printing a status message to the console, we check if the evidence type is an E01 to properly process it and create a valid pyewf handle as demonstrated in more detail in the Opening Acquisitions recipe. Refer to that recipe for more details as this part of the function is identical. The end result is the creation of the pytsk3 handle, img_info, for the user supplied evidence file. def main(image, img_type, output, part_type): volume = None print "[+] Opening {}".format(image) if img_type == "ewf": try: filenames = pyewf.glob(image) except IOError, e: print "[-] Invalid EWF format:n {}".format(e) sys.exit(2) ewf_handle = pyewf.handle() ewf_handle.open(filenames) # Open PYTSK3 handle on EWF Image img_info = ewf_Img_Info(ewf_handle) else: img_info = pytsk3.Img_Info(image) Next, we attempt to access the volume of the image using the pytsk3.Volume_Info() method by supplying it the image handle. If the partition type argument was supplied, we add its attribute ID as the second argument. If we receive an IOError when attempting to access the volume, we catch the exception as e and print it to the console. Notice, however, that we do not exit the script as we often do when we receive an error. We'll explain why in the next function. Ultimately, we pass the volume, img_info, and output variables to the openFS() method. try: if part_type is not None: attr_id = getattr(pytsk3, "TSK_VS_TYPE_" + part_type) volume = pytsk3.Volume_Info(img_info, attr_id) else: volume = pytsk3.Volume_Info(img_info) except IOError, e: print "[-] Unable to read partition table:n {}".format(e) openFS(volume, img_info, output) The openFS() method tries to access the filesystem of the container in two ways. If the volume variable is not None, it iterates through each partition, and if that partition meets certain criteria, attempts to open it. If, however, the volume variable is None, it instead tries to directly call the pytsk3.FS_Info() method on the image handle, img. As we saw, this latter method will work and give us filesystem access for logical images whereas the former works for physical images. Let's look at the differences between these two methods. Regardless of the method, we create a recursed_data list to hold our active file metadata. In the first instance, where we have a physical image, we iterate through each partition and check that is it greater than 2,048 sectors and does not contain the words "Unallocated", "Extended", or "Primary Table" in its description. For partitions meeting these criteria, we attempt to access its filesystem using the FS_Info() function by supplying the pytsk3 img object and the offset of the partition in bytes. If we are able to access the filesystem, we use to open_dir() method to get the root directory and pass that, along with the partition address ID, the filesystem object, two empty lists, and an empty string, to the recurseFiles() method. These empty lists and string will come into play in recursive calls to this function as we will see shortly. Once the recurseFiles() method returns, we append the active file metadata to the recursed_data list. We repeat this process for each partition def openFS(vol, img, output): print "[+] Recursing through files.." recursed_data = [] # Open FS and Recurse if vol is not None: for part in vol: if part.len > 2048 and "Unallocated" not in part.desc and "Extended" not in part.desc and "Primary Table" not in part.desc: try: fs = pytsk3.FS_Info(img, offset=part.start*vol.info.block_size) except IOError, e: print "[-] Unable to open FS:n {}".format(e) root = fs.open_dir(path="/") data = recurseFiles(part.addr, fs, root, [], [], [""]) recursed_data.append(data) We employ a similar method for the second instance, where we have a logical image, where the volume is None. In this case, we attempt to directly access the filesystem and, if successful, we pass that to the recurseFiles() method and append the returned data to our recursed_data list. Once we have our active file list, we send it and the user supplied output file path to the csvWriter() method. Let's dive into the recurseFiles() method which is the meat of this recipe. else: try: fs = pytsk3.FS_Info(img) except IOError, e: print "[-] Unable to open FS:n {}".format(e) root = fs.open_dir(path="/") data = recurseFiles(1, fs, root, [], [], [""]) recursed_data.append(data) csvWriter(recursed_data, output) The recurseFiles() function is based on an example of the FLS tool (https://github.com/py4n6/pytsk/blob/master/examples/fls.py) and David Cowen's Automating DFIR series tool dfirwizard (https://github.com/dlcowen/dfirwizard/blob/master/dfirwiza rd-v9.py). To start this function, we append the root directory inode to the dirs list. This list is used later to avoid unending loops. Next, we begin to loop through each object in the root directory and check that it has certain attributes we would expect and that its name is not either "." or "..". def recurseFiles(part, fs, root_dir, dirs, data, parent): dirs.append(root_dir.info.fs_file.meta.addr) for fs_object in root_dir: # Skip ".", ".." or directory entries without a name. if not hasattr(fs_object, "info") or not hasattr(fs_object.info, "name") or not hasattr(fs_object.info.name, "name") or fs_object.info.name.name in [".", ".."]: continue If the object passes that test, we extract its name using the info.name.name attribute. Next, we use the parent variable, which was supplied as one of the function's inputs, to manually create the file path for this object. There is no built-in method or attribute to do this automatically for us. We then check if the file is a directory or not and set the f_type variable to the appropriate type. If the object is a file, and it has an extension, we extract it and store it in the file_ext variable. If we encounter an AttributeError when attempting to extract this data we continue onto the next object. try: file_name = fs_object.info.name.name file_path = "{}/{}".format("/".join(parent), fs_object.info.name.name) try: if fs_object.info.meta.type == pytsk3.TSK_FS_META_TYPE_DIR: f_type = "DIR" file_ext = "" else: f_type = "FILE" if "." in file_name: file_ext = file_name.rsplit(".")[-1].lower() else: file_ext = "" except AttributeError: continue We create variables for the object size and timestamps. However, notice that we pass the dates to a convertTime() method. This function exists to convert the UNIX timestamps into a human-readable format. With these attributes extracted, we append them to the data list using the partition address ID to ensure we keep track of which partition the object is from size = fs_object.info.meta.size create = convertTime(fs_object.info.meta.crtime) change = convertTime(fs_object.info.meta.ctime) modify = convertTime(fs_object.info.meta.mtime) data.append(["PARTITION {}".format(part), file_name, file_ext, f_type, create, change, modify, size, file_path]) If the object is a directory, we need to recurse through it to access all of its sub-directories and files. To accomplish this, we append the directory name to the parent list. Then, we create a directory object using the as_directory() method. We use the inode here, which is for all intents and purposes a unique number and check that the inode is not already in the dirs list. If that were the case, then we would not process this directory as it would have already been processed. If the directory needs to be processed, we call the recurseFiles() method on the new sub_directory and pass it current dirs, data, and parent variables. Once we have processed a given directory, we pop that directory from the parent list. Failing to do this will result in false file path details as all of the former directories will continue to be referenced in the path unless removed. Most of this function was under a large try-except block. We pass on any IOError exception generated during this process. Once we have iterated through all of the subdirectories, we return the data list to the openFS() function. if f_type == "DIR": parent.append(fs_object.info.name.name) sub_directory = fs_object.as_directory() inode = fs_object.info.meta.addr # This ensures that we don't recurse into a directory # above the current level and thus avoid circular loops. if inode not in dirs: recurseFiles(part, fs, sub_directory, dirs, data, parent) parent.pop(-1) except IOError: pass dirs.pop(-1) return data Let's briefly look at the convertTime() function. We've seen this type of function before, if the UNIX timestamp is not 0, we use the datetime.utcfromtimestamp() method to convert the timestamp into a human-readable format. def convertTime(ts): if str(ts) == "0": return "" return datetime.utcfromtimestamp(ts) With the active file listing data in hand, we are now ready to write it to a CSV file using the csvWriter() method. If we did find data (i.e., the list is not empty), we open the output CSV file, write the headers, and loop through each list in the data variable. We use the csvwriterows() method to write each nested list structure to the CSV file. def csvWriter(data, output): if data == []: print "[-] No output results to write" sys.exit(3) print "[+] Writing output to {}".format(output) with open(output, "wb") as csvfile: csv_writer = csv.writer(csvfile) headers = ["Partition", "File", "File Ext", "File Type", "Create Date", "Modify Date", "Change Date", "Size", "File Path"] csv_writer.writerow(headers) for result_list in data: csv_writer.writerows(result_list) The screenshot below demonstrates the type of data this recipe extracts from forensic images. There's more... For this recipe, there are a number of improvements that could further increase its utility: Use tqdm, or another library, to create a progress bar to inform the user of the current execution progress. Learn about the additional metadata values that can be extracted from filesystem objects using pytsk3 and add them to the output CSV file. Summary In summary, we have learned how to use pytsk3 to recursively iterate through any supported filesystem by the Sleuth Kit. This comprises the basis of how we can use the Sleuth Kit to programmatically process forensic acquisitions. With this recipe, we will now be able to further interact with these files in future recipes. Resources for Article: Further resources on this subject: [article] [article] [article]

0
0
16190

Packt

26 Aug 2013

8 min read

The Unreal Engine

Packt

26 Aug 2013

8 min read

(For more resources related to this topic, see here.) Sound cues versus sound wave data There are two types of sound entries in UDK: Sound cues Sound wave data The simplest difference between the two is that Sound Wave Data is what we would have if we imported a sound file into the editor, and a Sound Cue is taking a sound wave data or multiple sound wave datas and manipulating them or combining them using a fairly robust and powerful toolset that UDK gives us in their Sound Cue Editor. In terms of uses, sound wave datas are primarily only used as parts of sound cues. However, in terms of placing ambient sounds, that is, sounds that are just sort of always playing in the background, sound wave datas and sound cues offer different situations where each is used. Regardless, they both get represented in the level as Sound Actors, of which there are several types as shown in the following screenshot: Types of sound actors A key element of any well designed level is ambient sound effects. This requires placing sound actors into the world. Some of these actors use sound wave data and others use sound cues. There are strengths, weaknesses, and specific use cases for all of them, so we'll touch on those presently. Using sound cues There are two distinct types of sound actors that call for the use of sound cues specifically. The strength of using sound cues for ambient sounds is that the different sounds can be manipulated in a wider variety of ways. Generally, this isn't necessary as most ambient sounds are some looping sound used to add sound to things like torches, rippling streams, a subtle blowing wind, or other such environmental instances. The two types of sound actors that use sound cues are Ambient Sounds and Ambient Sound Movables as shown in the following screenshot: Ambient sound As the name suggests, this is a standard ambient sound. It stays exactly where you place it and cannot be moved. These ambient sound actors are generally used for stationary sounds that need some level of randomization or some other form of specific control of multiple sound wave datas. Ambient sound movable Functionally very similar to the regular ambient sound, this variation can, as the name suggests, be moved. That means, this sort of ambient sound actor should be used in a situation where an ambient sound would be used, but needs to be mobile. The main weakness of the two ambient sound actors that utilize sound cues is that each one you place in a level is identically set to the exact settings within the sound cue. Conversely, ambient sound actors utilizing sound wave datas can be set up on an instance by instance basis. What this means is explained with the help of an example. Lets say we have two fires in our game. One is a small torch, and the other is a roaring bonfire. If we feel that using the same sound for each is what we want to do, then we can place both the ambient sound actors utilizing sound wave datas and adjust some settings within the actor to make sure that the bonfire is louder and/or lower pitched. If we wanted this type of variation using sound cues, we would have to make separate sound cues. Using sound wave data There are four types of ambient sound actors that utilize sound wave datas directly as opposed to housed within sound cues. As previously mentioned, the purpose of using ambient sound actors that use sound wave data is to avoid having to create multiple sound cues with only minimally different contents for simple ambient sounds. This is most readily displayed by the fact that the most commonly used ambient sound actors that use sound wave data are called AmbientSoundSimple and AmbientSoundSimpleToggleable as shown in the following screenshot: Ambient sound simple Ambient sound simple is, as the name suggests, the simplest of ambient sound actors. They are only used when we need one sound wave data or multiple sound wave datas to just repeat on a loop over and over again. Fortunately, most ambient sounds in a level fit this description. In most cases, if we were to go through a level and do an ambient sound pass, all we would need to use are ambient sound simples. Ambient sound non loop Ambient sound non loop are pretty much the same, functionally, as ambient sound simples. The only difference is, as the name suggests, they don't loop. They will play whatever sound wave data(s) that are set in the actor, then delay by a number of seconds that is also set within the actor, and then go through it again. This is useful when we want to have a sound play somewhat intermittently, but not be on a regular loop. Ambient sound non looping toggleable Ambient sound non looping toggleable are, for all intents and purposes, the same as the regular ambient sound non loop actors, but they are toggleable. This means, put simply, that they can be turned on and off at will using Kismet. This would obviously be useful if we needed one of these intermittent sounds to play only when certain things happened first. Ambient sound simple toggleable Ambient sound simple toggleable are basically the same as a plain old, run of the mill ambient sound simple with the difference being, as like the ambient sound non looping toggleable, it can be turned on and off using Kismet. Playing sounds in Kismet There are several different ways to play different kinds of sounds using Kismet. Firstly, if we are using a toggleable ambient sound actor, then we can simply use a toggle sequence, which can be found under New Action | Toggle. There is also a Play Sound sequence located in New Action | Sound | Play Sound. Both of these are relatively straightforward in terms of where to plug in the sound cue. Playing sounds in Matinee If we need a sound to play as part of a Matinee sequence, the Matinee tool gives us the ability to trigger the sound in question. If we have a Matinee sequence that contains a Director track, then we need to simply right-click and select Add New Sound Track. From here, we just need to have the sound cue we want to use selected in the Content Browser, and then, with the Sound Track selected in the active Matinee window, we simply place the time marker where we want the sound to play and press Enter. This will place a keyframe that will trigger our sound to play, easy as pie. The Matinee tool dialog will look like the following screenshot: Matinee will only play one sound in a sound track at a time, so if we place multiple sounds and they overlap, they won't play simultaneously. Fortunately, we can have as many separate sound tracks as we need. So if we find ourselves setting up a Matinee and two or more sounds overlap in our sound track, we can just add a second one and move some of our sounds in to it. Now that we've gone over the different ways to directly play and use sound cues, let's look at how to make and manipulate the same sound cues using UDK's surprisingly robust Sound Cue Editor. Summary Now that we have a decent grasp of what kinds of sound control UDK offers us and how to manipulate sounds in the editor, we can set about bringing our game to audible life. A quick tip for placing ambient sounds: if you look at something that visually seems like it should be making a noise like a waterfall, a fire, a flickering light, or whatever else, then it probably should have an ambient sound of some sort placed right on it. And as always, what we've covered in this article is an overview of some of the bare bones basics required to get started exploring sounds and soundscapes in UDK. There are plenty of other actors, settings, and things that can be done. So, again, I recommend playing around with anything you can find. Experiment with everything in UDK and you'll learn all sorts of new and interesting things. Resources for Article: Further resources on this subject: Unreal Development Toolkit: Level Design HQ [Article] Getting Started on UDK with iOS [Article] Configuration and Handy Tweaks for UDK [Article]

0
0
16187

article-image-large-language-models-llms-and-knowledge-graphs

Mostafa Ibrahim

15 Nov 2023

7 min read

Large Language Models (LLMs) and Knowledge Graphs

Mostafa Ibrahim

15 Nov 2023

7 min read

Dive deeper into the world of AI innovation and stay ahead of the AI curve! Subscribe to our AI_Distilled newsletter for the latest insights. Don't miss out – sign up today!IntroductionHarnessing the power of AI, this article explores how Large Language Models (LLMs) like OpenAI's GPT can analyze data from Knowledge Graphs to revolutionize data interpretation, particularly in healthcare. We'll illustrate a use case where an LLM assesses patient symptoms from a Knowledge Graph to suggest diagnoses, showcasing LLM’s potential to support medical diagnostics with precision.Brief Introduction Into Large Language Models (LLMs)Large Language Models (LLMs), such as OpenAI's GPT series, represent a significant advancement in the field of artificial intelligence. These models are trained on vast datasets of text, enabling them to understand and generate human-like language.LLMs are adept at understanding complex questions and providing appropriate responses, akin to human analysis. This capability stems from their extensive training on diverse datasets, allowing them to interpret context and generate relevant text-based answers.While LLMs possess advanced data processing capabilities, their effectiveness is often limited by the static nature of their training data. Knowledge Graphs step in to fill this gap, offering a dynamic and continuously updated source of information. This integration not only equips LLMs with the latest data, enhancing the accuracy and relevance of their output but also empowers them to solve more complex problems with a greater level of sophistication. As we harness this powerful combination, we pave the way for innovative solutions across various sectors that demand real-time intelligence, such as the ever-fluctuating stock market.Exploring Knowledge Graphs and How LLMs Can Benefit From ThemKnowledge Graphs represent a pivotal advancement in organizing and utilizing data, especially in enhancing the capabilities of Large Language Models (LLMs).Knowledge Graphs organize data in a graph format, where entities (like people, places, and things) are nodes, and the relationships between them are edges. This structure allows for a more nuanced representation of data and its interconnected nature. Take the above Knowledge Graph as an example.Doctor Node: This node represents the doctor. It is connected to the patient node with an edge labeled "Patient," indicating the doctor-patient relationship.Patient Node (Patient123): This is the central node representing a specific patient, known as "Patient123." It serves as a junction point connecting to various symptoms that the patient is experiencing.Symptom Nodes: There are three separate nodes representing individual symptoms that the patient has: "Fever," "Cough," and "Shortness of breath." Each of these symptoms is connected to the patient node by edges labeled "Symptom," indicating that these are the symptoms experienced by "Patient123. To simplify, the Knowledge Graph shows that "Patient123" is a patient of the "Doctor" and is experiencing three symptoms: fever, cough, and shortness of breath. This type of graph is useful in medical contexts where it's essential to model the relationships between patients, their healthcare providers, and their medical conditions or symptoms. It allows for easy querying of related data—for example, finding all symptoms associated with a particular patient or identifying all patients experiencing a certain symptom.Practical Integration of LLMs and Knowledge GraphsStep 1: Installing and Importing the Necessary LibrariesIn this step, we're going to bring in two essential libraries: rdflib for constructing our Knowledge Graph and openai for tapping into the capabilities of GPT, the Large Language Model.!pip install rdflib !pip install openai==0.28 import rdflib import openaiStep 2: Import your Personal OPENAI API KEYopenai.api_key = "Insert Your Personal OpenAI API Key Here"Step 3: Creating a Knowledge Graph# Create a new and empty Knowledge graph g = rdflib.Graph() # Define a Namespace for health-related data namespace = rdflib.Namespace("http://example.org/health/")Step 4: Adding data to Our GraphIn this part of the code, we will introduce a single entry to the Knowledge Graph pertaining to patient124. This entry will consist of three distinct nodes, each representing a different symptom exhibited by the patient.def add_patient_data(patient_id, symptoms): patient_uri = rdflib.URIRef(patient_id) for symptom in symptoms: symptom_predicate = namespace.hasSymptom g.add((patient_uri, symptom_predicate, rdflib.Literal(symptom))) # Example of adding patient data add_patient_data("Patient123", ["fever", "cough", "shortness of breath"])Step 5: Identifying the get_stock_price functionWe will utilize a simple query in order to extract the required data from the knowledge graph.def get_patient_symptoms(patient_id): # Correctly reference the patient's URI in the SPARQL query patient_uri = rdflib.URIRef(patient_id) sparql_query = f""" PREFIX ex: <http://example.org/health/> SELECT ?symptom WHERE {{ <{patient_uri}> ex:hasSymptom ?symptom. }} """ query_result = g.query(sparql_query) symptoms = [str(row.symptom) for row in query_result] return symptomsStep 6: Identifying the generate_llm_response functionThe generate_daignosis_response function takes as input the user’s name along with the list of symptoms extracted from the graph. Moving on, the LLM uses such data in order to give the patient the most appropriate diagnosis.def generate_diagnosis_response(patient_id, symptoms): symptoms_list = ", ".join(symptoms) prompt = f"A patient with the following symptoms - {symptoms_list} - has been observed. Based on these symptoms, what could be a potential diagnosis?" # Placeholder for LLM response (use the actual OpenAI API) llm_response = openai.Completion.create( model="text-davinci-003", prompt=prompt, max_tokens=100 ) return llm_response.choices[0].text.strip() # Example usage patient_id = "Patient123" symptoms = get_patient_symptoms(patient_id) if symptoms: diagnosis = generate_diagnosis_response(patient_id, symptoms) print(diagnosis) else: print(f"No symptoms found for {patient_id}.")Output: The potential diagnosis could be pneumonia. Pneumonia is a type of respiratory infection that causes symptoms including fever, cough, and shortness of breath. Other potential diagnoses should be considered as well and should be discussed with a medical professional.As demonstrated, the LLM connected the three symptoms—fever, cough, and shortness of breath—to suggest that patient123 may potentially be diagnosed with pneumonia.ConclusionIn summary, the collaboration of Large Language Models and Knowledge Graphs presents a substantial advancement in the realm of data analysis. This article has provided a straightforward illustration of their potential when working in tandem, with LLMs to efficiently extract and interpret data from Knowledge Graphs.As we further develop and refine these technologies, we hold the promise of significantly improving analytical capabilities and informing more sophisticated decision-making in an increasingly data-driven world.Author BioMostafa Ibrahim is a dedicated software engineer based in London, where he works in the dynamic field of Fintech. His professional journey is driven by a passion for cutting-edge technologies, particularly in the realms of machine learning and bioinformatics. When he's not immersed in coding or data analysis, Mostafa loves to travel.Medium

0
0
16177

article-image-at-defcon-27-darpas-10-million-voting-system-could-not-be-hacked-by-voting-village-hackers-due-to-a-bug

Savia Lobo

12 Aug 2019

4 min read

At DefCon 27, DARPA's $10 million voting system could not be hacked by Voting Village hackers due to a bug

Savia Lobo

12 Aug 2019

4 min read

At the DefCon security conference in Las Vegas, for the last two years, hackers have come to the Voting Village every year to scrutinize voting machines and analyze them for vulnerabilities. This year, at DefCon 27, the targeted voting machine included a $10 million project by DARPA (Defense Advanced Research Projects Agency). However, hackers were unable to break into the system, not because of robust security features, but due to technical difficulties during the setup. “A bug in the machines didn't allow hackers to access their systems over the first two days,” CNet reports. DARPA announced this voting system in March, this year, hoping that it “will be impervious to hacking”. The system will be designed by the Oregon-based verifiable systems firm, Galois. “The agency hopes to use voting machines as a model system for developing a secure hardware platform—meaning that the group is designing all the chips that go into a computer from the ground up, and isn’t using proprietary components from companies like Intel or AMD,” Wired reports. Linton Salmon, the project’s program manager at Darpa says, “The goal of the program is to develop these tools to provide security against hardware vulnerabilities. Our goal is to protect against remote attacks.” Voting Village's co-founder Harri Hursti said, the five machines brought in by Galois, “seemed to have had a myriad of different kinds of problems. Unfortunately, when you're pushing the envelope on technology, these kinds of things happen." “The Darpa machines are prototypes, currently running on virtualized versions of the hardware platforms they will eventually use.” However, at Voting Village 2020, Darpa plans to include complete systems for hackers to access. Dan Zimmerman, principal researcher at Galois said, “All of this is here for people to poke at. I don’t think anyone has found any bugs or issues yet, but we want people to find things. We’re going to make a small board solely for the purpose of letting people test the secure hardware in their homes and classrooms and we’ll release that.” Sen. Wyden says if voting system security standards fail to change, the consequences will be much worse than 2016 elections After the cyberattacks in the 2016 U.S. presidential elections, there is a higher risk of securing voters data in the upcoming presidential elections next year. Senator Ron Wyden said if the voting system security standards fail to change, the consequences could be far worse than the 2016 elections. In his speech on Friday at the Voting Village, Wyden said, "If nothing happens, the kind of interference we will see form hostile foreign actors will make 2016 look like child's play. We're just not prepared, not even close, to stop it." Wyden proposed an election security bill requiring paper ballots in 2018. However, the bill was blocked in the Senate by Majority Leader Mitch McConnell who called the bill a partisan legislation. On Friday, a furious Wyden held McConnell responsible calling him the reason why Congress hasn't been able to fix election security issues. "It sure seems like Russia's No. 1 ally in compromising American election security is Mitch McConnell," Wyden said. https://twitter.com/ericgeller/status/1159929940533321728 According to a security researcher, the voting system has a terrible software vulnerability Dan Wallach, a security researcher at Rice University in Houston, Texas told Wired, “There’s a terrible software vulnerability in there. I know because I wrote it. It’s a web server that anyone can connect to and read/write arbitrary memory. That’s so bad. But the idea is that even with that in there, an attacker still won’t be able to get to things like crypto keys or anything really. All they would be able to do right now is crash the system.” According to CNet, “While the voting process worked, the machines weren't able to connect with external devices, which hackers would need in order to test for vulnerabilities. One machine couldn't connect to any networks, while another had a test suite that didn't run, and a third machine couldn't get online.” The machine's prototype allows people to vote with a touchscreen, print out their ballot and insert it into the verification machine, which ensures that votes are valid through a security scan. According to Wired, Galois even added vulnerabilities on purpose to see how its system defended against flaws. https://twitter.com/VotingVillageDC/status/1160663776884154369 To know more about this news in detail, head over to Wired report. DARPA plans to develop a communication platform similar to WhatsApp DARPA’s $2 Billion ‘AI Next’ campaign includes a Next-Generation Nonsurgical Neurotechnology (N3) program Black Hat USA 2019 conference Highlights: IBM’s ‘warshipping’, OS threat intelligence bots, Apple’s $1M bug bounty programs and much more!

0
0
16171

article-image-getting-started-with-qt-widgets-in-android-video

Sugandha Lahoti

22 Oct 2018

2 min read

Getting started with Qt Widgets in Android [video]

Sugandha Lahoti

22 Oct 2018

2 min read

Qt is a powerful, cross-platform, graphics development framework. It provides a large set of consistent, standardized libraries and works on many major platforms, including embedded, mobile, desktop, and the web. Qt’s significant mobile offerings are aligned with the trend to go mobile. These include QtQuick, QML, Qt widgets in Android, and communicating between C++ and QML. In this video, Benjamin Hoff introduces us to the Qt widgets in Android. He talks about how to install Qt Android environment, set up Qt creator for Android deployment, and to build for release. This clip is taken from the course Mastering Qt 5 GUI Programming by Benjamin Hoff. With this course, you will master application development with Qt for Android, Windows, Linux, and web. Installing Qt Android environment Install Android SDK (Standard Development Kit) and Android NDK (Native Development Kit) Install Java SE Development Kit (JDK) or OpenJDK on Linux Compile Qt for the architecture you’re targeting, You can download it using the Qt online installer or install it using your package manager. Watch the video to walk through each of the methods in detail. If you liked the video, don’t forget to check out the comprehensive course Mastering Qt 5 GUI Programming, packed with step-by-step instructions, working examples, and helpful tips and techniques on working with Qt. About the author Benjamin Hoff is a Mechanical Engineer by education who has spent the first 3 years of his career doing graphics processing, desktop application development, and facility simulation using a mixture of C++ and python under the tutelage of a professional programmer. After rotating back into a mechanical engineering job, Benjamin has continued to develop software utilizing the skills he developed during his time as a professional programmer. Qt creator 4.8 beta released, adds language server protocol. WebAssembly comes to Qt. Now you can deploy your next Qt app in browser How to create multithreaded applications in Qt

0
0
16153

How-To Tutorials

article-image-perform-crud-operations-on-mongodb-with-php

Amey Varangaonkar

17 Mar 2018

6 min read

Perform CRUD operations on MongoDB with PHP

Amey Varangaonkar

17 Mar 2018

6 min read

0
0
16145

Prakhar Mishra

06 Nov 2023

9 min read

Fine-Tuning LLaMA 2

Prakhar Mishra

06 Nov 2023

9 min read

Dive deeper into the world of AI innovation and stay ahead of the AI curve! Subscribe to our AI_Distilled newsletter for the latest insights. Don't miss out – sign up today!IntroductionLarge Language Models have recently become the talk of the town. I am very sure, you must have heard of ChatGPT. Yes, that’s an LLM, and that’s what I am talking about. Every few weeks, we have been witnessing newer, better but not necessarily larger LLMs coming out either as open-source or closed-source. This is probably the best time to learn about them and make these powerful models work for your specific use case.In today’s blog, we will look into one of the recent open-source models called Llama2 and try to fine-tune it on a standard NLP task of recognizing entities from text. We will first look into what are large language models, what are open-source and closed-source models, and some examples of them. We will then move to learning about Llama2 and why is it so special. We then describe our NLP task and dataset. Finally, we get into coding.About Large Language Models (LLMs)Language models are artificial intelligence systems that have been trained to understand and generate human language. Large Language Models (LLMs) like GPT-3, ChatGPT, GPT-4, Bard, and similar can perform diverse sets of tasks out of the box. Often the quality of output from these large language models is highly dependent on the finesse of the prompt given by the user.These Language models are trained on vast amounts of text data from the Internet. Most of the language models are trained in an auto-regressive way i.e. they try to maximize the probability of the next word based on the words they have produced or seen in the past. This data includes a wide range of written text, from books and articles to websites and social media posts. Language models have a wide range of applications, including chatbots, virtual assistants, content generation, and more. They can be used in industries like customer service, healthcare, finance, and marketing.Since these models are trained on enormous data, they are already good at zero-shot inference and can be steered to perform better with few-shot examples. Zero-shot is a setup in which a model can learn to recognize things that it hasn't explicitly seen before in training. In a Few-shot setting, the goal is to make predictions for new classes based on the few examples of labeled data that is provided to it at inference time.Despite their amazing capabilities of generating text, these humongous models come with a few limitations that must be thought of when building an LLM-based production pipeline. Some of these limitations are hallucinations, biases, and more.Closed and Open-source Language ModelsLarge language models from closed-source are those employed by some companies and are not readily accessible to the public. Training data for these models are typically kept private. While they can be highly sophisticated, this limits transparency, potentially leading to concerns about bias, and data privacy.In contrast, open-source projects like GPT-3, are designed to be freely available to researchers and developers. These models are trained on extensive, publicly available datasets, allowing for a degree of transparency and collaboration.The decision between closed- and open-source language models is influenced by several variables, such as the project's goals, the need for openness, and others.About LLama2Meta's open-source LLM is called Llama 2. It was trained with 2 trillion "tokens" from publicly available sources like Wikipedia, Common Crawl, and books from the Gutenberg project. Three different parameter level model versions are available, i.e. 7 billion, 13 billion, and 70 billion parameter models. There are two types of completion models available: Chat-tuned and General. The chat-tuned models that have been fine-tuned for chatbot-like dialogue are denoted by the suffix '-chat'. We will use general Meta's 7b Llama-2 huggingface model as the base model that we fine-tune. Feel free to use any other version of llama2-7b.Also, if you are interested, there are several threads that you can go through to understand how good is Llama family w.r.t GPT family is - source, source, source.About Named Entity RecognitionAs a component of information extraction, named-entity recognition locates and categorizes specific entities inside the unstructured text by allocating them to pre-defined groups, such as individuals, organizations, locations, measures, and more. NER offers a quick way to understand the core idea or content of a lengthy text.There are many ways of extracting entities from a given text, in this blog, we will specifically delve into fine-tuning Llama2-7b using PEFT techniques on Colab Notebook.We will transform the SMSSpamCollection classification data set for NER. Pretty interesting 😀We search through all 10 letter words and tag them as 10_WORDS_LONG. And this is the entity that we want our Llama to extract. But why this bizarre formulation? I did it intentionally to show that this is something that the pre-trained model would not have seen during the pre-training stage. So it becomes essential to fine-tune it and make it work for our use case 👍. But surely we can add logic to our formulation - think of these words as probable outliers/noisy words. The larger the words, the higher the possibility of it being noise/oov. However, you will have to come up with the extract letter count after seeing the word length distribution. Please note that the code is generic enough for fine-tuning any number of entities. It’s just a change in the data preparation step that we will make to slice out only relevant entities.Code for Fine-tuning Llama2-7b# Importing Libraries from transformers import LlamaTokenizer, LlamaForCausalLM import torch from datasets import Dataset import transformers import pandas as pd from peft import get_peft_model, LoraConfig, TaskType, prepare_model_for_int8_training, get_peft_model_state_dict, PeftModel from sklearn.utils import shuffleData Preparation Phasedf = pd.read_csv('SMSSpamCollection', sep='\t', header=None) all_text = df[1].str.lower().tolist() input, output = [], [] for text in all_text: input.append(text) output.append({word: '10_WORDS_LONG' for word in text.split() if len(word)==10}) df = pd.DataFrame([input, output]).T df.rename({0:'input_text', 1: 'output_text'}, axis=1, inplace=True) print (df.head(5)) total_ds = shuffle(df, random_state=42) total_train_ds = total_ds.head(4000) total_test_ds = total_ds.tail(1500) total_train_ds_hf = Dataset.from_pandas(total_train_ds) total_test_ds_hf = Dataset.from_pandas(total_test_ds) tokenized_tr_ds = total_train_ds_hf.map(generate_and_tokenize_prompt) tokenized_te_ds = total_test_ds_hf.map(generate_and_tokenize_prompt) Fine-tuning Phase# Loading Modelmodel_name = "meta-llama/Llama-2-7b-hf" tokenizer = AutoTokenizer.from_pretrained(model_name) model = AutoModelForCausalLM.from_pretrained(model_name) def create_peft_config(m): peft_cofig = LoraConfig( task_type=TaskType.CAUSAL_LM, inference_mode=False, r=8, lora_alpha=16, lora_dropout=0.05, target_modules=['q_proj', 'v_proj'], ) model = prepare_model_for_int8_training(model) model.enable_input_require_grads() model = get_peft_model(model, peft_cofig) model.print_trainable_parameters() return model, peft_cofig model, lora_config = create_peft_config(model) def generate_prompt(data_point): return f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request. ### Instruction: Extract entity from the given input: ### Input: {data_point["input_text"]} ### Response: {data_point["output_text"]}""" tokenizer.pad_token_id = 0 def tokenize(prompt, add_eos_token=True): result = tokenizer( prompt, truncation=True, max_length=128, padding=False, return_tensors=None, ) if ( result["input_ids"][-1] != tokenizer.eos_token_id and len(result["input_ids"]) < 128 and add_eos_token ): result["input_ids"].append(tokenizer.eos_token_id) result["attention_mask"].append(1) result["labels"] = result["input_ids"].copy() return result def generate_and_tokenize_prompt(data_point): full_prompt = generate_prompt(data_point) tokenized_full_prompt = tokenize(full_prompt) return tokenized_full_prompt training_arguments = transformers.TrainingArguments( per_device_train_batch_size=1, gradient_accumulation_steps=16, learning_rate=4e-05, logging_steps=100, optim="adamw_torch", evaluation_strategy="steps", save_strategy="steps", eval_steps=100, save_steps=100, output_dir="saved_models/" ) data_collator = transformers.DataCollatorForSeq2Seq(tokenizer) trainer = transformers.Trainer(model=model, tokenizer=tokenizer, train_dataset=tokenized_tr_ds, eval_dataset=tokenized_te_ds, args=training_arguments, data_collator=data_collator) with torch.autocast("cuda"): trainer.train()InferenceLoaded_tokenizer = LlamaTokenizer.from_pretrained(model_name) Loaded_model = LlamaForCausalLM.from_pretrained(model_name, load_in_8bit=True, torch.dtype=torch.float16, device_map=’auto’) Model = PeftModel.from_pretrained(Loaded_model, “saved_model_path”, torch.dtype=torch.float16) Model.config.pad_tokeni_id = loaded_tokenizer.pad_token_id = 0 Model.eval() def extract_entity(text): inp = Loaded_tokenizer(prompt, return_tensor=’pt’).to(“cuda”) with torch.no_grad(): P_ent = Loaded_tokenizer.decode(model.generate(**inp, max_new_tokens=128)[0], skip_special_tokens=True) int_idx = P_ent.find(‘Response:’) P_ent = P_ent[int_idx+len(‘Response:’):] return P_ent.strip() extracted_entity = extract_entity(text) print (extracted_entity) ConclusionWe covered the process of optimizing the llama2-7b model for the Named Entity Recognition job in this blog post. For that matter, it can be any task that you are interested in. The core concept that one must learn from this blog is PEFT-based training of large language models. Additionally, as pre-trained LLMs might not always perform well in your work, it is best to fine-tune these models.Author BioPrakhar Mishra has a Master’s in Data Science with over 4 years of experience in industry across various sectors like Retail, Healthcare, Consumer Analytics, etc. His research interests include Natural Language Understanding and generation, and has published multiple research papers in reputed international publications in the relevant domain. Feel free to reach out to him on LinkedIn

0
0
16143

article-image-knowing-the-threat-actors-behind-a-cyber-attack

Savia Lobo

30 Mar 2019

7 min read

Knowing the threat actors behind a cyber attack

Savia Lobo

30 Mar 2019

7 min read

This article is an excerpt taken from the book, Hands-On Cybersecurity for Finance written by Dr. Erdal Ozkaya and Milad Aslaner. In this book you will learn how to successfully defend your system against common cyber threats, making sure your financial services are a step ahead in terms of security. In this article, you will understand the different types of threat actor groups and their motivations. The attackers behind cyber attacks can be classified into the following categories: Cybercriminals Cyber terrorists Hacktivists "What really concerns me is the sophistication of the capability, which is becoming good enough to really threaten parts of our critical infrastructure, certainly in the financial, banking sector."– Director of Europol Robert Wainwright Hacktivism Hacktivism, as defined by the Industrial Control Systems Cyber Emergency Response Team (ICS-CERT), refers to threat actors that depend on propaganda rather than damage to critical infrastructures. Their goal is to support their own political agenda, which varies between anti-corruption, religion, environmental, or anti-establishment concerns. Their sub-goals are propaganda and causing damage to achieve notoriety for their cause. One of the most prominent hacktivist threat actor groups is Anonymous. Anonymous is known primarily for their distributed denial of service (DDoS) attacks on governments and the Church of Scientology. The following screenshot shows "the man without a head," which is commonly used by Anonymous as their emblem: Hacktivists target companies and governments based on the organization's mission statement or ethics. Given that the financial services industry is responsible for economic wealth, they tend to be a popular target for hacktivists. The ideologies held by hacktivists can vary, but at their core, they focus on bringing attention to social issues such as warfare or what they consider to be illegal activities. To spread their beliefs, they choose targets that allow them to spread their message as quickly as possible. The primary reason why hacktivists are choosing organizations in the financial services industry sector is that these organizations typically have a large user base, allowing them to raise the profile of their beliefs very quickly once they have successfully breached the organization's security controls. Case study – Dakota Access Pipeline The Dakota Access Pipeline (DAPL) was a 2016 construction of a 1.172-mile-long pipeline that spanned three states in the US. Native American tribes were protesting against the DAPL because of the fear that it would damage sacred grounds and drinking water. Shortly after the protests began, the hacktivist group Anonymous publicly announced their support under the name OpNoDAPL. During the construction, Anonymous launched numerous DDoS attacks against the organizations involved in the DAPL. Anonymous leaked the personal information of employees that were responsible for the DAPL and threatened that this would continue if they did not quit. The following screenshot shows how this attack spread on Twitter: Case study – Panama Papers In 2015, an offshore law firm called Mossack Fonseca had 11.5 million of their documents leaked. These documents contained confidential financial information for more than 214,488 offshore entities under what was later known as the Panama Papers. In the leaked documents, several national leaders, politicians, and industry leaders were identified, including a trail to Vladimir Putin. The following diagram shows how much was exposed as part of this attack: While there is not much information available on how the cyber attack occurred, various security researchers have analyzed the operation. According to the WikiLeaks post, which claims to show a client communication from Mossack Fonseca, they confirm that there was a breach of their "email server". Considering the size of the data leak, it is believed that a direct attack occurred on the email servers. Cyber terrorists Extremist and terrorist organizations such as Al Qaeda and Islamic State of Iraq and Syria (ISIS) are using the internet to distribute their propaganda, recruiting new terrorists and communicating via this medium. An example of this is the 2008 attack in Mumbai, in which one of the gunmen confirmed that they used Google Earth to familiarize themselves with the locations of buildings. Cyber terrorism is an extension of traditional terrorism in cyber space. Case study – Operation Ababil In 2012, the Islamic group Izz ad-Din al-Qassam Cyber Fighters—which is a military wing of Hamas—attacked a series of American financial institutions. On September 18th 2012, this threat actor group confirmed that they were behind the cyber attack and justified it due to the relationship of the United States government with Israel. They also claimed that this was a response to the Innocence of Muslims video released by the American pastor Terry Jones. As part of a DDoS attack, they targeted the New York Stock Exchange as well as banks such as J.P. Morgan Chase. Cyber criminals Cyber criminals are either individuals or groups of hackers who use technology to commit crimes in the digital world. The primary driver of cyber criminals is financial gain and/or service disruption. Cyber criminals use computers in three broad ways: Select computers as their target: These criminals attack other people's computers to perform malicious activities, such as spreading viruses, data theft, identity theft, and more. Use computers as their weapon: They use computers to carry out "conventional crime", such as spam, fraud, illegal gambling, and more. Use computers as an accessory: They use computers to save stolen or illegal data. The following provides the larger picture so we can understand how Cyber Criminals has penetrated into the finance sector and wreaked havoc: Becky Pinkard, vice president of service delivery and intelligence at Digital Shadows Ltd, states that "Attackers can harm the bank by adding or subtracting a zero with every balance, or even by deleting entire accounts". Case study – FIN7 On August 1st 2018, the United States District Attorney's Office for the Western District of Washington announced the arrest of several members of the cyber criminal organization FIN7, which had been tracked since 2015. To this date, security researchers believe that FIN7 is one of the largest threat actor groups in the financial services industry. Combi Security is a FIN7 shelf company. The screenshot presented here shows a phishing email sent by FIN7 to victims claiming it was sent by the US Food and Drug Administration (FDA) Case study – Carbanak APT Attack Carbanak is an advanced persistent threat (APT) attack that is believed to have been executed by the threat actor group Cobalt Strike Group in 2014. In this operation, the threat actor group was able to generate a total financial loss for victims of more than 1 billion US dollars. The following depicts how the Carbanak cyber-gang stole $1bn by targeting a bank: Case study – OurMine operation In 2016, the threat actor group OurMine, who are suspected to operate in Saudi Arabia, conducted a DDoS attack against HSBC's websites, hosted in the USA and UK. The following screenshot shows the communication by the threat actor: The result of the DDoS attack was that HSBC websites for the US and the UK were unavailable. The following screenshot shows the HSBC USA website after the DDoS attack: Summary The financial services industry is one of the most popular victim industries for cybercrime. Thus, in this article, you learned about different threat actor groups and their motivations. It is important to understand these in order to build and execute a successful cybersecurity strategy. Head over to the book, Hands-On Cybersecurity for Finance, to know more about the costs associated with cyber attacks and cybersecurity. RSA Conference 2019 Highlights: Top 5 cybersecurity products announced Security experts, Wolf Halton and Bo Weaver, discuss pentesting and cybersecurity [Interview] Hackers are our society’s immune system – Keren Elazari on the future of Cybersecurity

0
0
16138

How-To Tutorials

article-image-implementing-apache-spark-k-means-clustering-method-on-digital-breath-test-data-for-road-safety

Savia Lobo

01 Mar 2018

7 min read

Implementing Apache Spark K-Means Clustering method on digital breath test data for road safety

Savia Lobo

01 Mar 2018

7 min read

[box type="note" align="" class="" width=""]This article is an excerpt taken from a book Mastering Apache Spark 2.x - Second Edition written by Romeo Kienzler. In this book, you will learn to use Spark as a big data operating system, understand how to implement advanced analytics on the new APIs, and explore how easy it is to use Spark in day-to-day tasks.[/box] In today’s tutorial, we have used the Road Safety test data from our previous article, to show how one can attempt to find clusters in data using K-Means algorithm with Apache Spark MLlib. Theory on Clustering The K-Means algorithm iteratively attempts to determine clusters within the test data by minimizing the distance between the mean value of cluster center vectors, and the new candidate cluster member vectors. The following equation assumes dataset members that range from X1 to Xn; it also assumes K cluster sets that range from S1 to Sk, where K <= n. K-Means in practice The K-Means MLlib functionality uses the LabeledPoint structure to process its data and so it needs numeric input data. As the same data from the last section is being reused, we will not explain the data conversion again. The only change that has been made in data terms in this section, is that processing in HDFS will now take place under the /data/spark/kmeans/ directory. Additionally, the conversion Scala script for the K-Means example produces a record that is all comma-separated. The development and processing for the K-Means example has taken place under the /home/hadoop/spark/kmeans directory to separate the work from other development. The sbt configuration file is now called kmeans.sbt and is identical to the last example, except for the project name: name := "K-Means" The code for this section can be found in the software package under chapter7K-Means. So, looking at the code for kmeans1.scala, which is stored under kmeans/src/main/scala, some similar actions occur. The import statements refer to the Spark context and configuration. This time, however, the K-Means functionality is being imported from MLlib. Additionally, the application class name has been changed for this example to kmeans1: import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ import org.apache.spark.SparkConf import org.apache.spark.mllib.linalg.Vectors import org.apache.spark.mllib.clustering.{KMeans,KMeansModel} object kmeans1 extends App { The same actions are being taken as in the last example to define the data file--to define the Spark configuration and create a Spark context: val hdfsServer = "hdfs://localhost:8020" val hdfsPath = "/data/spark/kmeans/" val dataFile = hdfsServer + hdfsPath + "DigitalBreathTestData2013- MALE2a.csv" val sparkMaster = "spark://localhost:7077" val appName = "K-Means 1" val conf = new SparkConf() conf.setMaster(sparkMaster) conf.setAppName(appName) val sparkCxt = new SparkContext(conf) Next, the CSV data is loaded from the data file and split by comma characters into the VectorData variable: val csvData = sparkCxt.textFile(dataFile) val VectorData = csvData.map { csvLine => Vectors.dense( csvLine.split(',').map(_.toDouble)) } A KMeans object is initialized, and the parameters are set to define the number of clusters and the maximum number of iterations to determine them: val kMeans = new KMeans val numClusters = 3 val maxIterations = 50 Some default values are defined for the initialization mode, number of runs, and Epsilon, which we needed for the K-Means call but did not vary for the processing. Finally, these parameters were set against the KMeans object: val initializationMode = KMeans.K_MEANS_PARALLEL val numRuns = 1 val numEpsilon = 1e-4 kMeans.setK( numClusters ) kMeans.setMaxIterations( maxIterations ) kMeans.setInitializationMode( initializationMode ) kMeans.setRuns( numRuns ) kMeans.setEpsilon( numEpsilon ) We cached the training vector data to improve the performance and trained the KMeans object using the vector data to create a trained K-Means model: VectorData.cache val kMeansModel = kMeans.run( VectorData ) We have computed the K-Means cost and number of input data rows, and have output the results via println statements. The cost value indicates how tightly the clusters are packed and how separate the clusters are: val kMeansCost = kMeansModel.computeCost( VectorData ) println( "Input data rows : " + VectorData.count() ) println( "K-Means Cost : " + kMeansCost ) Next, we have used the K-Means Model to print the cluster centers as vectors for each of the three clusters that were computed: kMeansModel.clusterCenters.foreach{ println } Finally, we use the K-Means model predict function to create a list of cluster membership predictions. We then count these predictions by value to give a count of the data points in each cluster. This shows which clusters are bigger and whether there really are three clusters: val clusterRddInt = kMeansModel.predict( VectorData ) val clusterCount = clusterRddInt.countByValue clusterCount.toList.foreach{ println } } // end object kmeans1 So, in order to run this application, it must be compiled and packaged from the kmeans subdirectory as the Linux pwd command shows here: [hadoop@hc2nn kmeans]$ pwd /home/hadoop/spark/kmeans [hadoop@hc2nn kmeans]$ sbt package Loading /usr/share/sbt/bin/sbt-launch-lib.bash [info] Set current project to K-Means (in build file:/home/hadoop/spark/kmeans/) [info] Compiling 2 Scala sources to /home/hadoop/spark/kmeans/target/scala-2.10/classes... [info] Packaging /home/hadoop/spark/kmeans/target/scala-2.10/k- means_2.10-1.0.jar ... [info] Done packaging. [success] Total time: 20 s, completed Feb 19, 2015 5:02:07 PM Once this packaging is successful, we check HDFS to ensure that the test data is ready. As in the last example, we convert our data to numeric form using the convert.scala file, provided in the software package. We will process the DigitalBreathTestData2013- MALE2a.csv data file in the HDFS directory, /data/spark/kmeans, as follows: [hadoop@hc2nn nbayes]$ hdfs dfs -ls /data/spark/kmeans Found 3 items -rw-r--r-- 3 hadoop supergroup 24645166 2015-02-05 21:11 /data/spark/kmeans/DigitalBreathTestData2013-MALE2.csv -rw-r--r-- 3 hadoop supergroup 5694226 2015-02-05 21:48 /data/spark/kmeans/DigitalBreathTestData2013-MALE2a.csv drwxr-xr-x - hadoop supergroup 0 2015-02-05 21:46 /data/spark/kmeans/result The spark-submit tool is used to run the K-Means application. The only change in this command is that the class is now kmeans1: spark-submit --class kmeans1 --master spark://localhost:7077 --executor-memory 700M --total-executor-cores 100 /home/hadoop/spark/kmeans/target/scala-2.10/k-means_2.10-1.0.jar The output from the Spark cluster run is shown to be as follows: Input data rows : 467054 K-Means Cost : 5.40312223450789E7 The previous output shows the input data volume, which looks correct; it also shows the K- Means cost value. The cost is based on the Within Set Sum of Squared Errors (WSSSE) which basically gives a measure how well the found cluster centroids are matching the distribution of the data points. The better they are matching, the lower the cost. The following link https://datasciencelab.wordpress.com/2013/12/27/finding-the-k-in-k-means-clustering/ explains WSSSE and how to find a good value for k in more detail. Next come the three vectors, which describe the data cluster centers with the correct number of dimensions. Remember that these cluster centroid vectors will have the same number of columns as the original vector data: [0.24698249738061878,1.3015883142472253,0.005830116872250263,2.917374778855 5207,1.156645130895448,3.4400290524342454] [0.3321793984152627,1.784137241326256,0.007615970459266097,2.58319870759289 17,119.58366028156011,3.8379106085083468] [0.25247226760684494,1.702510963969387,0.006384899819416975,2.2314042480006 88,52.202897927594805,3.551509158139135] Finally, cluster membership is given for clusters 1 to 3 with cluster 1 (index 0) having the largest membership at 407539 member vectors: (0,407539) (1,12999) (2,46516) To summarize, we saw a practical example that shows how K-means algorithm is used to cluster data with the help of Apache Spark. If you found this post useful, do check out this book Mastering Apache Spark 2.x - Second Edition to learn about the latest enhancements in Apache Spark 2.x, such as interactive querying of live data and unifying DataFrames and Datasets.

0
1
16136

article-image-article-phone-calls-send-sms-your-website-using-twilio

Packt

21 Mar 2014

9 min read

Make phone calls and send SMS messages from your website using Twilio

Packt

21 Mar 2014

9 min read

0
0
16111

GNU Octave: data analysis examples

MS Access Queries with Oracle SQL Developer 1.2 Tool

Microsoft Cognitive Services

Tableau Data Extract Best Practices

Importing Dynamic Data

Working with Forensic Evidence Container Recipes

The Unreal Engine

Large Language Models (LLMs) and Knowledge Graphs

At DefCon 27, DARPA's $10 million voting system could not be hacked by Voting Village hackers due to a bug

Getting started with Qt Widgets in Android [video]

Trending Topics

Perform CRUD operations on MongoDB with PHP

Fine-Tuning LLaMA 2

Knowing the threat actors behind a cyber attack

Implementing Apache Spark K-Means Clustering method on digital breath test data for road safety

Make phone calls and send SMS messages from your website using Twilio

Create a Free Account To Continue Reading

Sign in to activate your 7-day free access