How-To Tutorials


Up and Running with pandas

Packt
18 Jul 2017
15 min read
In this article by Michael Heydt, author of the book Learning pandas - Second Edition, we will cover how to install pandas and start using its basic functionality. The content is provided as IPython and Jupyter notebooks, and hence we will also take a quick look at using both of those tools. We will utilize the Anaconda Scientific Python distribution from Continuum. Anaconda is a popular Python distribution with both free and paid components, and it provides cross-platform support, including Windows, Mac, and Linux. The base distribution of Anaconda installs pandas, IPython, and Jupyter Notebook, making it almost trivial to get started. (For more resources related to this topic, see here.)

IPython and Jupyter Notebook

So far we have executed Python from the command line/terminal. This is the default Read-Eval-Print Loop (REPL) that comes with Python. Let's take a brief look at both IPython and Jupyter Notebook.

IPython

IPython is an alternative shell for working interactively with Python. It provides several enhancements to the default REPL provided by Python. If you want to learn about IPython in more detail, check out the documentation at https://ipython.org/ipython-doc/3/interactive/tutorial.html.

To start IPython, execute the ipython command from the command line/terminal. When it starts, you will see an input prompt that shows In [1]:. Each time you enter a statement in the IPython REPL, the number in the prompt increases. Likewise, output from any particular entry is prefaced with Out [x]:, where x matches the number of the corresponding In [x]:. This numbering of in and out statements is important to the examples, as all examples are prefaced with In [x]: and Out [x]: so that you can follow along. Note that these numberings are purely sequential. If you are following the code in the text and make an error in input or enter additional statements, the numbering may get out of sequence (it can be reset by exiting and restarting IPython). Please use the numbers purely as a reference.

Jupyter Notebook

Jupyter Notebook is the evolution of IPython Notebook. It is an open source web application that allows you to create and share documents that contain live code, equations, visualizations, and markdown. The original IPython Notebook was constrained to Python as the only language; Jupyter Notebook has evolved to allow many programming languages to be used, including Python, R, Julia, Scala, and F#. If you want to take a deeper look at Jupyter Notebook, head to http://jupyter.org/. Jupyter Notebook can be downloaded and used independently of Python, and Anaconda installs it by default.

To start a Jupyter Notebook, issue the following command at the command line/terminal:

$ jupyter notebook

To demonstrate, let's look at how to run the example code that comes with the text. Download the code from the Packt website and unzip the file to a directory of your choosing. In that directory, issue the jupyter notebook command. A browser window will open displaying the Jupyter Notebook home page at http://localhost:8888/tree, which is a directory listing of the notebooks. Clicking on a .ipynb link opens a notebook page. The notebook that is displayed is HTML that was generated by Jupyter and IPython.
A notebook consists of a number of cells that can be one of four types: code, markdown, raw nbconvert, or heading. Jupyter runs an IPython kernel for each notebook. Cells that contain Python code are executed within that kernel and the results are added to the notebook as HTML. Double-clicking on any cell makes the cell editable. When you are done editing the contents of a cell, press Shift + Enter, at which point Jupyter/IPython will evaluate the contents and display the results. If you want to learn more about the notebook format that underlies the pages, see https://ipython.org/ipython-doc/3/notebook/nbformat.html.

The toolbar at the top of a notebook gives you a number of ways to manipulate the notebook. These include adding, removing, and moving cells up and down in the notebook. Also available are commands to run cells, rerun cells, and restart the underlying IPython kernel.

To create a new notebook, go to the File > New Notebook > Python 3 menu item. A new notebook page will be created in a new browser tab; its name will be Untitled. The notebook consists of a single code cell that is ready to have Python entered. Enter 1+1 in the cell and press Shift + Enter to execute it. The cell is executed and the result is shown in the corresponding Out [x]: line, and Jupyter opens a new cell for you to enter more code or markdown. Jupyter Notebook automatically saves your changes every minute, but it is still a good idea to save manually every once in a while.

One final point before we look at a little bit of pandas: code in the text will be in the format of command-line IPython. As an example, the cell we just created in our notebook will be shown as follows:

In [1]: 1+1
Out [1]: 2

Introducing the pandas Series and DataFrame

Let's jump into using pandas with a brief introduction to its two main data structures, the Series and the DataFrame. We will examine the following:

- Importing pandas into your application
- Creating and manipulating a pandas Series
- Creating and manipulating a pandas DataFrame
- Loading data from a file into a DataFrame

The pandas Series

The pandas Series is the base data structure of pandas. A Series is similar to a NumPy array, but it differs by having an index, which allows for much richer lookup of items than just a zero-based array position. The following creates a Series from a Python list:

In [2]: # create a four item Series
        s = Series([1, 2, 3, 4])
        s
Out [2]: 0    1
         1    2
         2    3
         3    4
         dtype: int64

The output consists of two columns of information. The first is the index and the second is the data in the Series. Each row of the output shows the index label (in the first column) and then the value associated with that label. Because this Series was created without specifying an index (something we will do next), pandas automatically creates an integer index with labels starting at 0 and increasing by one for each data item.

The values of a Series object can then be accessed using the [] operator, passing the label for the value you require. The following gets the value for the label 1:

In [3]: s[1]
Out [3]: 2

This looks very much like normal array access in many programming languages. But as we will see, the index does not have to start at 0, nor increment by one, and it can be of many data types other than integers. This ability to associate flexible indexes with data is one of the great superpowers of pandas. Multiple items can be retrieved by specifying their labels in a Python list.
The following retrieves the values at labels 1 and 3:

In [4]: # return a Series with the rows with labels 1 and 3
        s[[1, 3]]
Out [4]: 1    2
         3    4
         dtype: int64

A Series object can be created with a user-defined index by using the index parameter and specifying the index labels. The following creates a Series with the same values but with an index consisting of string values:

In [5]: # create a series using an explicit index
        s = pd.Series([1, 2, 3, 4], index = ['a', 'b', 'c', 'd'])
        s
Out [5]: a    1
         b    2
         c    3
         d    4
         dtype: int64

Data in the Series object can now be accessed by those alphanumeric index labels. The following retrieves the values at index labels 'a' and 'd':

In [6]: # look up items in the series having index 'a' and 'd'
        s[['a', 'd']]
Out [6]: a    1
         d    4
         dtype: int64

It is still possible to refer to the elements of this Series object by their numerical 0-based position:

In [7]: # passing a list of integers to a Series that has
        # non-integer index labels will look up based upon
        # 0-based index like an array
        s[[1, 2]]
Out [7]: b    2
         c    3
         dtype: int64

We can examine the index of a Series using the .index property:

In [8]: # get only the index of the Series
        s.index
Out [8]: Index(['a', 'b', 'c', 'd'], dtype='object')

The index is itself actually a pandas object, and this output shows us the values of the index and the data type used for the index. In this case, note that the type of the data in the index (referred to as the dtype) is object and not string.

A common usage of a Series in pandas is to represent a time series that associates date/time index labels with values. The following demonstrates by creating a date range using the pandas function pd.date_range():

In [9]: # create a Series whose index is a series of dates
        # between the two specified dates (inclusive)
        dates = pd.date_range('2016-04-01', '2016-04-06')
        dates
Out [9]: DatetimeIndex(['2016-04-01', '2016-04-02', '2016-04-03',
                        '2016-04-04', '2016-04-05', '2016-04-06'],
                       dtype='datetime64[ns]', freq='D')

This has created a special index in pandas referred to as a DatetimeIndex, a specialized type of pandas index that is optimized to index data with dates and times.

Now let's create a Series using this index. The data values represent high temperatures on those specific days:

In [10]: # create a Series with values (representing temperatures)
         # for each date in the index
         temps1 = Series([80, 82, 85, 90, 83, 87], index = dates)
         temps1
Out [10]: 2016-04-01    80
          2016-04-02    82
          2016-04-03    85
          2016-04-04    90
          2016-04-05    83
          2016-04-06    87
          Freq: D, dtype: int64

This type of Series with a DatetimeIndex is referred to as a time series. We can look up a temperature on a specific date by using the date as a string:

In [11]: temps1['2016-04-04']
Out [11]: 90

Two Series objects can be combined with an arithmetic operation. The following code creates a second Series and calculates the difference in temperature between the two:

In [12]: # create a second series of values using the same index
         temps2 = Series([70, 75, 69, 83, 79, 77], index = dates)
         # the following aligns the two by their index values
         # and calculates the difference at those matching labels
         temp_diffs = temps1 - temps2
         temp_diffs
Out [12]: 2016-04-01    10
          2016-04-02     7
          2016-04-03    16
          2016-04-04     7
          2016-04-05     4
          2016-04-06    10
          Freq: D, dtype: int64

The result of an arithmetic operation (+, -, /, *, ...) on two Series objects is another Series object, with the operands aligned by their index labels.
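It is worth emphasizing that this alignment is by label, not by position. The following short example is not from the original article; it is an illustrative sketch, assuming the same import of pandas as pd used throughout, showing what happens when the two Series do not share all of their index labels:

import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# labels are aligned before the addition; labels present in only one
# Series ('a' and 'd') have no matching value and so produce NaN
s1 + s2
# a     NaN
# b    12.0
# c    23.0
# d     NaN
# dtype: float64

Note that the result dtype becomes float64, because NaN cannot be represented in an integer Series.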
Since the index labels in temp_diffs are not integers, we can also look up values by their 0-based position:

In [13]: # and also possible by integer position as if the
         # series was an array
         temp_diffs[2]
Out [13]: 16

Finally, pandas provides many descriptive statistical methods. As an example, the following returns the mean of the temperature differences:

In [14]: # calculate the mean of the values in the Series
         temp_diffs.mean()
Out [14]: 9.0

The pandas DataFrame

A pandas Series can only have a single value associated with each index label. To have multiple values per index label we can use a DataFrame. A DataFrame represents one or more Series objects aligned by index label. Each Series is a column in the DataFrame, and each column can have an associated name. In a way, a DataFrame is analogous to a relational database table in that it contains one or more columns of data of heterogeneous types (but a single type for all items in each respective column).

The following creates a DataFrame object with two columns using the temperature Series objects:

In [15]: # create a DataFrame from the two series objects temps1
         # and temps2 and give them column names
         temps_df = DataFrame(
             {'Missoula': temps1,
              'Philadelphia': temps2})
         temps_df
Out [15]:             Missoula  Philadelphia
          2016-04-01        80            70
          2016-04-02        82            75
          2016-04-03        85            69
          2016-04-04        90            83
          2016-04-05        83            79
          2016-04-06        87            77

The resulting DataFrame has two columns named Missoula and Philadelphia. These columns are new Series objects contained within the DataFrame, with the values copied from the original Series objects.

Columns in a DataFrame object can be accessed using an array indexer [] with the name of the column or a list of column names. The following code retrieves the Missoula column:

In [16]: # get the column with the name Missoula
         temps_df['Missoula']
Out [16]: 2016-04-01    80
          2016-04-02    82
          2016-04-03    85
          2016-04-04    90
          2016-04-05    83
          2016-04-06    87
          Freq: D, Name: Missoula, dtype: int64

And the following code retrieves the Philadelphia column:

In [17]: # likewise we can get just the Philadelphia column
         temps_df['Philadelphia']
Out [17]: 2016-04-01    70
          2016-04-02    75
          2016-04-03    69
          2016-04-04    83
          2016-04-05    79
          2016-04-06    77
          Freq: D, Name: Philadelphia, dtype: int64

A Python list of column names can also be used to return multiple columns:

In [18]: # return both columns in a different order
         temps_df[['Philadelphia', 'Missoula']]
Out [18]:             Philadelphia  Missoula
          2016-04-01            70        80
          2016-04-02            75        82
          2016-04-03            69        85
          2016-04-04            83        90
          2016-04-05            79        83
          2016-04-06            77        87

There is a subtle difference here between a DataFrame object and a Series object. Passing a list to the [] operator of a DataFrame retrieves the specified columns, whereas a Series would return rows.

If the name of a column does not have spaces, it can be accessed using property-style syntax:

In [19]: # retrieve the Missoula column through property syntax
         temps_df.Missoula
Out [19]: 2016-04-01    80
          2016-04-02    82
          2016-04-03    85
          2016-04-04    90
          2016-04-05    83
          2016-04-06    87
          Freq: D, Name: Missoula, dtype: int64

Arithmetic operations between columns within a DataFrame are identical in operation to those on multiple Series.
To demonstrate, the following code calculates the difference between the temperatures using property notation:

In [20]: # calculate the temperature difference between the two
         # cities
         temps_df.Missoula - temps_df.Philadelphia
Out [20]: 2016-04-01    10
          2016-04-02     7
          2016-04-03    16
          2016-04-04     7
          2016-04-05     4
          2016-04-06    10
          Freq: D, dtype: int64

A new column can be added to a DataFrame simply by assigning another Series to a column using the array indexer [] notation. The following adds a new column to the DataFrame with the temperature differences:

In [21]: # add a column to temps_df which contains the difference
         # in temps
         temps_df['Difference'] = temp_diffs
         temps_df
Out [21]:             Missoula  Philadelphia  Difference
          2016-04-01        80            70          10
          2016-04-02        82            75           7
          2016-04-03        85            69          16
          2016-04-04        90            83           7
          2016-04-05        83            79           4
          2016-04-06        87            77          10

The names of the columns in a DataFrame are accessible via the .columns property:

In [22]: # get the columns, which is also an Index object
         temps_df.columns
Out [22]: Index(['Missoula', 'Philadelphia', 'Difference'], dtype='object')

DataFrame (and Series) objects can be sliced to retrieve specific rows. The following slices the second through fourth rows of temperature difference values:

In [23]: # slice the temp differences column for the rows at
         # location 1 through 4 (as though it is an array)
         temps_df.Difference[1:4]
Out [23]: 2016-04-02     7
          2016-04-03    16
          2016-04-04     7
          Freq: D, Name: Difference, dtype: int64

Entire rows of a DataFrame can be retrieved using the .loc and .iloc properties. .loc looks up rows by index label, whereas .iloc uses the 0-based position. The following retrieves the second row of the DataFrame:

In [24]: # get the row at array position 1
         temps_df.iloc[1]
Out [24]: Missoula        82
          Philadelphia    75
          Difference       7
          Name: 2016-04-02 00:00:00, dtype: int64

Notice that this result has converted the row into a Series, with the column names of the DataFrame pivoted into the index labels of the resulting Series. The following shows the resulting index of the result:

In [25]: # the names of the columns have become the index
         # they have been 'pivoted'
         temps_df.iloc[1].index
Out [25]: Index(['Missoula', 'Philadelphia', 'Difference'], dtype='object')

Rows can be explicitly accessed via index label using the .loc property. The following code retrieves a row by its index label:

In [26]: # retrieve row by index label using .loc
         temps_df.loc['2016-04-05']
Out [26]: Missoula        83
          Philadelphia    79
          Difference       4
          Name: 2016-04-05 00:00:00, dtype: int64

Specific rows in a DataFrame object can be selected using a list of integer positions. The following selects the values from the Difference column in the rows at integer locations 1, 3, and 5:

In [27]: # get the values in the Difference column in rows 1, 3
         # and 5 using 0-based location
         temps_df.iloc[[1, 3, 5]].Difference
Out [27]: 2016-04-02     7
          2016-04-04     7
          2016-04-06    10
          Freq: 2D, Name: Difference, dtype: int64

Rows of a DataFrame can also be selected based upon a logical expression applied to the data in each row. The following shows which values in the Missoula column are greater than 82 degrees:

In [28]: # which values in the Missoula column are > 82?
         temps_df.Missoula > 82
Out [28]: 2016-04-01    False
          2016-04-02    False
          2016-04-03     True
          2016-04-04     True
          2016-04-05     True
          2016-04-06     True
          Freq: D, Name: Missoula, dtype: bool

The result of such an expression can then be passed to the [] operator of a DataFrame (and of a Series), which returns only the rows where the expression evaluated to True:

In [29]: # return the rows where the temps for Missoula > 82
         temps_df[temps_df.Missoula > 82]
Out [29]:             Missoula  Philadelphia  Difference
          2016-04-03        85            69          16
          2016-04-04        90            83           7
          2016-04-05        83            79           4
          2016-04-06        87            77          10

This technique is referred to as Boolean selection in pandas terminology, and it forms the basis of selecting rows based upon values in specific columns (like a query in SQL using a WHERE clause, but, as we will see, much more powerful).

Visualization

We will dive into visualization in quite some depth in Chapter 14, Visualization, but prior to that we will occasionally perform a quick visualization of data in pandas. Creating a visualization of data is quite simple with pandas: all that needs to be done is to call the .plot() method. The following demonstrates by plotting the Close value of the stock data:

In [40]: df[['Close']].plot();

Summary

In this article, we took an introductory look at the pandas Series and DataFrame objects, demonstrating some of their fundamental capabilities. This exposition showed you how to perform a few basic operations that you can use to get up and running with pandas prior to diving in and learning all the details.

Resources for Article:

Further resources on this subject:

- Using indexes to manipulate pandas objects [article]
- Predicting Sports Winners with Decision Trees and pandas [article]
- The pandas Data Structures [article]

Machine Learning Review

Packt
18 Jul 2017
20 min read
In this article by Uday Kamath and Krishna Choppella, authors of the book Mastering Java Machine Learning, we will discuss how recent years have seen a revival of interest in artificial intelligence (AI), and machine learning in particular, both in academic circles and in industry. In the last decade, AI has seen dramatic successes that eluded practitioners in the intervening years, after the original promise of the field gave way to relative decline until its re-emergence in the last few years. (For more resources related to this topic, see here.)

What made these successes possible, in large part, was the availability of prodigious amounts of data and the inexorable increase in raw computational power. Among the areas of AI leading the resurgence, machine learning has seen spectacular developments and continues to find the widest applicability in an array of domains. The use of machine learning to help in complex decision making at the highest levels of business, and at the same time its enormous success in improving the accuracy of what are now everyday applications, such as search, speech recognition, and personal assistants on mobile phones, has made its effects commonplace in the family room and the boardroom alike. Articles breathlessly extolling the power of "deep learning" can be found today not only in the popular science and technology press, but also in mainstream outlets such as The New York Times and The Huffington Post. Machine learning has indeed become ubiquitous in a relatively short time.

An ordinary user encounters machine learning in many day-to-day activities. Interacting with well-known e-mail providers such as Gmail gives the user automated sorting and categorization of e-mails into categories such as spam, junk, promotions, and so on, which is made possible using text mining, a branch of machine learning. When shopping online for products on e-commerce websites such as https://www.amazon.com/ or watching movies from content providers such as http://netflix.com/, one is offered recommendations for other products and content by so-called recommender systems, another branch of machine learning. Forecasting the weather, estimating real estate prices, predicting voter turnout, and even election results all use some form of machine learning to see into the future, as it were.

The ever-growing availability of data, and the promise of systems that can enrich our lives by learning from that data, place a growing demand on the skills of a limited workforce of professionals in the field of data science. This demand is particularly acute for well-trained experts who know their way around the landscape of machine learning techniques in the more popular languages, including Java, Python, R, and increasingly, Scala. By far, the number and availability of machine learning libraries, tools, APIs, and frameworks in Java outstrip those in other languages. Consequently, mastery of these skills will put any aspiring professional with a desire to enter the field at a distinct advantage in the marketplace. Perhaps you already apply machine learning techniques in your professional work, or maybe you simply have a hobbyist's interest in the subject. Clearly, you can bend Java to your will, but now you feel you're ready to dig deeper and learn how to use the best-of-breed open source Java ML frameworks in your next data science project.
Mastery of a subject, especially one with such obvious applicability as machine learning, requires more than an understanding of its core concepts and familiarity with its mathematical underpinnings. Unlike an introductory treatment of the subject, a project that purports to help you master the subject must be heavily focused on practical aspects, in addition to introducing more advanced topics that would have stretched the scope of the introductory material. To warm up before we embark on sharpening our instrument, we will devote this article to a quick review of what we already know. For the ambitious novice with little or no prior exposure to the subject (who is nevertheless determined to get the fullest benefit from this article), here's our advice: make sure you do not skip the rest of this article; instead, use it as a springboard to explore unfamiliar concepts in more depth. Seek out external resources as necessary. Wikipedia it. Then jump right back in.

For the rest of this article, we will review the following:

- History and definitions
- What is not machine learning?
- Concepts and terminology
- Important branches of machine learning
- Different data types in machine learning
- Applications of machine learning
- Issues faced in machine learning
- The meta-process used in most machine learning projects
- Information on some well-known tools, APIs, and resources that we will employ in this article

Machine learning – history and definition

It is difficult to give an exact history, but the definition of machine learning we use today finds its usage as early as the 1860s. In Rene Descartes' Discourse on the Method, he refers to Automata and says the following:

For we can easily understand a machine's being constituted so that it can utter words, and even emit some responses to action on it of a corporeal kind, which brings about a change in its organs; for instance, if touched in a particular part it may ask what we wish to say to it; if in another part it may exclaim that it is being hurt, and so on

http://www.earlymoderntexts.com/assets/pdfs/descartes1637.pdf
https://www.marxists.org/reference/archive/descartes/1635/discourse-method.htm

Alan Turing, in his famous publication Computing Machinery and Intelligence, gives basic insights into the goals of machine learning by asking the question "Can machines think?"

http://csmt.uchicago.edu/annotations/turing.htm
http://www.csee.umbc.edu/courses/471/papers/turing.pdf

Arthur Samuel, in 1959, wrote: "Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed."

Tom Mitchell, in recent times, gave a more exact definition of machine learning: "A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E." For a spam filter, for example, the task T is classifying e-mails, the experience E is a collection of e-mails already labeled as spam or not spam, and the performance measure P could be the accuracy of the classification.
Machine learning has a relationship with several areas:

- Statistics: This uses the elements of data sampling, estimation, hypothesis testing, learning theory, and statistics-based modeling, to name a few
- Algorithms and computation: This uses the basics of search, traversal, parallelization, distributed computing, and so on from basic computer science
- Database and knowledge discovery: This has the ability to store, retrieve, and access information in various formats
- Pattern recognition: This has the ability to find interesting patterns in the data, either to explore, visualize, or predict
- Artificial intelligence: Though machine learning is considered a branch of artificial intelligence, it also has relationships with other branches, such as heuristics, optimization, evolutionary computing, and so on

What is not machine learning?

It is important to recognize areas that share a connection with machine learning but cannot themselves be considered part of machine learning. Some disciplines may overlap to a smaller or larger extent, yet the principles underlying machine learning are quite distinct:

- Business intelligence and reporting: Reporting Key Performance Indicators (KPIs), querying OLAP for slicing, dicing, and drilling into the data, dashboards, and so on, which form the central components of BI, are not machine learning.
- Storage and ETL: Data storage and ETL are key elements needed in any machine learning process, but by themselves they don't qualify as machine learning.
- Information retrieval, search, and queries: The ability to retrieve data or documents based on search criteria or indexes, which forms the basis of information retrieval, is not really machine learning. Many forms of machine learning, such as semi-supervised learning, can rely on searching for similar data for modeling, but that doesn't qualify search as machine learning.
- Knowledge representation and reasoning: Representing knowledge for performing complex tasks, such as Ontology, Expert Systems, and the Semantic Web, does not qualify as machine learning.

Machine learning – concepts and terminology

In this article, we will describe the different concepts and terms normally used in machine learning:

- Data or dataset: The basics of machine learning rely on understanding the data. The data or dataset normally refers to content, available in structured or unstructured format, for use in machine learning. Structured datasets have specific formats, whereas an unstructured dataset is normally in the form of some free-flowing text. Data can be available in various storage types or formats. In structured data, every element, known as an instance, example, or row, follows a predefined structure. Data can also be categorized by size: small or medium data have a few hundred to thousands of instances, whereas big data refers to a large volume, mostly in the millions or billions, which cannot be stored or accessed using common devices or fit in the memory of such devices.
- Features, attributes, variables, or dimensions: In structured datasets, as mentioned earlier, there are predefined elements with their own semantics and data type, which are known variously as features, attributes, variables, or dimensions.
- Data types: The features defined above need some form of typing in many machine learning algorithms or techniques. The most commonly used data types are as follows:
  - Categorical or nominal: This indicates well-defined categories or values present in the dataset.
    For example, eye color, such as black, blue, brown, green, or grey; or document content type, such as text, image, or video.
  - Continuous or numeric: This indicates the numeric nature of the data field. For example, a person's weight measured by a bathroom scale, the temperature reading from a sensor, or the monthly balance in dollars on a credit card account.
  - Ordinal: This denotes data that can be ordered in some way. For example, garment sizes, such as small, medium, or large; or boxing weight classes, such as heavyweight, light heavyweight, middleweight, lightweight, and bantamweight.
- Target or label: A feature or set of features in the dataset, which is used for learning from the training data and for prediction on unseen data, is known as a target or a label. A label can have any of the forms specified earlier, that is, categorical, continuous, or ordinal.
- Machine learning model: Each machine learning algorithm, based on what it learned from the dataset, maintains the state of its learning for predicting or giving insights into future or unseen data. This is referred to as the machine learning model.
- Sampling: Data sampling is an essential step in machine learning. Sampling means choosing a subset of examples from a population with the intent of treating the behavior seen in the (smaller) sample as being representative of the behavior of the (larger) population. In order for the sample to be representative of the population, care must be taken in the way the sample is chosen. Generally, a population consists of every object sharing the properties of interest in the problem domain, for example, all people eligible to vote in the general election, or all potential automobile owners in the next four years. Since it is usually prohibitive (or impossible) to collect data for all the objects in a population, a well-chosen subset is selected for the purposes of analysis. A crucial consideration in the sampling process is that the sample be unbiased with respect to the population. The following are types of probability-based sampling:
  - Uniform random sampling: A sampling method where the sampling is done over a uniformly distributed population, that is, each object has an equal probability of being chosen.
  - Stratified random sampling: A sampling method used when the data can be categorized into multiple classes. In such cases, in order to ensure all categories are represented in the sample, the population is divided into distinct strata based on these classifications, and each stratum is sampled in proportion to the fraction of its class in the overall population. Stratified sampling is common when the population density varies across categories, and it is important to compare these categories with the same statistical power.
  - Cluster sampling: Sometimes there are natural groups among the population being studied, and each group is representative of the whole population. An example is data that spans many geographical regions. In cluster sampling, you take a random subset of the groups followed by a random sample from within each of those groups to construct the full data sample. This kind of sampling can reduce the cost of data collection without compromising the fidelity of distribution in the population.
  - Systematic sampling: Systematic or interval sampling is used when there is a certain ordering present in the sampling frame (a finite set of objects treated as the population and taken to be the source of data for sampling, for example, the corpus of Wikipedia articles arranged lexicographically by title).
    If the sample is then selected by starting at a random object and skipping a constant k number of objects before selecting the next one, that is called systematic sampling. The interval k is calculated as the ratio of the population size to the sample size.
- Model evaluation metrics: Evaluating models for performance is generally based on different evaluation metrics for different types of learning. In classification, it is generally based on accuracy, receiver operating characteristic (ROC) curves, training speed, memory requirements, false positive ratio, and so on. In clustering, the number of clusters found, cohesion, separation, and so on form the general metrics. In stream-based learning, apart from the standard metrics mentioned earlier, adaptability, speed of learning, and robustness to sudden changes are some of the conventional metrics for evaluating the performance of the learner.

To illustrate these concepts, a concrete example in the form of a well-known weather dataset is given. The data gives a set of weather conditions and a label that indicates whether the subject decided to play a game of tennis on that day or not:

@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no

The dataset is in the ARFF (Attribute-Relation File Format) format. It consists of a header giving information about the features or attributes, with their data types, and the actual comma-separated data following the @data tag. The dataset has five features: outlook, temperature, humidity, windy, and play. The features outlook and windy are categorical features, while humidity and temperature are continuous. The feature play is the target and is categorical.

Machine learning – types and subtypes

We will now explore the different subtypes or branches of machine learning. Though the following list is not comprehensive, it covers the most well-known types:

- Supervised learning: This is the most popular branch of machine learning, which is about learning from labeled data. If the data type of the label is categorical, it becomes a classification problem, and if numeric, it becomes a regression problem. For example, if the target of the dataset is the detection of fraud, which has categorical values of either true or false, we are dealing with a classification problem. If, on the other hand, the target is the best price to list the sale of a home at, which is a numeric dollar value, the problem is one of regression. One figure in the original illustrates labeled data that is conducive to classification techniques suitable for linearly separable data, such as logistic regression (figure: Linearly separable data); a dataset that is not linearly separable calls for classification techniques such as Support Vector Machines.
- Unsupervised learning: Understanding the data and exploring it in order to build machine learning models when the labels are not given is called unsupervised learning. Clustering, manifold learning, and outlier detection are techniques that are covered under this topic.
    Examples of problems that require unsupervised learning are many; grouping customers according to their purchasing behavior is one example. In the case of biological data, tissue samples can be clustered based on similar gene expression values using unsupervised learning techniques. The following diagram represents data with inherent structure that can be revealed as distinct clusters using an unsupervised learning technique such as K-Means:

[Figure: Clusters in data]

    Different techniques are used to detect global outliers (examples that are anomalous with respect to the entire dataset) and local outliers (examples that are misfits in their neighborhood). The following diagram illustrates the notion of local and global outliers for a two-feature dataset:

[Figure: Local and global outliers]

- Semi-supervised learning: When the dataset has some labeled data along with a large amount of unlabeled data, learning from such a dataset is called semi-supervised learning. When dealing with financial data with the goal of detecting fraud, for example, there may be a large amount of unlabeled data and only a small number of known fraud and non-fraud transactions. In such cases, semi-supervised learning may be applied.
- Graph mining: Mining data represented as graph structures is known as graph mining. It is the basis of social network analysis and structure analysis in different bioinformatics, web mining, and community mining applications.
- Probabilistic graph modeling and inferencing: Learning and exploiting the structures present between features to model the data comes under the branch of probabilistic graph modeling.
- Time-series forecasting: This is a form of learning where the data has distinct temporal behavior and the relationship with time is modeled. A common example is in financial forecasting, where the performance of stocks in a certain sector may be the target of the predictive model.
- Association analysis: This is a form of learning where the data is in the form of an item set or market basket, and association rules are modeled to explore and predict the relationships between the items. A common example in association analysis is learning the relationships between the most common items bought by customers when they visit the grocery store.
- Reinforcement learning: This is a form of learning where machines learn to maximize performance based on feedback in the form of rewards or penalties received from the environment. A recent example that famously used reinforcement learning was AlphaGo, the machine developed by Google that beat the world Go champion Lee Sedol decisively in March 2016. Using a reward and penalty scheme, the model first trained on millions of board positions in the supervised learning stage, then played itself in the reinforcement learning stage to ultimately become good enough to triumph over the best human player.
  http://www.theatlantic.com/technology/archive/2016/03/the-invisible-opponent/475611/
  https://gogameguru.com/i/2016/03/deepmind-mastering-go.pdf
- Stream learning or incremental learning: Learning in a supervised, unsupervised, or semi-supervised manner from stream data in real time or pseudo-real time is called stream or incremental learning. Learning the behavior of sensors from different types of industrial systems, in order to categorize them as normal or abnormal, needs a real-time feed and real-time detection.
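Before moving on to datasets, here is a small concrete illustration of the supervised case using the weather data shown earlier. The book itself works with Java frameworks, so the following sketch is purely illustrative and not from the original text; it assumes a Python environment with pandas and scikit-learn available:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# a few rows of the weather data from the ARFF example; 'play' is the target label
data = pd.DataFrame({
    'outlook':     ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy', 'overcast'],
    'temperature': [85, 80, 83, 70, 68, 65, 64],
    'humidity':    [85, 90, 86, 96, 80, 70, 65],
    'windy':       [False, True, False, False, False, True, True],
    'play':        ['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes']})

# categorical features must be encoded as numbers for most learners
X = pd.get_dummies(data[['outlook', 'temperature', 'humidity', 'windy']])
y = data['play']

model = DecisionTreeClassifier().fit(X, y)   # learn from the labeled examples
print(model.predict(X[:3]))                  # predict labels for three (seen) examples

The same data with the play column removed would be the starting point for the unsupervised techniques mentioned above, such as clustering.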
Datasets used in machine learning

To learn from data, we must be able to understand and manage data in all forms. Data originates from many different sources, and consequently, datasets may differ widely in structure or have little or no structure at all. In this section, we present a high-level classification of datasets with commonly occurring examples. Based on structure, datasets may be classified as containing the following:

- Structured or record data: Structured data is the most common form of dataset available for machine learning. The data is in the form of records or rows following a well-known format, with features that are either columns in a table or fields delimited by separators or tokens. There is no explicit relationship between the records or instances. The dataset is available mostly in flat files or relational databases. The records of financial transactions at a bank are an example of structured data (figure: Financial card transactional data with labels of fraud).
- Transaction or market data: This is a special form of structured data where each record corresponds to a collection of items. Examples of market datasets are the list of grocery items purchased by different customers, or the movies viewed by customers (figure: Market dataset for items bought from a grocery store).
- Unstructured data: Unstructured data is normally not available in well-known formats, unlike structured data. Text, image, and video data are different formats of unstructured data. Normally, a transformation of some form is needed to extract features from these forms of data into a structured dataset so that traditional machine learning algorithms can be applied (figure: Sample text data from SMS with labels of spam and ham, by Tiago A. Almeida from the Federal University of Sao Carlos).
- Sequential data: Sequential data has an explicit notion of order. The order can be some relationship between features and a time variable in time-series data, or symbols repeating in some form in genomic datasets. Two examples are weather data and genomic sequence data. One figure in the original shows the relationship between time and the sensor level for weather (figure: Time series from sensor data); another shows three genomic sequences, illustrating the repetition of the sequences CGGGT and TTGAAAGTGGTG in all three (figure: Genomic sequences of DNA as sequences of symbols).
- Graph data: Graph data is characterized by the presence of relationships between entities in the data, forming a graph structure. Graph datasets may be in structured record format or unstructured format. Typically, the graph relationship has to be mined from the dataset. Claims in the insurance domain can be considered structured records containing relevant claim details, with claimants related through addresses, phone numbers, and so on; this can be viewed as a graph structure. Using the World Wide Web as an example, we have web pages available as unstructured data containing links, and graphs of relationships between web pages can be built using those links, producing some of the most mined graph datasets today (figure: Insurance claim data converted into a graph structure with relationships between vehicles, drivers, policies, and addresses).

Machine learning applications

Given the rapidly growing use of machine learning in diverse areas of human endeavor, any attempt to list typical applications in the different industries where some form of machine learning is in use must necessarily be incomplete.
Nevertheless, in this section we list a broad set of machine learning applications by domain, use, and the type of learning used:

Domain/Industry | Applications | Machine Learning Type
Financial | Credit Risk Scoring, Fraud Detection, Anti-Money Laundering | Supervised, Unsupervised, Graph Models, Time Series, and Stream Learning
Web | Online Campaigns, Health Monitoring, Ad Targeting | Supervised, Unsupervised, Semi-Supervised
Healthcare | Evidence-based Medicine, Epidemiological Surveillance, Drug Events Prediction, Claim Fraud Detection | Supervised, Unsupervised, Graph Models, Time Series, and Stream Learning
Internet of Things (IoT) | Cyber Security, Smart Roads, Sensor Health Monitoring | Supervised, Unsupervised, Semi-Supervised, and Stream Learning
Environment | Weather Forecasting, Pollution Modeling, Water Quality Measurement | Time Series, Supervised, Unsupervised, Semi-Supervised, and Stream Learning
Retail | Inventory, Customer Management and Recommendations, Layout and Forecasting | Time Series, Supervised, Unsupervised, Semi-Supervised, and Stream Learning

Summary

A revival of interest is seen in the area of artificial intelligence (AI), and machine learning in particular, both in academic circles and in industry. Machine learning is used to help in complex decision making at the highest levels of business, and it has also achieved enormous success in improving the accuracy of everyday applications, such as search, speech recognition, and personal assistants on mobile phones. The basics of machine learning rely on an understanding of data: structured datasets have specific formats, whereas an unstructured dataset is normally in the form of some free-flowing text. The two principal branches of machine learning covered here are supervised learning, which is about learning from labeled data, and unsupervised learning, which is about understanding and exploring the data in order to build machine learning models when labels are not given.

Resources for Article:

Further resources on this subject:

- Specialized Machine Learning Topics [article]
- Machine learning in practice [article]
- Introduction to Machine Learning with R [article]

Introduction to Moodle 3

Packt
17 Jul 2017
13 min read
In this article, Ian Wild, the author of the book Moodle 3.x Developer's Guide, will be introducing you to Moodle 3. For any organization considering implementing an online learning environment, Moodle is often the number one choice. Key to its success is the free, open source ethos that underpins it. Not only is the Moodle source code fully available to developers, but Moodle itself has been developed to allow for the inclusion of third-party plugins. Everything from how users access the platform and the kinds of teaching interactions that are available, through to how attendance and success can be reported - in fact, all the key Moodle functionality - can be adapted and enhanced through plugins. (For more resources related to this topic, see here.)

What is Moodle?

There are three reasons why Moodle has become so important, and much talked about, in the world of online learning: one technical, one philosophical, and the third educational. From a technical standpoint, Moodle - an acronym for Modular Object-Oriented Dynamic Learning Environment - is highly customizable. As a Moodle developer, always remember that the 'M' in Moodle stands for modular. If you are faced with a client request that demands a feature Moodle doesn't support, then don't panic. The answer is simple: we create a new custom plugin to implement it. Check out the Moodle Plugins Directory (https://moodle.org/plugins/) for a comprehensive library of supported third-party plugins that Moodle developers have created and given back to the community. And this leads to the philosophical reason why Moodle dominates.

Free open source software for education

Moodle is grounded firmly in a community-based, open source philosophy (see https://en.wikipedia.org/wiki/Open-source_model). But what does this mean for developers? Fundamentally, it means that we have complete access to the source code and, within reason, unfettered access to the people who develop it. Access to the application itself is free - you don't need to pay to download it and you don't need to pay to run it. But be aware of what 'free' means in this context. Hosting and administration, for example, take time and resources and are very unlikely to be free.

As an educational tool, Moodle was developed to support social constructionism (see https://docs.moodle.org/31/en/Pedagogy). If you are not familiar with this concept, it essentially suggests that building an understanding of a concept or idea can best be achieved by interacting with a broad community. The impact on us as Moodle plugin developers is that there is a highly active group of users and developers. Before you begin developing any Moodle plugins, come and join us at https://moodle.org.

Plugin Development – Authentication

In this article, we will be developing a novel plugin that seamlessly integrates Moodle and the WordPress content management system. Our plugin will authorize users via WordPress when they click on a link to Moodle from a WordPress page. The plugin discussed in this article has already been released to the Moodle community; check out the Moodle Plugins Directory at https://moodle.org/plugins/auth_wordpress for details. Let us start by learning what Moodle authentication is and how new user accounts are created.

Authentication

Moodle supports a range of different authentication methods out of the box, each one supported by its own plugin.
To see the list of available plugins, from the Administration block click on Site administration, click Plugins, then click Authentication, and finally click on Manage authentication. The list of currently installed authentication plugins is displayed. Each plugin interfaces with an internal Application Programming Interface (API), the Access API; see the Moodle developer documentation for details: https://docs.moodle.org/dev/Access_API

Getting logged in

There are two ways of prompting the Moodle authentication process:

- Attempting to log in from the log in page.
- Clicking on a link to a protected resource (that is, a page or file that you can't view or download without logging in).

For an overview of the process, take a look at the developer documentation at https://docs.moodle.org/dev/Authentication_plugins#Overview_of_Moodle_authentication_process. After checks to determine whether an upgrade is required (or whether we are partway through the upgrade process), there is a short fragment of code that loads the configured authentication plugins and, for each one, calls a special method named loginpage_hook():

$authsequence = get_enabled_auth_plugins(true); // auths, in sequence
foreach($authsequence as $authname) {
    $authplugin = get_auth_plugin($authname);
    $authplugin->loginpage_hook();
}

The loginpage_hook() function gives each authentication plugin the chance to intercept the login. Assuming that the login has not been intercepted, the process then continues with a check to ensure the supplied username conforms to the configured standard before calling authenticate_user_login(), which, if successful, returns a $user object.

OAuth Overview

The OAuth authentication mechanism provides secure delegated access. OAuth supports a number of scenarios, including:

- A client requests access from a server and the server responds with either a 'confirm' or a 'deny'. This is called two-legged authentication.
- A client requests access from a server; the server pops up a confirmation dialog so that the user can authorize the access, and then finally responds with either a 'confirm' or a 'deny'. This is called three-legged authentication.

In this article we will be implementing the latter mechanism, which means, in practice, that:

- The authentication server will only talk to configured clients
- No passwords are exchanged between server and client - only tokens are exchanged, which are meaningless on their own
- By default, users need to give permission before resources are accessed

Having given an overview, here is the process again, described in a little more detail:

1. A new client is configured in the authentication server. The client is allocated a unique client key, along with a secret token (referred to as the client secret).
2. The client POSTs an HTTP request to the server (identifying itself using the client key and client secret) and the server responds with a temporary access token. This token is used to request authorization to access protected resources from the server. In this case, 'protected resources' means the WordPress API; access to the WordPress API will allow us to determine the details of the currently logged in user.
3. The server responds not with an HTTP response but by POSTing new permanent tokens back to the client via a callback URI (that is, the server talks to the client directly in order to ensure security).
4. The process ends with the client possessing permanent authorization tokens that can be used to access WP-API functions.
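Before diving into the PHP implementation, it can help to see the three legs end to end in one place. The following short sketch is not part of the plugin (which is written in PHP); it is an illustrative example in Python using the requests_oauthlib package, with placeholder URLs and credentials standing in for the values WordPress would issue:

from requests_oauthlib import OAuth1Session

# placeholder credentials and endpoints; the real values come from the
# WordPress OAuth 1.0a server plugin configured later in this article
client = OAuth1Session('CLIENT_KEY', client_secret='CLIENT_SECRET',
                       callback_uri='https://moodle.example.com/auth/wordpress/callback.php')

# leg one: obtain a temporary request token from the server
client.fetch_request_token('https://wordpress.example.com/oauth1/request')

# leg two: send the user to the authorization page; once the user grants
# access, the server calls back with a verifier
print(client.authorization_url('https://wordpress.example.com/oauth1/authorize'))

# leg three: exchange the temporary token (plus verifier) for permanent tokens
tokens = client.fetch_access_token('https://wordpress.example.com/oauth1/access',
                                   verifier='VERIFIER_FROM_CALLBACK')
print(tokens)  # permanent oauth_token and oauth_token_secret for calling the WP-API

The Moodle plugin below performs the same exchanges, only in PHP and with the verifier arriving via callback.php.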
Obviously, the most effective way of learning about this process is to implement it, so let's go ahead and do that now.

Installing the WordPress OAuth 1.0a server

The first step is to add the OAuth 1.0a server plugin to WordPress. Why not the more recent OAuth 2.0 server plugin? This is because 2.0 only supports https:// and not http://. Also, internally (at least at the time of writing) WordPress will only authenticate internally using either OAuth 1.0a or cookies.

Log into WordPress as an administrator and, from the Dashboard, hover the mouse over the Plugins menu item and click on Installed Plugins. The Plugins page is displayed. At the top of the page, press the Add New button. As described previously, ensure that you install version 1.0a and not 2.0.

Once installed, we need to configure a new client. From the Dashboard menu, hover the mouse over Users and you will see that a new Applications menu item has been added. Click on this to display the Registered Applications page, then click the Add New button to display the Add Application page.

The Consumer Name is the title for our client that will appear on the Applications page, and Description is a brief explanation of that client to aid identification. The Callback is the URI that WordPress will talk to (refer to the outline of the OAuth authentication steps). As we have not yet developed the Moodle/OAuth client end, you can specify oob in Callback (this stands for 'out of band'). Once configured, WordPress will generate new OAuth credentials: a Client Key and a Client Secret.

Having installed and configured the server end, it is now time to develop the client.

Creating a new Moodle auth plugin

Before we begin, download the finished plugin from https://github.com/iandavidwild/moodle-auth_wordpress and install it in your local development Moodle instance. The development of a new authentication plugin is described in the developer documentation at https://docs.moodle.org/dev/Authentication_plugins. As described there, let us start by copying the none plugin (the no login authentication method) and using it as a template for our new plugin. In Eclipse, I'm going to copy the none plugin to a new authentication method called wordpress.

That done, we need to update the occurrences of auth_none to auth_wordpress. Firstly, rename /auth/wordpress/lang/en/auth_none.php to auth_wordpress.php. Then, in auth.php, rename the class auth_plugin_none to auth_plugin_wordpress. The Eclipse Find/Replace function is great for updating scripts. Next, we need to update the version information in version.php: update all the relevant names, descriptions, and dates. Finally, we can check that Moodle recognises our new plugin by navigating to the Site administration menu and clicking on Notifications. If installation is successful, our new plugin will be listed on the Available authentication plugins page.

Configuration

Let us start by considering the plugin configuration. We will need to allow a Moodle administrator to configure the following:

- The URL of the WordPress installation
- The client key and client secret provided by WordPress

There is very little flexibility in the design of an authentication plugin configuration page, so at this stage, rather than creating a wireframe drawing and having it agreed with the client, we can simply go ahead and write the code. The configuration page is defined in /config.html. Remember to start by declaring the relevant language strings in /lang/en/auth_wordpress.php.
Configuration settings themselves will be managed by the Moodle framework by calling our plugin's process_config() method. Here is the declaration:

/**
 * Processes and stores configuration data for this authentication plugin.
 *
 * @return bool
 */
function process_config($config) {
    // Set to defaults if undefined
    if (!isset($config->wordpress_host)) {
        $config->wordpress_host = '';
    }
    if (!isset($config->client_key)) {
        $config->client_key = '';
    }
    if (!isset($config->client_secret)) {
        $config->client_secret = '';
    }

    set_config('wordpress_host', trim($config->wordpress_host), 'auth/wordpress');
    set_config('client_key', trim($config->client_key), 'auth/wordpress');
    set_config('client_secret', trim($config->client_secret), 'auth/wordpress');

    return true;
}

Having dealt with configuration, let us now start managing the actual OAuth process.

Handling OAuth calls

Rather than go into the details of how we can send HTTP requests to WordPress, let's use a third-party library to do this work. The code I'm going to use is based on Abraham Williams' twitteroauth library (see https://github.com/abraham/twitteroauth). In Eclipse, take a look at the files OAuth.php and BasicOAuth.php for details. To use the library, you will need to add the following lines to the top of /wordpress/auth.php:

require_once($CFG->dirroot . '/auth/wordpress/OAuth.php');
require_once($CFG->dirroot . '/auth/wordpress/BasicOAuth.php');

use OAuth1\BasicOauth;

Let's now start work on handling the Moodle login event.

Handling the Moodle login event

When a user clicks on a link to a protected resource, Moodle calls loginpage_hook() in each enabled authentication plugin. To handle this, let us first implement loginpage_hook(). We need to add the following lines to auth.php:

/**
 * Will get called before the login page is shown.
 *
 */
function loginpage_hook() {
    $client_key = $this->config->client_key;
    $client_secret = $this->config->client_secret;
    $wordpress_host = $this->config->wordpress_host;

    if( (strlen($wordpress_host) > 0) && (strlen($client_key) > 0) && (strlen($client_secret) > 0) ) {
        // kick off the authentication process
        $connection = new BasicOAuth($client_key, $client_secret);

        // strip the trailing slashes from the end of the host URL to avoid
        // any confusion (and to make the code easier to read)
        $wordpress_host = rtrim($wordpress_host, '/');

        $connection->host = $wordpress_host . "/wp-json";
        $connection->requestTokenURL = $wordpress_host . "/oauth1/request";

        $callback = $CFG->wwwroot . '/auth/wordpress/callback.php';
        $tempCredentials = $connection->getRequestToken($callback);

        // Store temporary credentials in the $_SESSION
    }// if
}

This implements the first leg of the authentication process, and the variable $tempCredentials will now contain a temporary access token. We will need to store these temporary credentials and then call on the server to ask the user to authorize the connection (leg two). Add the following lines immediately after the // Store temporary credentials in the $_SESSION comment:

$_SESSION['oauth_token'] = $tempCredentials['oauth_token'];
$_SESSION['oauth_token_secret'] = $tempCredentials['oauth_token_secret'];

$connection->authorizeURL = $wordpress_host . "/oauth1/authorize";

$redirect_url = $connection->getAuthorizeURL($tempCredentials);

header('Location: ' . $redirect_url);
die;

Next, we need to implement the OAuth callback.
Now create the actual script, callback.php. The callback.php script will need to:

Sanity check the data being passed back from WordPress and fail gracefully if there is an issue
Get the wordpress authentication plugin instance (an instance of auth_plugin_wordpress)
Call on a handler method that will perform the authentication (which we will then need to implement)

The script is simple, short and available here: https://github.com/iandavidwild/moodle-auth_wordpress/blob/MOODLE_31_STABLE/callback.php

Now, in the auth.php script, we need to implement the callback_handler() method in auth_plugin_wordpress. You can check out the code on GitHub: visit https://github.com/iandavidwild/moodle-auth_wordpress/blob/MOODLE_31_STABLE/auth.php and scroll down to the callback_handler() method.

Lastly, let us add a fragment of code to the loginpage_hook() method that allows us to turn off WordPress authentication in config.php. Add the following to the very beginning of the loginpage_hook() function:

global $CFG;

if(isset($CFG->disablewordpressauth) && ($CFG->disablewordpressauth == true)) {
    return;
}

Summary

In this article, we introduced the Moodle learning platform, investigated the open source philosophy that underpins it, and saw how Moodle's functionality can be extended and enhanced through plugins. Starting from a pre-existing plugin, we developed a new WordPress authentication module that allows a user logged into WordPress to automatically log into Moodle. To do so, we implemented three-legged OAuth 1.0a authentication from WordPress to Moodle. Check out the complete code at https://github.com/iandavidwild/moodle-auth_wordpress/blob/MOODLE_31_STABLE/callback.php. More information on the plugin described in this article is available from the main Moodle website at https://moodle.org/plugins/auth_wordpress.

Resources for Article: Further resources on this subject: Introduction to Moodle [article] Moodle for Online Communities [article] An Introduction to Moodle 3 and MoodleCloud [article]


Initial Configuration of SCO 2016

Packt
17 Jul 2017
13 min read
In this article by Michael Seidl, author of the book Microsoft System Center 2016 Orchestrator Cookbook - Second Edition, will show you how to setup Orchestrator Environment and how to deploy and configure Orchestrator Integration Packs. (For more resources related to this topic, see here.) Deploying an additional Runbook designer Runbook designer is the key feature to build your Runbooks. After the initial installation, Runbook designer is installed on the server. For your daily work with orchestrator and Runbooks, you would like to install the Runbook designer on your client or on admin server. We will go through these steps in this recipe. Getting ready You must review the planning the Orchestrator deployment recipe before performing the steps in this recipe. There are a number of dependencies in the planning recipe you must perform in order to successfully complete the tasks in this recipe. You must install a management server before you can install the additional Runbook Designers. The user account performing the installation has administrative privileges on the server nominated for the SCO deployment and must also be a member of OrchestratorUsersGroup or equivalent rights. The example deployment in this recipe is based on the following configuration details: Management server called TLSCO01 with a remote database is already installed System Center 2016 Orchestrator How to do it... The Runbook designer is used to build Runbooks using standard activities and or integration pack activities. The designer can be installed on either a server class operating system or a client class operating system. Follow these steps to deploy an additional Runbook Designer using the deployment manager: Install a supported operating system and join the active directory domain in scope of the SCO deployment. In this recipe the operating system is Windows 10. Ensure you configure the allowed ports and services if the local firewall is enabled for the domain profile. See the following link for details: https://technet.microsoft.com/en-us/library/hh420382(v=sc.12).aspx. Log in to the SCO Management server with a user account with SCO administrative rights. Launch System Center 2016 Orchestrator Deployment Manager: Right-click on Runbook designers, and select Deploy new Runbook Designer: Click on Next on the welcome page. Type the computer name in the Computer field and click on Add. Click on Next. On the Deploy Integration Packs or Hotfixes page check all the integration packs required by the user of the Runbook designer (for this example we will select the AD IP). Click on Next. Click on Finish to begin the installation using the Deployment Manager. How it works... The Deployment Manager is a great option for scaling out your Runbook Servers and also for distributing the Runbook Designer without the need for the installation media. In both cases the Deployment Manager connects to the Management Server and the database server to configure the necessary settings. On the target system the deployment manager installs the required binaries and optionally deploys the integration packs selected. Using the Deployment Manager provides a consistent and coordinated approach to scaling out the components of a SCO deployment. 
See also The following official web link is a great source of the most up to date information on SCO: https://docs.microsoft.com/en-us/system-center/orchestrator/ Registering an SCO Integration Pack Microsoft System Center 2016 Orchestrator (SCO) automation is driven by process automation components. These process automation components are similar in concept to a physical toolbox. In a toolbox you typically have different types of tools which enable you to build what you desire. In the context of SCO these tools are known as Activities. Activities fall into two main categories: Built-in Standard Activities: These are the default activity categories available to you in the Runbook Designer. The standard activities on their own provide you with a set of components to create very powerful Runbooks. Integration Pack Activities: Integration Pack Activities are provided either by Microsoft, the community, solution integration organizations, or are custom created by using the Orchestrator Integration Pack Toolkit. These activities provide you with the Runbook components to interface with the target environment of the IP. For example, the Active Directory IP has the activities you can perform in the target Active Directory environment. This recipe provides the steps to find and register the second type of activities into your default implementation of SCO. Getting ready You must download the Integration Pack(s) you plan to deploy from the provider of the IP. In this example we will be deploying the Active Directory IP, which can be found at the following link: https://www.microsoft.com/en-us/download/details.aspx?id=54098. You must have deployed a System Center 2016 Orchestrator environment and have full administrative rights in the environment. How to do it... The following diagram provides a visual summary and order of the tasks you need to perform to complete this recipe: We will deploy the Microsoft Active Directory (AD) integration pack (IP). Integration pack organization A good practice is to create a folder structure for your integration packs. The folders should reflect versions of the IPs for logical grouping and management. The version of the IP will be visible in the console and as such you must perform this step after you have performed the step to load the IP(s). This approach will aid in change management when updating IPs in multiple environments. Follow these steps to deploy the Active Directory integration pack. Identify the source location for the Integration Pack in scope (for example, the AD IP for SCO2016). Download the IP to a local directory on the Management Server or UNC share. Log in to the SCO Management server. Launch the Deployment Manager: Under Orchestrator Management Server, right-click on Integration Packs. Select Register IP with the Orchestrator Management Server: Click on Next on the welcome page. Click on Add on the Select Integration Packs or Hotfixes page. Navigate to the directory where the target IP is located, click on Open, and then click on Next. Click on Finish . Click on Accept on End-User License Agreement to complete the registration. Click on Refresh to validate if the IP has successfully been registered. How it works... The process of loading an integration pack is simple. The prerequisite for successfully registering the IP (loading) is ensuring you have downloaded a supported IP to a location accessible to the SCO management server. Additionally the person performing the registration must be a SCO administrator. 
At this point we have registered the Integration Pack with the Orchestrator Management Server. Two more steps are still necessary before we can use the Integration Pack; see the following recipes for these.

There's more...

Registering the IP is the first part of the process of making the IP activities available to Runbook Designers and Runbook Servers. The next step is the deployment of Integration Packs to the Runbook Designer; see the next recipe for that. Orchestrator Integration Packs are provided not only by Microsoft; third-party companies such as Cisco and NetApp also provide OIPs for their products. Additionally, there is a large community providing Orchestrator Integration Packs. There are several sources for downloading Integration Packs; here are some useful links:

http://www.techguy.at/liste-mit-integration-packs-fuer-system-center-orchestrator/
http://scorch.codeplex.com/
https://www.microsoft.com/en-us/download/details.aspx?id=54098

Deploying the IP to designers and Runbook servers

Registering the Orchestrator Integration Pack is only the first step; you also need to deploy the OIP to your Runbook Designer or Runbook Server.

Getting Ready

You have to follow the steps described in the Registering an SCO Integration Pack recipe before you can start with the next steps to deploy an OIP.

How to do it

In our example we will deploy the Active Directory Integration Pack to our Runbook Designer. Once the IP in scope (the AD IP in our example) has successfully been registered, follow these steps to deploy it to the Runbook Designers and Runbook Servers:

Log in to the SCO Management server and launch Deployment Manager:
Under Orchestrator Management Server, right-click on the Integration Pack in scope and select Deploy IP to Runbook Server or Runbook Designer:
Click on Next on the welcome page, select the IP you would like to deploy (in our example, System Center Integration Pack for Active Directory), and then click on Next.
On the Computer Selection page, type the name of the Runbook Server or Designer in scope and click on Add (repeat for all servers in scope).
On the Installation Options page you have the following three options:
Schedule the Installation: Select this option if you want to schedule the deployment for a specific time. You still have to select one of the next two options.
Stop all running Runbooks before installing the Integration Packs or Hotfixes: This option will, as described, stop all currently running Runbooks in the environment.
Install the Integration Packs or Hotfixes without stopping the running Runbooks: This is the preferred option if you want a controlled deployment without impacting current jobs:
Click on Next after making your installation option selection.
Click on Finish.

The integration pack will be deployed to all selected designers and Runbook servers. You must close all Runbook Designer consoles and re-launch them to see the newly deployed Integration Pack.

How it works…

The process of deploying an integration pack is simple. The prerequisite for successfully deploying the IP is ensuring you have registered a supported IP on the SCO management server. Now we have successfully deployed an Orchestrator Integration Pack. If you have deployed it to a Runbook Designer, make sure you close and reopen the designer to be able to use the activities in this Integration Pack.
Now you are able to use these activities to build your Runbooks; the only thing left to do is to follow the next recipe and configure this Integration Pack. These steps can be used for each individual Integration Pack, and you can also deploy multiple OIPs in a single deployment.

There's more…

You have to deploy an OIP to every Runbook Designer and Runbook Server on which you want to work with its activities. It doesn't matter whether you want to edit a Runbook in the Designer or run a Runbook on a particular Runbook Server; the OIP has to be deployed to both. With the Orchestrator Deployment Manager, this is an easy task.

Initial Integration Pack configuration

This recipe provides the steps required to configure an integration pack for use once it has been successfully deployed to a Runbook Designer.

Getting ready

You must deploy an Orchestrator environment and also deploy the IP you plan to configure to a Runbook Designer before following the steps in this recipe. The authors assume the user account performing the installation has administrative privileges on the server nominated for the SCO Runbook Designer.

How to do it...

Each integration pack serves as an interface to the actions SCO can perform in the target environment. In our example we will be focusing on the Active Directory connector. We will have two accounts under two categories of AD tasks in our scenario:

IP name: Active Directory | Category of actions: Domain Account Management | Account name: SCOAD_ACCMGT
IP name: Active Directory | Category of actions: Domain Administrator Management | Account name: SCOAD_DOMADMIN

The following diagram provides a visual summary and order of the tasks you need to perform to complete this recipe:

Follow these steps to complete the configuration of the Active Directory IP options in the Runbook Designer:

Create or identify an existing account for the IP tasks. In our example we are using two accounts to represent two personas of a typical Active Directory delegation model: SCOAD_ACCMGT is an account with the rights to perform account management tasks only, and SCOAD_DOMADMIN is a domain admin account for elevated tasks in Active Directory.
Launch the Runbook Designer as a SCO administrator, select Options from the menu bar, and select the IP to configure (in our example, Active Directory).
Click on Add, type AD Account Management in the Name: field, and select Microsoft Active Directory Domain Configuration in the Type field.
In the Properties section, type the following:
Configuration User Name: SCOAD_ACCMGT
Configuration Password: Enter the password for SCOAD_ACCMGT
Configuration Domain Controller Name (FQDN): The FQDN of an accessible domain controller in the target AD (in this example, TLDC01.TRUSTLAB.LOCAL)
Configuration Default Parent Container: This is an optional field. Leave it blank:
Click on OK.
Repeat steps 3 and 4 for the Domain Admin account and click on Finish to complete the configuration.

How it works...

The IP configuration is unique for each system environment that SCO interfaces with for the tasks in scope of automation. The Active Directory IP configuration grants SCO the rights to perform the actions specified in the Runbook using the activities of the IP. Typical Active Directory activities include, but are not limited to, creating user and computer accounts, moving user and computer accounts into organizational units, or deleting user and computer accounts. In our example we created two connection account configurations for the following reasons:

Follow the guidance of scoping automation to the rights of the manual processes.
If we use the example of a Runbook for creating user accounts, we do not need domain admin access. A service desk user performing the same action manually would typically be granted only account management rights in AD.
We have more flexibility with delegating management of, and access to, Runbooks. Runbooks with elevated rights through the connection configuration can be separated from Runbooks with lower rights using folder security.

The configuration requires planning and an understanding of its implications before implementing it. Each IP has its own unique options which you must specify before you create Runbooks using that IP. The default IPs that you can download from Microsoft include documentation on the properties you must set.

There's more…

As you have seen in this recipe, we need to configure each additional Integration Pack with a connection string, user, and password. The built-in activities in SCO use the service account's rights to perform their actions, although you can configure a different user for most of the built-in activities.

See also

The official online documentation for Microsoft Integration Packs is updated regularly and should be a point of reference at https://www.microsoft.com/en-us/download/details.aspx?id=54098
The Creating and maintaining a security model for Orchestrator recipe expands further on the delegation model in SCO.

Summary

In this article, we have covered the following: Deploying an Additional Runbook Designer Registering an SCO Integration Pack Deploying an SCO Integration Pack to Runbook Designer and Server Initial Integration Pack Configuration

Resources for Article: Further resources on this subject: Deploying the Orchestrator Appliance [article] Unpacking System Center 2012 Orchestrator [article] Planning a Compliance Program in Microsoft System Center 2012 [article]


Introduction to IoT

Packt
17 Jul 2017
11 min read
In this article by Kallol Bosu Roy Choudhuri, the author of the book Learn Arduino Prototyping in 10 days, we will learn about IoT. (For more resources related to this topic, see here.) As per Gartner, the number of connected devices around the world is going to reach 50 billion by the year 2020. Just imagine the magnitude and scale of the hyper-connectedness that is being forged every moment, as we read through this exciting article. Figure 1: A typical IoT scenario (automobile example) As we can see in the preceding figure, a typical IoT-based scenario is composed of the following fundamental building blocks: IoT edge device IoT cloud platform An IoT device is used to serve as a bridge between existing machines on the ground and an IoT cloud platform. The IoT cloud platform provides a cloud-based infrastructure backbone for data acquisition, data storage, and computing power for data analytics and reporting. The Arduino platform can be effectively used for prototyping IoT devices for almost any IoT solution very rapidly. Building the edge device In this section, we will learn how to use the ESP8266 Wi-Fi chip with the Arduino Uno for connecting to the Internet and posting data to an IoT cloud. There are numerous IoT cloud players in the market today, including Microsoft Azure and Amazon IoT. In this article, we will use the ThingSpeak IoT platform that is very simple to use with the Arduino platform. The following parts will be required for this prototype: 1 Arduino Uno R3 1 USB Cable 1 ESP8266-01 Wi-Fi Transceiver module 1 Breadboard 1 pc. 1K Ohms resistor 1 pc. 2k ohms resistor Some jumper wires Once all the parts have been assembled, follow the breadboard circuit shown in the following figure and build the edge device: Figure 2: ESP8266 with Arduino Uno Wiring The important facts to remember in the preceding setup are: The RXD pin of the ESP8266 chip should receive a 3.3V input signal. We have ensured this by employing the voltage division method. For test purposes, the preceding setup should work fine. However, the ESP8266 chip is demanding when it comes to power (read current) consumption, especially during transmission cycles. Just in case the ESP8266 chip does not respond to the Arduino sketch or AT commands properly, then the power supply may not be enough. Try using a separate battery for the setup. When using a separate battery, remember to use a voltage regulator that steps down the voltage to 3.3 volts before supplying the ESP8266 chip. For prolonged usage, a separate battery based power supply is recommended. Smart retail project inspiration In the previous sections, we looked at the basics of achieving IoT prototyping with the Arduino platform and an IoT cloud platform. With this basic knowledge, you are encouraged to start exploring more advanced IoT scenarios. As a future inspiration, the following smart retail project idea is being provided for you to try by applying the basic principles that you have learned in this article. After all, the goal of this article has been to show you the light and make you self-reliant with the Arduino platform. Imagine a large retail store where products are displayed in hundreds of aisles and thousands of racks. Such large warehouse-type retail layouts are common in some countries, usually with furniture sellers. One of the time-consuming tasks that these retail shops face is to keep the price of their displayed inventory matched with the ever changing competitive market rates. 
For example, the price of a sofa set could be marked at 350 dollars on aisle number 47 rack number 1. Now let's think from a customer’s perspective. Imagine being a potential customer, standing in front of the sofa set we would naturally search for the prices of that sofa set on the Internet. It would not be very surprising to find a similar sofa set that is priced a little lower, maybe at 325 dollars at another store. That is exactly how and when a potential customer would change her mind. The story after this is simple. The customer leaves store A, goes to store B and purchases the sofa set at 325 dollars. In order to address these types of lost sale opportunities, the furniture company management decides to lower the prices of the sofa set to 325 dollars, in order to match the competition. Thereafter, all that needs to be done is for someone to change the price label for the sofa set in aisle number 47 rack number 1, which is a 5–10 minute walk (considering the size of the shop floor) from the shop back office. In a localized store, it is still achievable without further loss of customers. Now, let's appreciate the problem by thinking hyperscale. The furniture seller’s management is located centrally, say in Sweden and they want to dynamically change the product prices for specific price-sensitive products that are displayed across more than 350 stores, in more than 40 countries. The price change should be automatic, near real time, and simultaneous across all company stores. Given the preceding problem statement, it seems a daunting task that could leave hundreds for shop floor staff scampering all day long, just to keep changing price tags for hundreds of products. An elegant solution to this problem lies in the Internet-of-Things. Figure 3: Smart retail with IoT Referring to the preceding figure, it depicts a basic IoT solution for addressing the furniture seller’s problem to matching product prices dynamically on the fly. Remember, since this is an Arduino prototyping article, the stress is on edge device prototyping. However, IoT as a whole; encompasses many areas that include edge devices and cloud platform capabilities. Our focus in this section is to be able to comprehend and build the edge device prototype that supports this smart retail IoT solution. In the preceding solution, the cloud platform takes care of running intelligent program batches to continuously analyze market rates for price-sensitive products. Once a new price has been determined by the cloud job/program, the new product prices are updated in the cloud-hosted database. Immediately after a new price change, all the IoT edge devices attached to the price tag of specific products in the company stores are notified of the new price. We can build a smart LCD panel for displaying product prices. For Internet connectivity, we can reuse the ESP8266 Wi-Fi chip that we learned in this article. Standalone Devices We are already familiar with the basic parts required for building a prototype. The two new aspects to consider when building a standalone project are an independent power source and a project container. Figure 4 - Typical parts of a Standalone Prototype As shown above, typically a standalone device prototype will contain the following parts: The device prototype (Arduino board + Peripherals + all the required Connections) An independent power source A project container/box After the basic prototype has been built the next consideration is to make it operable on its own, like an island. 
This is because in real world situations, you will often have to make a device that is not directly be connected to and powered from a computer. Therefore we will need to understand the various options that are available for powering our device prototypes and also understand when to choose which option. The second aspect to consider is an appropriate physical container to house the various parts of a project. A container is a physical device container will ensure that all parts of a project are nicely packaged in a safe and aesthetic manner. A distance measurement device Let's build an exciting project by combining an Ultrasonic Sensor with a 16x2 LCD character display to build an electronic distance measurement device. We will use one of the most easily available 9-volt batteries for powering this standalone device prototype. For building the distance measurement device, the following parts will be required. Arduino Uno R3 USB connector 1 pc. 9 volt battery 1 full sized bread board 1 HC-SR04 ultrasonic sensor 1 pc. 16x2 LCD character display 1 pc. 10K potentiometer 2 pcs. 220 ohms resistor 1 pc. 150 ohms resistor Some jumper wires Before we start building the device, let's understand what the device will do and the various parts involved in the device. The purpose of the device will be to be able to measure the distance of an object from the device. The following diagram depicts the overview of the device: Figure 5 - A standalone distance measurement device overview First, let's quickly understand each of the components involved in the preceding setup. Then, we will jump into hands-on prototyping and coding. The ultrasonic sensor model used in this example is known as HC-SR04. HC-SR04 is a standard commercially available ultrasonic transceiver. A transceiver is a device that is capable of transmitting as well as receiving signals. The HC-SR04 transceiver transmits ultrasonic signals. Once the signals hit an object/obstacle, the signals echo back to the HC-SR04 transceiver. The HC-SR04 ultrasonic module is shown below for reference. Figure 6 - The HC-SR04 Ultrasonic module The HC-SR-04 has four pins. The usage of the pins is explained below for easy understanding: Vcc: This pin is connected to a 5 volt power supply. Trig: This pin receives digital signals from the attached microcontroller unit in order to send out an ultrasonic burst. Echo: This pin sends the measured time duration proportional to the distance travelled by the ultrasonic burst. Gnd: This pin is connected to the ground terminal. The total time taken for the ultrasonic signals to echo back from an obstacle can be divided by 2 and then based on the speed of sound in air, the distance between the object and the HC-SR04 can be calculated. We will see how to calculate the distance in the sketch for this device prototype. As per the HC-SR04 data sheet, it is a 5-volt tolerant device, operating at 15 mA, and has a measurement range starting from 2 centimeters to a maximum of 4 meters. The HC-SR04 can be directly connected to the Arduino board pins. The 16x2 LCD character display is also a standard commercially available device, it has 16 columns and 2 rows. The LCD is controlled by its 4 data pins/lines. We will also see how to send string outputs to the LCD from the Arduino sketch. The power supply being used in today's example is a standard 9-volt battery plugged in Arduino's DC IN Jack. 
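To make the time-to-distance conversion described above concrete, here is a minimal, hypothetical Arduino fragment. It assumes the Trig and Echo pins are wired to D8 and D7, as in the circuit described shortly; the pin numbers and the conversion constant are illustrative rather than taken from the book's finished sketch.

// Hypothetical fragment: trigger the HC-SR04 and convert the echo time to centimetres.
const int trigPin = 8;   // Trig wired to digital pin 8
const int echoPin = 7;   // Echo wired to digital pin 7

void setup() {
  pinMode(trigPin, OUTPUT);
  pinMode(echoPin, INPUT);
  Serial.begin(9600);
}

void loop() {
  // A short LOW followed by a 10 microsecond HIGH pulse fires one ultrasonic burst.
  digitalWrite(trigPin, LOW);
  delayMicroseconds(2);
  digitalWrite(trigPin, HIGH);
  delayMicroseconds(10);
  digitalWrite(trigPin, LOW);

  // pulseIn() returns the round-trip time of the echo in microseconds.
  long duration = pulseIn(echoPin, HIGH);

  // Halve the round trip and apply the speed of sound (roughly 0.0343 cm per microsecond).
  float distanceCm = (duration / 2.0) * 0.0343;

  Serial.println(distanceCm);
  delay(500);
}

The constant comes from the speed of sound in air, roughly 343 metres per second, and the division by two accounts for the fact that the measured time covers the trip to the obstacle and back.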
Another option is to use six AA-sized or AAA-sized batteries in series and plug them into the VIN pin of the Arduino board.

Distance measurement device circuit

Follow the breadboard diagram shown next to build the distance measurement device. The diagram shown on the next page is quite complex, so take your time as you work through it. All the components (including the Arduino board) in this prototype are powered from the 9-volt battery. Sometimes the LCD procured online might not ship with soldered header pins. In such a case, you will have to solder 16 header pins. It is very important to note that unless the header pins are soldered properly into the LCD board, the LCD screen will not work correctly. This is a challenging prototype to get working in one go. Make sure there are no loose jumper wires. Notice how the positive and negative terminals of the power source are plugged into the VIN and GND pins of the Arduino board respectively. The 10K potentiometer has three legs. If you look straight at the breadboard diagram, the left-hand leg of the potentiometer is connected to the 5V power supply rail of the breadboard. Figure 7 - Typical potentiometers The right-hand leg is connected to the common ground rail of the breadboard. The leg in the middle is the regulated (via the potentiometer's 10K resistance dial) output that controls the LCD's V0/VEE pin. Basically, this pin controls the contrast of the display. You will also have to adjust the 10K potentiometer dial (a simple screwdriver may be used) to around halfway, at 5K, to make the characters visible on the LCD screen. Initially, you may not see anything on the LCD until the potentiometer is adjusted properly. Figure 8 - Distance measurement device prototype When the Trig pin receives a signal (via pin D8 in this example), the sensor sends out ultrasonic waves to the surroundings. As soon as the ultrasonic waves collide with an obstacle, they get reflected. The reflected ultrasonic waves are received by the HC-SR04 sensor. The Echo pin is used to read the output of the ultrasonic sensor (via pin D7 in this example). The output read from the Echo pin is processed by the Arduino sketch to calculate the distance.

Summary

Thus, an IoT edge device serves as a bridge between existing machines on the ground and an IoT cloud platform.

Resources for Article: Further resources on this subject: Introducing IoT with Particle's Photon and Electron [article] IoT and Decision Science [article] IoT Analytics for the Cloud [article]


Massive Graphs on Big Data

Packt
11 Jul 2017
19 min read
In this article by Rajat Mehta, author of the book Big Data Analytics with Java, we will learn about graphs. Graphs theory is one of the most important and interesting concepts of computer science. Graphs have been implemented in real life in a lot of use cases. If you use a GPS on your phone or a GPS device and it shows you a driving direction to a place, behind the scene there is an efficient graph that is working for you to give you the best possible direction. In a social network you are connected to your friends and your friends are connected to other friends and so on. This is a massive graph running in production in all the social networks that you use. You can send messages to your friends or follow them or get followed all in this graph. Social networks or a database storing driving directions all involve massive amounts of data and this is not data that can be stored on a single machine, instead this is distributed across a cluster of thousands of nodes or machines. This massive data is nothing but big data and in this article we will learn how data can be represented in the form of a graph so that we make analysis or deductions on top of these massive graphs. In this article, we will cover: A small refresher on the graphs and its basic concepts A small introduction on graph analytics, its advantages and how Apache Spark fits in Introduction on the GraphFrames library that is used on top of Apache Spark Before we dive deeply into each individual section, let's look at the basic graph concepts in brief. (For more resources related to this topic, see here.) Refresher on graphs In this section, we will cover some of the basic concepts of graphs, and this is supposed to be a refresher section on graphs. This is a basic section, hence if some users already know this information they can skip this section. Graphs are used in many important concepts in our day-to-day lives. Before we dive into the ways of representing a graph, let's look at some of the popular use cases of graphs (though this is not a complete list) Graphs are used heavily in social networks In finding driving directions via GPS In many recommendation engines In fraud detection in many financial companies In search engines and in network traffic flows In biological analysis As you must have noted earlier, graphs are used in many applications that we might be using on a daily basis. Graphs is a form of a data structure in computer science that helps in depicting entities and the connection between them. So, if there are two entities such as Airport A and Airport B and they are connected by a flight that takes, for example, say a few hours then Airport A and Airport B are the two entities and the flight connecting between them that takes those specific hours depict the weightage between them or their connection. In formal terms, these entities are called as vertexes and the relationship between them are called as edges. So, in mathematical terms, graph G = {V, E},that is, graph is a function of vertexes and edges. Let's look at the following diagram for the simple example of a graph: As you can see, the preceding graph is a set of six vertexes and eight edges,as shown next: Vertexes = {A, B, C, D, E, F} Edges = { AB, AF, BC, CD, CF, DF, DE, EF} These vertexes can represent any entities, for example, they can be places with the edges being 'distances' between the places or they could be people in a social network with the edges being the type of relationship, for example, friends or followers. 
Thus, graphs can represent real-world entities like this. The preceding graph is also called a bidirected graph because in this graph the edges go in either direction that is the edge from A to B can be traversed both ways from A to B as well as from B to A. Thus, the edges in the preceding diagrams that is AB can be BA or AF can be FA too. There are other types of graphs called as directed graphs and in these graphs the direction of the edges go in one way only and does not retrace back. A simple example of a directed graph is shown as follows:. As seen in the preceding graph, the edge A to B goes only in one direction as well as the edge B to C. Hence, this is a directed graph. A simple linked list data structure or a tree datastructure are also forms of graph only. In a tree, nodes can have children only and there are no loops, while there is no such rule in a general graph. Representing graphs Visualizing a graph makes it easily comprehensible but depicting it using a program requires two different approaches Adjacency matrix: Representing a graph as a matrix is easy and it has its own advantages and disadvantages. Let's look at the bidirected graph that we showed in the preceding diagram. If you would represent this graph as a matrix, it would like this: The preceding diagram is a simple representation of our graph in matrix form. The concept of matrix representation of graph is simple—if there is an edge to a node we mark the value as 1, else, if the edge is not present, we mark it as 0. As this is a bi-directed graph, it has edges flowing in one direction only. Thus, from the matrix, the rows and columns depict the vertices. There if you look at the vertex A, it has an edge to vertex B and the corresponding matrix value is 1. As you can see, it takes just one step or O[1]to figure out an edge between two nodes. We just need the index (rows and columns) in the matrix and we can extract that value from it. Also, if you would have looked at the matrix closely, you would have seen that most of the entries are zero, hence this is a sparse matrix. Thus, this approach eats a lot of space in computer memory in marking even those elements that do not have an edge to each other, and this is its main disadvantage. Adjacency list: Adjacency list solves the problem of space wastage of adjacency matrix. To solve this problem, it stores the node and its neighbors in a list (linked list) as shown in the following diagram: For maintaining brevity, we have not shown all the vertices but you can make out from the diagram that each vertex is storing its neighbors in a linked list. So when you want to figure out the neighbors of a particular vertex, you can directly iterate over the list. Of course this has the disadvantage of iterating when you have to figure out whether an edge exists between two nodes or not. This approach is also widely used in many algorithms in computer science. We have briefly seen how graphs can be represented, let's now see some important terms that are used heavily on graphs. Common terminology on graphs We will now introduce you to some common terms and concepts in graphs that you can use in your analytics on top of graphs: Vertices:As we mentioned earlier, vertices are the mathematical terms for the nodes in a graph. For analytic purposes, thevertices count shows the number of nodes in the system, for example, the number of people involved in a social graph. Edges: As we mentioned earlier, edges are the connection between vertices and edges can carry weights. 
The number of edges represents the number of relations in a graph. The weight on an edge represents the intensity of the relationship between the nodes involved; for example, in a social network, a 'friend' relationship is stronger than a 'followed' relationship between nodes.

Degrees: This represents the total number of connections flowing into as well as out of a node. For example, in the previous diagram the degree of node F is four. The degree count is useful; for example, in a social network graph a very high degree count can indicate how well a person is connected.

Indegrees: This represents the number of connections flowing into a node. For example, in the previous diagram, for node F the indegree value is three. In a social network graph, this might represent how many people can send messages to this person or node.

Outdegrees: This represents the number of connections flowing out of a node. For example, in the previous diagram, for node F the outdegree value is one. In a social network graph, this might represent how many people this person or node can send messages to.

Common algorithms on graphs

Let's look at some common algorithms that are frequently run on graphs and some of their uses:

Breadth first search: Breadth first search is an algorithm for graph traversal or searching. As the name suggests, the traversal occurs across the breadth of the graph, that is to say, the neighbors of the node from where the traversal starts are searched first before exploring further in the same manner. We will refer to the same graph we used earlier: If we start at vertex A, then according to breadth first search we next visit the neighbors of A, that is, B and F. After that, we visit the neighbor of B, which is C. Next we go to the neighbours of F, which are E and D. We only go through each node once, and this mimics real-life travel as well, as to reach from one point to another we seldom cover the same road or path again. Thus, our breadth first traversal starting from A will be {A, B, F, C, D, E}. Breadth first search is very useful in graph analytics and can tell us things such as the friends that are not your immediate friends but just at the next level after your immediate friends in a social network, or, in the case of a graph of a flight network, it can show flights with just a single stop or two stops to the destination.

Depth first search: This is another way of searching where we start from the source vertex and keep on searching until we reach the end node or the leaf node, and then we backtrack. This approach is often less convenient than breadth first search for nearest-neighbor questions as it can require lots of traversals. So if you want to know whether a node A is connected to a node B, you might end up searching along a lot of wasteful nodes that do not have anything to do with the original nodes A and B before arriving at the appropriate solution.

Dijkstra's shortest path: This is a greedy algorithm to find the shortest path in a graph network. So in a weighted graph, if you need to find the shortest path between two nodes, you can start from the starting node and keep on greedily picking the next node in the path to be the one with the least weight (in the case of weights being distances between nodes, as in city graphs depicting interconnecting cities and roads). So in a road network, you can find the shortest path between two cities using this algorithm.
PageRank algorithm: This is a very popular algorithm that came out from Google and it essentially is used to find the importance of a web page by figuring out how connected it is to other important websites. It gives a page rank score to each of the websites based on this approach and finally the search results are built based on this score. The best part about this algorithm is it can be applied to other areas in life too, for example, in figuring out the important airports in a flight graph, or figuring out the most important people in a social network group. So much for the basics and refresher on graphs, in the next section, we will now see how graphs can be used in real world in massive datasets such as social network data or in data used in the field of biology. We will also study how graph analytics can be used on top of these graphs to derive exclusive deductions. Plotting graphs There is a handy open source Java library called GraphStream, which can be used to plot graphs and this is very useful specially if you want to view the structure of your graphs. While viewing, you can also figure out if some of the vertices are very close to each other (clustered) or in general how they are placed. Using the GraphStream library is easy. Just download the jar from http://graphstream-project.org and put it in the classpath of your project. Next, we will show a simple example demonstrating how easy it is to plot a graph using this library. Just create an instance of a graph. For our example, we will create a simple DefaultGraph and name it SimpleGraph. Next, we will add the nodes or vertices of the graph. We will also add the attribute of the label that is displayed on the vertice. Graph graph = newDefaultGraph("SimpleGraph"); graph.addNode("A" ).setAttribute("ui.label", "A"); graph.addNode("B" ).setAttribute("ui.label", "B"); graph.addNode("C" ).setAttribute("ui.label", "C"); After building the nodes, it's now time to connect these nodes using the edges. The API is simple to use and on the graph instance we can define the edges, provided an ID is given to them and the starting and ending nodes are also given.   graph.addEdge("AB", "A", "B"); graph.addEdge("BC", "B", "C"); graph.addEdge("CA", "C", "A"); All the information of nodes and edges is present on the graph instance. It's now time to plot this graph on the UI and we can just invoke the display method on the graph instance as shown next and display it on the UI. graph.display(); This would plot the graph on the UI as follows: This library is extensive and it will be good learning experience to explore this library further and we would urge the readers to further explore this library on their own. Massive graphs on big data Big data comprises huge amount of data distributed across a cluster of thousands (if not more) of machines. Building graphs based on this massive data has different challenges shown as follows: Due to the vast amount of data involved, the data for the graph is distributed across a cluster of machines. Hence, in actuality, it's not a single node graph and we have to build a graph that spans across a cluster of machines. A graph that spans across a cluster of machines would have vertices and edges spread across different machines and this data in a graph won't fit into the memory of one single machine. Consider your friend's list on Facebook; some of your friend's data in your Facebook friend list graph might lie on different machines and this data might be just tremendous in size. 
Look at an example diagram of a graph of 10 Facebook friends and their network shown as follows: As you can see in the preceding diagram, when for just 10 friends the data can be huge, and here since the graph is drawn by hand we have not even shown a lot of connections to make the image comprehensible, but in real life each person can have say more than thousands of connections. So imagine what will happen to a graph with say thousands if not more people on the list. As shown in the reasons we just saw, building massive graphs on big data is a different ball game altogether and there are few main approaches for building this massive graphs. From the perspective of big data building the massive graphs involve running and storing data parallely on many nodes. The two main approaches are bulk synchronous parallely and the pregel approach. Apache Spark follows the pregel approach. Covering these approaches in detail is out of scope of this book and if the users are interested more on these topics they should refer to other books and the Wikipedia for the same. Graph analytics The biggest advantage to using graphs is you can analyze these graphs and use them for analyzing complex datasets. You might ask what is so special about graph analytics that we can't do by relational databases. Let's try to understand this using an example, suppose we want to analyze your friends network on Facebook and pull information about your friends such as their name, their birth date, their recent likes, and so on. If Facebook had a relational database, then this would mean firing a query on some table using the foreign key of the user requesting this info. From the perspective of relational database, this first level query is easy. But what if we now ask you to go to the friends at level four in your network and fetch their data (as shown in the following diagram). The query to get this becomes more and more complicated from a relational database perspective but this is a trivial task on a graph or graphical database (such as Neo4j). Graphs are extremely good on operations where you want to pull information from one end of the node to another, where the other node lies after a lot of joins and hops. As such, graph analytics is good for certain use cases (but not for all use cases, relation database are still good on many other use cases). As you can see, the preceding diagram depicts a huge social network (though the preceding diagram might just be depicting a network of a few friends only). The dots represent actual people in a social network. So if somebody asks to pick one user on the left-most side of the diagram and see and follow host connections to the right-most side and pull the friends at the say 10th level or more, this is something very difficult to do in a normal relational database and doing it and maintaining it could easily go out of hand. There are four particular use cases where graph analytics is extremely useful and used frequently (though there are plenty more use cases too) Path analytics: As the name suggests, this analytics approach is used to figure out the paths as you traverse along the nodes of a graph. There are many fields where this can be used—simplest being road networks and figuring out details such as shortest path between cities, or in flight analytics to figure out the shortest time taking flight or direct flights. Connectivity analytics: As the name suggests, this approach outlines how the nodes within a graph are connected to each other. 
So using this you can figure out how many edges are flowing into a node and how many are flowing out of the node. This kind of information is very useful in analysis. For example, in a social network if there is a person who receives just one message but gives out say ten messages within his network then this person can be used to market his favorite products as he is very good in responding to messages. Community Analytics: Some graphs on big data are huge. But within these huge graphs there might be nodes that are very close to each other and are almost stacked in a cluster of their own. This is useful information as based on this you can extract out communities from your data. For example, in a social network if there are people who are part of some community, say marathon runners, then they can be clubbed into a single community and further tracked. Centrality Analytics: This kind of analytical approach is useful in finding central nodes in a network or graph. This is useful in figuring out sources that are single handedly connected to many other sources. It is helpful in figuring out influential people in a social network, or a central computer in a computer network. From the perspective of this article, we will be covering some of these use cases in our sample case studies and for this we will be using a library on Apache Spark called GraphFrames. GraphFrames GraphX library is advanced and performs well on massive graphs, but, unfortunately, it's currently only implemented in Scala and does not have any direct Java API. GraphFrames is a relatively new library that is built on top of Apache Spark and provides support for dataframe (now dataset) based graphs.It contains a lot of methods that are direct wrappers over the underlying sparkx methods. As such it provides similar functionality as GraphX except that GraphX acts on the Spark SRDD and GraphFrame works on the dataframe so GraphFrame is more user friendly (as dataframes are simpler to use). All the advantages of firing Spark SQL queries, joining datasets, filtering queries are all supported on this. To understand GraphFrames and representing massive big data graphs, we will take small baby steps first by building some simple programs using GraphFrames before building full-fledged case studies. Summary In this article, we learned about graph analytics. We saw how graphs can be built even on top of massive big datasets. We learned how Apache Spark can be used to build these massive graphs and in the process we learned about the new library GraphFrames that helps us in building these graphs. Resources for Article: Further resources on this subject: Saying Hello to Java EE [article] Object-Oriented JavaScript [article] Introduction to JavaScript [article]

Introduction to HoloLens

Packt
11 Jul 2017
10 min read
In this article, Abhijit Jana, Manish Sharma, and Mallikarjuna Rao, the authors of the book, HoloLens Blueprints, we will be covering the following points to introduce you to using HoloLens for exploratory data analysis. Digital Reality - Under the Hood Holograms in reality Sketching the scenarios 3D Modeling workflow Adding Air Tap on speaker Real-time visualization through HoloLens (For more resources related to this topic, see here.) Digital Reality - Under the Hood Welcome to the world of Digital Reality. The purpose of Digital Reality is to bring immersive experiences, such as taking or transporting you to different world or places, make you interact within those immersive, mix digital experiences with reality, and ultimately open new horizons to make you more productive. Applications of Digital Reality are advancing day by day; some of them are in the field of gaming, education, defense, tourism, aerospace, corporate productivity, enterprise applications, and so on. The spectrum and scenarios of Digital Reality are huge. In order to understand them better, they are broken down into three different categories:  Virtual Reality (VR): It is where you are disconnected from the real world and experience the virtual world. Devices available on the market for VR are Oculus Rift, Google VR, and so on. VR is the common abbreviation of Virtual Reality. Augmented Reality (AR): It is where digital data is overlaid over the real world. Pokémon GO, one of the very famous games, is an example of the this globally. A device available on the market, which falls under this category, is Google Glass. Augmented Reality is abbreviated to AR. Mixed Reality (MR): It spreads across the boundary of the real environment and VR. Using MR, you can have a seamless and immersive integration of the virtual and the real world. Mixed Reality is abbreviated to MR. This topic is mainly focused on developing MR applications using Microsoft HoloLens devices. Although these technologies look similar in the way they are used, and sometimes the difference is confusing to understand, there is a very clear boundary that distinguishes these technologies from each other. As you can see in the following diagram, there is a very clear distinction between AR and VR. However, MR has a spectrum, that overlaps across all three boundaries of real world, AR, and MR. Digital Reality Spectrum The following table describes the differences between the three: Holograms in reality Till now, we have mentioned Hologram several times. It is evident that these are crucial for HoloLens and Holographic apps, but what is a Hologram? Virtual Reality Complete Virtual World User is completely isolated from the Real World Device examples: Oculus Rift and Google VR Augmented Reality Overlays Data over the real world Often used for mobile devices Device example: Google Glass Application example: Pokémon GO Mixed Reality Seamless integration of the real and virtual world Virtual world interacts with Real world Natural interactions Device examples: HoloLens and Meta Holograms are the virtual objects which will be made up with light and sound and blend with the real world to give us an immersive MR experience with both real and virtual worlds. In other words, a Hologram is an object like any other real-world object; the only difference is that it is made up of light rather than matter. The technology behind making holograms is known as Holography. 
The following figure represent two holographic objects placed on the top of a real-size table and gives the experience of placing a real object on a real surface: Holograms objects in real environment Interacting with holograms There are basically five ways that you can interact with holograms and HoloLens. Using your Gaze, Gesture, and Voice and with spatial audio and spatial mapping. Spatial mapping provides a detailed illustration of the real-world surfaces in the environment around HoloLens. This allows developers to understand the digitalized real environments and mix holograms into the world around you. Gaze is the most usual and common one, and we start the interaction with it. At any time, HoloLens would know what you are looking at using Gaze. Based on that, the device can take further decisions on the gesture and voice that should be targeted. Spatial audio is the sound coming out from HoloLens and we use spatial audio to inflate the MR experience beyond the visual. HoloLens Interaction Model Sketching the scenarios The next step after elaborating scenario details is to come up with sketches for this scenario. There is a twofold purpose for sketching; first, it will be input to the next phase of asset development for the 3D Artist, as well as helping to validate requirements from the customer, so there are no surprises at the time of delivery. For sketching, either the designer can take it up on their own and build sketches, or they can take help from the 3D Artist. Let's start with the sketch for the primary view of the scenario, where the user is viewing the HoloLens's hologram: Roam around the hologram to view it from different angles Gaze at different interactive components Sketch for user viewing hologram for the HoloLens Sketching - interaction with speakers While viewing the hologram, a user can gaze at different interactive components. One such component, identified earlier, is the speaker. At the time of gazing at the speaker, it should be highlighted and the user can then Air Tap at it. The Air Tap action should expand the speaker hologram and the user should be able to view the speaker component in detail. Sketch for expanded speakers After the speakers are expanded, the user should be able to visualize the speaker components in detail. Now, if the user Air Taps on the expanded speakers, the application should do the following: Open the textual detail component about the speakers; the user can read the content and learn about the speakers in detail Start voice narration, detailing speaker details The user can also Air Tap on the expanded speaker component, and this action should close the expanded speaker Textual and voice narration for speaker details  As you did sketching for the speakers, apply a similar approach and do sketching for other components, such as lenses, buttons, and so on. 3D Modeling workflow Before jumping to 3D Modeling, let's understand the 3D Modeling workflow across different tools that we are going to use during the course of this topic. The following diagram explains the flow of the 3D Modeling workflow: Flow of 3D Modeling workflow Adding Air Tap on speaker In this project, we will be using the left-side speaker for applying Air Tap on speaker. However, you can apply the same for the right-side speaker as well. Similar to Lenses, we have two objects here which we need to identify from the object explorer. 
Navigate to Left_speaker_geo and left_speaker_details_geo in the Object Hierarchy window
Tag them as leftspeaker and speakerDetails respectively

By default, when you are just viewing the holograms, we will be hiding the speaker details section. This section only becomes visible when we do the Air Tap, and hides again when we Air Tap once more:

Speaker with Box Collider

Add a new script inside the Scripts folder, and name it ShowHideBehaviour. This script will handle the show and hide behaviour of the speakerDetails game object. Use the following script inside the ShowHideBehaviour.cs file. We can reuse this script for any other object we want to show or hide.

public class ShowHideBehaviour : MonoBehaviour
{
    public GameObject showHideObject;
    public bool showhide = false;

    private void Start()
    {
        try
        {
            MeshRenderer render = showHideObject.GetComponentInChildren<MeshRenderer>();
            if (render != null)
            {
                render.enabled = showhide;
            }
        }
        catch (System.Exception)
        {
        }
    }
}

The script finds the MeshRenderer component from the gameObject and enables or disables it based on the showhide property. In this script, showhide is exposed as a public property, so that you can provide the reference of the object from the Unity scene itself.

Attach ShowHideBehaviour.cs as a component of the object tagged speakerDetails. Then drag and drop the object into the showHideObject property section. This just takes the reference of the current speaker details object and will hide the object in the first instance.

Attach show-hide script to the object

By default, showhide is unchecked (set to false), so the details are hidden from view. At this point in time, you must check left_speaker_details_geo back on, as we are now handling its visibility using code.

Now, in the Air Tapped event handler, we can enable the renderer to make the object visible. Add a new script by navigating the context menu Create | C# Script, and name it SpeakerGestureHandler. Open the script file in Visual Studio. Similar to ShowHideBehaviour, by default the SpeakerGestureHandler class inherits from MonoBehaviour. In the next step, implement the IInputClickHandler interface in the SpeakerGestureHandler class. This interface defines the method OnInputClicked() that is invoked on click input. So, whenever you do an Air Tap gesture, this method is invoked.

RaycastHit hit;
bool isTapped = false;

public void OnInputClicked(InputEventData eventData)
{
    hit = GazeManager.Instance.HitInfo;
    if (hit.transform.gameObject != null)
    {
        isTapped = !isTapped;
        var lftSpeaker = GameObject.FindWithTag("leftspeaker");
        var lftSpeakerDetails = GameObject.FindWithTag("speakerDetails");
        MeshRenderer render = lftSpeakerDetails.GetComponentInChildren<MeshRenderer>();
        if (isTapped)
        {
            lftSpeaker.transform.Translate(0.0f, -1.0f * Time.deltaTime, 0.0f);
            render.enabled = true;
        }
        else
        {
            lftSpeaker.transform.Translate(0.0f, 1.0f * Time.deltaTime, 0.0f);
            render.enabled = false;
        }
    }
}

When the speaker is gazed at and tapped, we find the game objects for both the left speaker and the speaker details by their tag names. For the left speaker object, we apply a transformation based on whether it is tapped or not, which works like what we did for the lenses. In the case of the speaker details object, we also take the reference of its MeshRenderer and set its visibility to true or false based on the Air Tap.

Attach the SpeakerGestureHandler class to the leftSpeaker game object.

Air Tap on speaker – see it in action

The Air Tap action for the speaker is also done. Save the scene, and build and run the solution in the emulator once again.
When you can see the cursor on the speaker, perform the Air Tap.

Default View and Air Tapped View

Real-time visualization through HoloLens
We have learned about the data ingress flow, where devices connect with the IoT Hub, and Stream Analytics processes the stream of data and pushes it to storage. Now, in this section, let's discuss how this stored data will be consumed for data visualization within the holographic application.

Solution to consume data through services

Summary
In this article, we used HoloLens to explore Digital Reality - Under the Hood, Holograms in reality, Sketching the scenarios, the 3D Modeling workflow, Adding Air Tap on speaker, and Real-time visualization through HoloLens.

Resources for Article:
Further resources on this subject:
Creating Controllers with Blueprints [article]
Raspberry Pi LED Blueprints [article]
Exploring and Interacting with Materials using Blueprints [article]

HPC Cluster Computing

Packt
10 Jul 2017
8 min read
In this article by Raja Malleswara Rao Pattamsetti, the author of the book Distributed Computing in Java 9, we look at how the processing capabilities required by organizational applications have grown beyond what a single sequential computer configuration can provide, something that can be addressed to an extent by increasing processor capacity and other resource allocation. While this can improve performance for a while, future computational requirements are restricted by how many more powerful processors can be added and by the cost incurred in producing such powerful systems. There is also a need for efficient algorithms and practices to get the best results out of them. A practical and economic substitute for these single high-power computers is to establish multiple lower-powered processors that work collectively and coordinate their processing capabilities. The result is a powerful system: parallel computers that permit the processing activities to be distributed among multiple low-capacity computers, which together obtain the expected result.

In this article, we will cover the following topics:

Era of Computing
Commanding parallel system architectures
MPP: Massively Parallel Processors
SMP: Symmetric Multiprocessors
CC-NUMA: Cache Coherent Nonuniform Memory Access
Distributed systems
Clusters
Java support for High-Performance Computing

(For more resources related to this topic, see here.)

Era of Computing
Technological developments in the computing industry are rapid, helped by advancements in system software and hardware. The hardware advancements are centered around processors and their manufacturing techniques, and high-performance processors are being produced at amazingly low cost. The hardware advancements are further boosted by high-bandwidth and low-latency network systems. Very Large Scale Integration (VLSI) as a concept brought several phenomenal advancements in producing commanding sequential and parallel processing computers. Simultaneously, system software advancements have improved the abilities of operating systems and advanced software programming development techniques. Two commanding computing eras have been observed, namely the Sequential and Parallel Eras of Computing. The following diagram shows the advancements in both eras of computing from the last few decades along with the forecast for the next two decades:

In each of these eras, it is observed that hardware architecture growth is trailed by system software advancements, which means that as more powerful hardware evolved, correspondingly advanced software programs and operating system capabilities doubled in strength. Applications and problem-solving environments are added to this with the advent of parallel computing, mini to microprocessor development, and clustered computers. Let's now review some of the commanding parallel system architectures from the last few years.

Commanding parallel system architectures
Over the last few years, multiple varieties of computing models providing great processing performance have been developed. They are classified depending on how memory is allocated to their processors and how their alignment is designed.
Some of the important parallel systems are as follows:

MPP: Massively Parallel Processors
SMP: Symmetric Multiprocessors
CC-NUMA: Cache Coherent Nonuniform Memory Access
Distributed systems
Clusters

Let's now understand each of these system architectures in a little more detail and review some of their important performance characteristics.

Massively Parallel Processors (MPP)
Massively Parallel Processors (MPP), as the name suggests, are huge parallel processing systems developed in a shared-nothing architecture. Such systems usually contain a large number of processing nodes that are integrated with a high-speed interconnect network switch. A node is nothing but an element of a computer with a diverse combination of hardware components, usually containing a memory unit and more than one processor. Some special-purpose nodes are designed to have backup disks or additional storage capacity. The following diagram represents the massively parallel processors architecture:

Symmetric Multiprocessors (SMP)
Symmetric Multiprocessors (SMP), as the name suggests, contain a limited set of processors, ranging from 2 to 64, and share most of the resources among those processors. One instance of the operating system operates together on all of these connected processors while they commonly share the I/O, memory, and the network bus. A similar set of processors connected and acting together under one operating system is the essential behavior of symmetric multiprocessors. The following diagram depicts the symmetric multiprocessors representation:

Cache Coherent Nonuniform Memory Access (CC-NUMA)
Cache Coherent Nonuniform Memory Access (CC-NUMA) is a special type of multiprocessor system with scalable additional processor capability. In CC-NUMA, the SMP system and the other remote nodes communicate through a remote interconnection link. Each of the remote nodes contains local memory and processors of its own. The following diagram represents the cache coherent nonuniform memory access architecture:

The nature of memory access is nonuniform. Just like symmetric multiprocessors, a CC-NUMA system has a comprehensive view of the entire system memory, and, as the name suggests, it takes nonuniform time to access either close or distant memory locations.

Distributed systems
Distributed systems, as we have been discussing in previous articles, are traditionally individual sets of computers interconnected through an Internet/intranet, each running its own operating system. Any diverse set of computer systems can be part of a distributed system, including combinations of Massively Parallel Processors, Symmetric Multiprocessors, distinct computers, and clusters.

Clusters
A cluster is an assembly of terminals or computers connected through an internetwork to address a large processing requirement through parallel processing. Hence, clusters are usually configured with terminals or computers having higher processing capabilities connected over a high-speed internetwork. Usually, a cluster is considered a single image that integrates a number of nodes with a shared pool of resources.
The following diagram is a sample clustered Tomcat server for a typical J2EE web application deployment:

The following table shows some of the important performance characteristics of the different parallel architecture systems discussed so far:

Network of workstations
A network of workstations is a group of connected resources, such as system processors, network interfaces, storage, and data disks, that opens the space for newer combinations such as:

Parallel computing: As discussed in the previous sections, the group of system processors can be connected as MPP or DSM, which provides parallel processing ability.
RAM network: As a number of systems are connected, each system's memory can collectively work as a DRAM cache, which greatly expands the virtual memory available to the entire system and improves its processing ability.
Software Redundant Array of Inexpensive Disks (RAID): A group of systems connected in an array over the local area network improves the system's stability, availability, and storage capacity with the help of multiple low-cost machines. This also provides simultaneous I/O support.
Multipath communication: Multipath communication is a technique of using more than one network connection between the workstations in the network to allow simultaneous information exchange among system nodes.

Java support for High-Performance Computing
Java provides numerous advantages for HPC (High-Performance Computing) as a programming language, particularly with Grid computing. Some of the key benefits of using Java in HPC include the following (a small illustrative sketch follows this list):

Portability: The capability to write a program on one platform and port it to run on any other operating platform has been the biggest strength of the Java language. This continues to be an advantage when porting Java applications to HPC systems. In Grid computing, this is an essential feature, as the execution environment is decided during the execution of the program. This is possible since Java byte code executes in the Java Virtual Machine (JVM), which itself acts as an abstract operating environment.
Network centricity: As discussed in previous articles, Java provides great support for distributed systems with its network-centric features of remote procedure calls (RPC) through RMI and CORBA services. Along with these, Java's support for socket programming is another great asset for Grid computing.
Software engineering: Another great feature of Java is the way it can be used for engineering solutions. We can produce more object-oriented, loosely coupled software that can be independently added as an API through JAR files in multiple other systems. Encapsulation and interface-based programming are great advantages in such application development.
Security: Java is a highly secure programming language, with features like byte code verification to limit resource utilization by untrusted intruder software or programs. This is a great advantage when running applications in a distributed environment with remote communication established over a shared network.
GUI development: Java's support for platform-independent GUI development is a great advantage when developing and deploying enterprise web applications that interact with an HPC environment.
Availability: Java has great support from multiple operating systems, including Windows, Linux, SGI, and Compaq. It is easily available to consume and develop with, together with a great set of open source frameworks developed on top of it.
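The book's own examples are not part of this excerpt, so the following is only a rough, illustrative sketch (not taken from the book) of the kind of shared-memory parallelism these features enable. It splits a simple summation across the available processor cores using the standard java.util.concurrent utilities; the class name and the workload are arbitrary assumptions chosen for illustration:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSum {

    public static void main(String[] args) throws Exception {
        int processors = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(processors);

        // Some data to work on; in a real HPC job this would come from disk or the network.
        long[] data = new long[1_000_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i;
        }

        // Split the array into one chunk per core and submit each chunk as a task.
        int chunk = data.length / processors;
        List<Future<Long>> partials = new ArrayList<>();
        for (int p = 0; p < processors; p++) {
            final int from = p * chunk;
            final int to = (p == processors - 1) ? data.length : from + chunk;
            partials.add(pool.submit(new Callable<Long>() {   // requires Java 8+ for capturing 'data'
                @Override
                public Long call() {
                    long sum = 0;
                    for (int i = from; i < to; i++) {
                        sum += data[i];
                    }
                    return sum;
                }
            }));
        }

        // Combine the partial results; get() blocks until each task has finished.
        long total = 0;
        for (Future<Long> f : partials) {
            total += f.get();
        }
        pool.shutdown();
        System.out.println("Total: " + total);
    }
}

Scaling the same idea across machines rather than cores is where the RMI, socket, and Grid facilities mentioned in the list come into play.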
While the previous points list the advantages of using Java for HPC environments, some concerns need to be reviewed while designing Java applications, including its numerics (complex numbers, fastfp, multidimensional arrays), performance, and parallel computing designs.

Summary
In this article, you have learned about the eras of computing, commanding parallel system architectures, MPP: Massively Parallel Processors, SMP: Symmetric Multiprocessors, CC-NUMA: Cache Coherent Nonuniform Memory Access, distributed systems, clusters, and Java support for High-Performance Computing.

Resources for Article:
Further resources on this subject:
The Amazon S3 SDK for Java [article]
Gradle with the Java Plugin [article]
Getting Started with Sorting Algorithms in Java [article]

Queues and topics

Packt
10 Jul 2017
8 min read
In this article by Luca Stancapiano, the author of the book Mastering Java EE Development with WildFly, we will see how to implement Java Message Service (JMS) in a queue channel using the WildFly console.

(For more resources related to this topic, see here.)

JMS works with channels of messages that manage the messages asynchronously. These channels contain messages that they will collect or remove according to the configuration and the type of channel. The channels are of two types, queues and topics, and they are highly configurable through the WildFly console. As with all components in WildFly, they can be installed through the console, the command line, or directly with the Maven plugins of the project. In the next two paragraphs we will show what they mean and all the possible configurations.

Queues
Queues collect the sent messages that are waiting to be read. The messages are delivered in the order they are sent, and they are removed from the queue when they are read.

Create the queue from the web console
Let's see the steps to create a new queue through the web console:

Connect to http://localhost:9990/.
Go to Configuration | Subsystems/Messaging - ActiveMQ/default and click on Queues/Topics.
Now select the Queues menu and click on the Add button. You will see this screen:

The parameters to insert are as follows:

Name: The name of the queue.
JNDI Names: The JNDI names the queue will be bound to.
Durable?: Whether the queue is durable or not.
Selector: The queue selector.

As with all enterprise components, JMS components are callable through the Java Naming and Directory Interface (JNDI).

Durable queues keep messages around persistently for any suitable consumer to consume them. Durable queues do not need to concern themselves with which consumer is going to consume the messages at some point in the future. There is just one copy of a message that any consumer in the future can consume.

Message selectors allow filtering of the messages that a MessageConsumer will receive. The filter is a relatively complex language similar to the syntax of an SQL WHERE clause. The selector can use all the message headers and properties for filtering operations, but cannot use the message content. Selectors are mostly useful for channels that broadcast a very large number of messages to their subscribers. On queues, only messages that match the selector will be returned. Others stay in the queue (and thus can be read by a MessageConsumer with a different selector).

The following SQL elements are allowed in our filters and we can put them in the Selector field of the form:

AND, OR, NOT: Logical operators. Example: (releaseYear < 1986) AND NOT (title = 'Bad')
String literals: String literals in single quotes; duplicate the quote to escape it. Example: title = 'Tom''s'
Number literals: Numbers in Java syntax; they can be double or integer. Example: releaseYear = 1982
Properties: Message properties that follow Java identifier naming. Example: releaseYear = 1983
Boolean literals: TRUE and FALSE. Example: isAvailable = FALSE
( ): Round brackets. Example: (releaseYear < 1981) OR (releaseYear > 1990)
BETWEEN: Checks whether a number is in a range (both numbers inclusive). Example: releaseYear BETWEEN 1980 AND 1989
Header fields: Any headers except JMSDestination, JMSExpiration, and JMSReplyTo. Example: JMSPriority = 10
=, <>, <, <=, >, >=: Comparison operators. Example: (releaseYear < 1986) AND (title <> 'Bad')
LIKE: String comparison with wildcards '_' and '%'. Example: title LIKE 'Mirror%'
IN: Finds a value in a set of strings. Example: title IN ('Piece of mind', 'Somewhere in time', 'Powerslave')
IS NULL, IS NOT NULL: Checks whether a value is null or not null. Example: releaseYear IS NULL
*, +, -, /: Arithmetic operators. Example: releaseYear * 2 > 2000 - 18

A consumer-side sketch showing how such a selector can also be applied from code follows.
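This article configures the selector from the admin console; as a complementary, illustrative sketch (not taken from the book), a JMS 2.0 consumer can also pass a selector when it is created. The bean name is an assumption, the queue reuses the GPS queue that the rest of this article creates, and the priority-based filter is just an assumed example:

import javax.annotation.Resource;
import javax.ejb.Stateless;
import javax.inject.Inject;
import javax.jms.JMSConsumer;
import javax.jms.JMSContext;
import javax.jms.Queue;

@Stateless
public class FilteredCoordinatesReader {

    @Inject
    private JMSContext context;

    @Resource(mappedName = "java:/jms/queue/GPS")
    private Queue queue;

    public String readHighPriorityCoordinate() {
        // Only messages whose JMSPriority header is greater than 7 are returned;
        // the others stay in the queue for consumers with different selectors.
        JMSConsumer consumer = context.createConsumer(queue, "JMSPriority > 7");
        try {
            return consumer.receiveBody(String.class, 1000); // wait at most one second
        } finally {
            consumer.close();
        }
    }
}

The selector string itself follows exactly the grammar described in the preceding list, so the same expression could be pasted into the Selector field of the console form instead.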
Fill in the form now:

In this article we will implement a messaging service to send the coordinates of buses. The queue is created and shown in the queues list:

Create the queue using the CLI and the Maven WildFly plugin
The same thing can be done with the Command Line Interface (CLI). Start a WildFly instance, go to the bin directory of WildFly, and execute the following script:

bash-3.2$ ./jboss-cli.sh
You are disconnected at the moment. Type 'connect' to connect to the server or 'help' for the list of supported commands.
[disconnected /] connect
[standalone@localhost:9990 /] /subsystem=messaging-activemq/server=default/jms-queue=gps_coordinates:add(entries=["java:/jms/queue/GPS"])
{"outcome" => "success"}

The same thing can be done through Maven. Simply add this snippet to your pom.xml:

<plugin>
  <groupId>org.wildfly.plugins</groupId>
  <artifactId>wildfly-maven-plugin</artifactId>
  <version>1.0.2.Final</version>
  <executions>
    <execution>
      <id>add-resources</id>
      <phase>install</phase>
      <goals>
        <goal>add-resource</goal>
      </goals>
      <configuration>
        <resources>
          <resource>
            <address>subsystem=messaging-activemq,server=default,jms-queue=gps_coordinates</address>
            <properties>
              <durable>true</durable>
              <entries>!!["gps_coordinates", "java:/jms/queue/GPS"]</entries>
            </properties>
          </resource>
        </resources>
      </configuration>
    </execution>
    <execution>
      <id>del-resources</id>
      <phase>clean</phase>
      <goals>
        <goal>undeploy</goal>
      </goals>
      <configuration>
        <afterDeployment>
          <commands>
            <command>/subsystem=messaging-activemq/server=default/jms-queue=gps_coordinates:remove</command>
          </commands>
        </afterDeployment>
      </configuration>
    </execution>
  </executions>
</plugin>

The Maven WildFly plugin lets you perform admin operations in WildFly using the same custom protocol used by the command line. Two executions are configured:

add-resources: It hooks the install Maven phase and adds the queue, passing the name, JNDI, and durable parameters seen in the previous paragraph.
del-resources: It hooks the clean Maven phase and removes the chosen queue by name.

Create the queue through an Arquillian test case
Alternatively, we can add and remove the queue through an Arquillian test case:

@RunWith(Arquillian.class)
@ServerSetup(MessagingResourcesSetupTask.class)
public class MessageTestCase {
    ...
    private static final String QUEUE_NAME = "gps_coordinates";
    private static final String QUEUE_LOOKUP = "java:/jms/queue/GPS";

    static class MessagingResourcesSetupTask implements ServerSetupTask {
        @Override
        public void setup(ManagementClient managementClient, String containerId) throws Exception {
            getInstance(managementClient.getControllerClient()).createJmsQueue(QUEUE_NAME, QUEUE_LOOKUP);
        }

        @Override
        public void tearDown(ManagementClient managementClient, String containerId) throws Exception {
            getInstance(managementClient.getControllerClient()).removeJmsQueue(QUEUE_NAME);
        }
    }
    ...
}

The Arquillian org.jboss.as.arquillian.api.ServerSetup annotation lets you use an external setup manager to install or remove components inside WildFly. In this case we are installing the queue declared with the two variables QUEUE_NAME and QUEUE_LOOKUP. When the test ends, the tearDown method is started automatically and it will remove the installed queue.

To use Arquillian it's important to add the WildFly test suite dependency to your pom.xml project:

...
<dependencies>
  <dependency>
    <groupId>org.wildfly</groupId>
    <artifactId>wildfly-testsuite-shared</artifactId>
    <version>10.1.0.Final</version>
    <scope>test</scope>
  </dependency>
</dependencies>
...

Going into standalone-full.xml we will find the created queue as:

<subsystem >
  <server name="default">
    ...
    <jms-queue name="gps_coordinates" entries="java:/jms/queue/GPS"/>
    ...
  </server>
</subsystem>

JMS is available in the standalone-full configuration. By default WildFly supports 4 standalone configurations. They can be found in the standalone/configuration directory:

standalone.xml: It supports all components except messaging and corba/iiop
standalone-full.xml: It supports all components
standalone-ha.xml: It supports all components except messaging and corba/iiop, with clustering enabled
standalone-full-ha.xml: It supports all components, with clustering enabled

To start WildFly with the chosen configuration, simply add -c with the configuration to the standalone.sh script. Here is a sample to start the standalone full configuration:

./standalone.sh -c standalone-full.xml

Create the Java client for the queue
Let's now see how to create a client to send a message to the queue. JMS 2.0 greatly simplifies the creation of clients. Here is a sample of a client inside a stateless Enterprise Java Bean (EJB):

@Stateless
public class MessageQueueSender {

    @Inject
    private JMSContext context;

    @Resource(mappedName = "java:/jms/queue/GPS")
    private Queue queue;

    public void sendMessage(String message) {
        context.createProducer().send(queue, message);
    }
}

The javax.jms.JMSContext is injectable from any EE component. We will see the JMS context in detail in the next paragraph, The JMS Context. The queue is represented in JMS by the javax.jms.Queue class. It can be injected as a JNDI resource through the @Resource annotation. The JMS context, through the createProducer method, creates a producer, represented by the javax.jms.JMSProducer class, which is used to send the messages.

We can now create a client injecting the stateless bean and sending the string message hello!:

...
@EJB
private MessageQueueSender messageQueueSender;
...
messageQueueSender.sendMessage("hello!");
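The excerpt only shows the sending side. Purely as an illustrative sketch (the listener class name and the log output are assumptions, not part of the article), a message-driven bean is the usual way to consume such messages asynchronously from the same queue:

import javax.ejb.ActivationConfigProperty;
import javax.ejb.MessageDriven;
import javax.jms.JMSException;
import javax.jms.Message;
import javax.jms.MessageListener;

@MessageDriven(activationConfig = {
    @ActivationConfigProperty(propertyName = "destinationLookup",
                              propertyValue = "java:/jms/queue/GPS"),
    @ActivationConfigProperty(propertyName = "destinationType",
                              propertyValue = "javax.jms.Queue")
})
public class GpsCoordinatesListener implements MessageListener {

    @Override
    public void onMessage(Message message) {
        try {
            // The producer in this article sends plain text, so read the body as a String.
            String coordinates = message.getBody(String.class);
            System.out.println("Received GPS coordinates: " + coordinates);
        } catch (JMSException e) {
            throw new RuntimeException(e);
        }
    }
}

The container creates and pools instances of the bean and calls onMessage for every message that arrives on java:/jms/queue/GPS, so no explicit receive loop is needed.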
Summary
In this article we have seen how to implement Java Message Service in a queue channel using the web console, the Command Line Interface and Maven WildFly plugins, and Arquillian test cases, and how to create Java clients for the queue.

Resources for Article:
Further resources on this subject:
WildFly – the Basics [article]
WebSockets in Wildfly [article]
Creating Java EE Applications [article]

Thread of Execution

Packt
10 Jul 2017
6 min read
In this article by Anton Polukhin Alekseevic, the author of the book Boost C++ Application Development Cookbook - Second Edition, we will look at the multithreading concept.

Multithreading means that multiple threads of execution exist within a single process. Threads may share process resources and also have their own resources. Those threads of execution may run independently on different CPUs, leading to faster and more responsive programs. Let's see how to create a thread of execution.

(For more resources related to this topic, see here.)

Creating a thread of execution
On modern multicore computers, to achieve maximal performance (or just to provide a good user experience), programs usually must use multiple threads of execution. Here is a motivating example in which we need to create and fill a big file in a thread that draws the user interface:

#include <algorithm>
#include <fstream>
#include <iterator>

bool is_first_run();

// Function that executes for a long time.
void fill_file(char fill_char, std::size_t size, const char* filename);

// ...
// Somewhere in thread that draws a user interface:
if (is_first_run()) {
    // This will be executing for a long time during which
    // the user interface freezes.
    fill_file(0, 8 * 1024 * 1024, "save_file.txt");
}

Getting ready
This recipe requires knowledge of boost::bind or std::bind.

How to do it...
Starting a thread of execution was never so easy:

#include <boost/thread.hpp>

// ...
// Somewhere in thread that draws a user interface:
if (is_first_run()) {
    boost::thread(boost::bind(
        &fill_file,
        0,
        8 * 1024 * 1024,
        "save_file.txt"
    )).detach();
}

How it works...
The boost::thread variable accepts a functional object that can be called without parameters (we provided one using boost::bind) and creates a separate thread of execution. That functional object is copied into the constructed thread of execution and run there.

We are using version 4 of Boost.Thread in all recipes (BOOST_THREAD_VERSION is defined to 4). Important differences between Boost.Thread versions are highlighted.

After that, we call the detach() function, which does the following:

The thread of execution is detached from the boost::thread variable but continues its execution
The boost::thread variable starts to hold a Not-A-Thread state

Note that without a call to detach(), the destructor of boost::thread will notice that it still holds an OS thread and will call std::terminate. std::terminate terminates our program without calling destructors, freeing up resources, or doing any other cleanup.

Default constructed threads also have a Not-A-Thread state, and they do not create a separate thread of execution.

There's more...
What if we want to make sure that the file was created and written before doing some other job? In that case we need to join the thread in the following way:

// ...
// Somewhere in thread that draws a user interface:
if (is_first_run()) {
    boost::thread t(boost::bind(
        &fill_file,
        0,
        8 * 1024 * 1024,
        "save_file.txt"
    ));

    // Do some work.
    // ...

    // Waiting for thread to finish.
    t.join();
}

After the thread is joined, the boost::thread variable holds a Not-A-Thread state and its destructor does not call std::terminate.

Remember that the thread must be joined or detached before its destructor is called. Otherwise, your program will terminate!

With BOOST_THREAD_VERSION=2 defined, the destructor of boost::thread calls detach(), which does not lead to std::terminate.
However, doing so breaks compatibility with std::thread, and some day, when your project moves to the C++ standard library threads or when BOOST_THREAD_VERSION=2 is no longer supported, this will give you a lot of surprises. Version 4 of Boost.Thread is more explicit and strict, which is usually preferable in the C++ language.

Beware that std::terminate() is called when any exception that is not of type boost::thread_interrupted leaves the boundary of the functional object that was passed to the boost::thread constructor.

There is a very helpful wrapper that works as a RAII wrapper around the thread and allows you to emulate the BOOST_THREAD_VERSION=2 behavior; it is called boost::scoped_thread<T>, where T can be one of the following classes:

boost::interrupt_and_join_if_joinable: To interrupt and join the thread at destruction
boost::join_if_joinable: To join the thread at destruction
boost::detach: To detach the thread at destruction

Here is a small example:

#include <boost/thread/scoped_thread.hpp>

void some_func();

void example_with_raii() {
    boost::scoped_thread<boost::join_if_joinable> t(
        boost::thread(&some_func)
    );
    // 't' will be joined at scope exit.
}

The boost::thread class was accepted as a part of the C++11 standard and you can find it in the <thread> header in the std:: namespace. There is no big difference between Boost's version 4 and the C++11 standard library version of the thread class. However, boost::thread is available on C++03 compilers, so its usage is more versatile.

There is a very good reason for calling std::terminate instead of joining by default! The C and C++ languages are often used in life-critical software. Such software is controlled by other software, called a watchdog. A watchdog can easily detect that an application has terminated, but cannot always detect deadlocks, or detects them only with bigger delays. For example, for defibrillator software it's safer to terminate than to hang on join() for a few seconds waiting for a watchdog reaction. Keep that in mind while designing such applications.

See also
All the recipes in this chapter use Boost.Thread. You may continue reading to get more information about the library. The official documentation has a full list of the boost::thread methods and remarks about their availability in the C++11 standard library. The official documentation can be found at http://boost.org/libs/thread. The Interrupting a thread recipe will give you an idea of what the boost::interrupt_and_join_if_joinable class does.

Summary
We saw how to create a thread of execution using some easy techniques.

Resources for Article:
Further resources on this subject:
Introducing the Boost C++ Libraries [article]
Boost.Asio C++ Network Programming [article]
Application Development in Visual C++ - The Tetris Application [article]

When do we use R over Python?

Packt
10 Jul 2017
16 min read
In this article by Prabhanjan Tattar, author of the book Practical Data Science Cookbook - Second Edition, we see that Python is an interpreted language (sometimes referred to as a scripting language), much like R. It requires no special IDE or software compilation tools and is therefore as fast as R to develop with and prototype. Like R, it also makes use of C shared objects to improve computational performance. Additionally, Python is a default system tool on Linux, Unix, and Mac OS X machines and is available on Windows. Python comes with batteries included, which means that the standard library is widely inclusive of many modules, from multiprocessing to compression toolsets. Python is a flexible computing powerhouse that can tackle any problem domain. If you find yourself in need of libraries that are outside of the standard library, Python also comes with a package manager (like R) that allows the download and installation of other code bases.

(For more resources related to this topic, see here.)

Python's computational flexibility means that some analytical tasks take more lines of code than their counterpart in R. However, Python does have the tools that allow it to perform the same statistical computing. This leads to an obvious question: when do we use R over Python and vice versa? This article attempts to answer this question by taking an application-oriented approach to statistical analyses.

From books to movies to people to follow on Twitter, recommender systems carve the deluge of information on the Internet into a more personalized flow, thus improving the performance of e-commerce, web, and social applications. It is no great surprise, given the success of Amazon-monetizing recommendations and the Netflix Prize, that any discussion of personalization or data-theoretic prediction would involve a recommender. What is surprising is how simple recommenders are to implement, yet how susceptible they are to the vagaries of sparse data and overfitting.

Consider a non-algorithmic approach to eliciting recommendations; one of the easiest ways to garner a recommendation is to look at the preferences of someone we trust. We are implicitly comparing our preferences to theirs, and the more similarities you share, the more likely you are to discover novel, shared preferences. However, everyone is unique, and our preferences exist across a variety of categories and domains. What if you could leverage the preferences of a great number of people and not just those you trust? In the aggregate, you would be able to see patterns, not just of people like you, but also anti-recommendations: things to stay away from, cautioned by the people not like you. You would, hopefully, also see subtle delineations across the shared preference space of groups of people who share parts of your own unique experience.

Understanding the data
Understanding your data is critical to all data-related work. In this recipe, we acquire and take a first look at the data that we will be using to build our recommendation engine.

Getting ready
To prepare for this recipe, and the rest of the article, download the MovieLens data from the GroupLens website of the University of Minnesota. You can find the data at http://grouplens.org/datasets/movielens/. In this recipe, we will use the smaller MovieLens 100k dataset (4.7 MB in size) in order to load the entire model into memory with ease.
How to do it…
Perform the following steps to better understand the data that we will be working with throughout:

Download the data from http://grouplens.org/datasets/movielens/. The 100K dataset is the one that you want (ml-100k.zip).
Unzip the downloaded data into the directory of your choice.
The two files that we are mainly concerned with are u.data, which contains the user movie ratings, and u.item, which contains movie information and details. To get a sense of each file, use the head command at the command prompt for Mac and Linux or the more command for Windows:

head -n 5 u.item

Note that if you are working on a computer running the Microsoft Windows operating system and not using a virtual machine (not recommended), you do not have access to the head command; instead, use the following command:

more u.item 2 n

The preceding command gives you the following output:

1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
4|Get Shorty (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Get%20Shorty%20(1995)|0|1|0|0|0|1|0|0|1|0|0|0|0|0|0|0|0|0|0
5|Copycat (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Copycat%20(1995)|0|0|0|0|0|0|1|0|1|0|0|0|0|0|0|0|1|0|0

The following command will produce the given output:

head -n 5 u.data

For Windows, you can use the following command:

more u.data 2 n

196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
166 346 1 886397596

How it works…
The two main files that we will be using are as follows:

u.data: This contains the user movie ratings
u.item: This contains the movie information and other details

Both are character-delimited files; u.data, which is the main file, is tab delimited, and u.item is pipe delimited. For u.data, the first column is the user ID, the second column is the movie ID, the third is the star rating, and the last is the timestamp. The u.item file contains much more information, including the ID, title, release date, and even a URL to IMDb. Interestingly, this file also has a Boolean array indicating the genre(s) of each movie, including (in order) action, adventure, animation, children, comedy, crime, documentary, drama, fantasy, film-noir, horror, musical, mystery, romance, sci-fi, thriller, war, and western.

There's more…
Free, web-scale datasets that are appropriate for building recommendation engines are few and far between. As a result, the MovieLens dataset is a very popular choice for such a task, but there are others as well. The well-known Netflix Prize dataset has been pulled down by Netflix. However, there is a dump of all user-contributed content from the Stack Exchange network (including Stack Overflow) available via the Internet Archive (https://archive.org/details/stackexchange). Additionally, there is a book-crossing dataset that contains over a million ratings of about a quarter million different books (http://www2.informatik.uni-freiburg.de/~cziegler/BX/).

Ingesting the movie review data
Recommendation engines require large amounts of training data in order to do a good job, which is why they're often relegated to big data projects.
However, to build a recommendation engine, we must first get the required data into memory and, due to the size of the data, must do so in a memory-safe and efficient way. Luckily, Python has all of the tools to get the job done, and this recipe shows you how.

Getting ready
You will need to have the appropriate MovieLens dataset downloaded, as specified in the preceding recipe. If you skipped the setup, you will need to go back and ensure that you have NumPy correctly installed.

How to do it…
The following steps guide you through the creation of the functions that we will need in order to load the datasets into memory:

Open your favorite Python editor or IDE. There is a lot of code, so it should be far simpler to enter it directly into a text file than into the Read-Eval-Print Loop (REPL).

We create a function to import the movie reviews:

In [1]:
import csv
from datetime import datetime

In [2]:
def load_reviews(path, **kwargs):
    """
    Loads MovieLens reviews
    """
    options = {
        'fieldnames': ('userid', 'movieid', 'rating', 'timestamp'),
        'delimiter': '\t',
    }
    options.update(kwargs)

    parse_date = lambda r, k: datetime.fromtimestamp(float(r[k]))
    parse_int = lambda r, k: int(r[k])

    with open(path, 'rb') as reviews:
        reader = csv.DictReader(reviews, **options)
        for row in reader:
            row['movieid'] = parse_int(row, 'movieid')
            row['userid'] = parse_int(row, 'userid')
            row['rating'] = parse_int(row, 'rating')
            row['timestamp'] = parse_date(row, 'timestamp')
            yield row

We create a helper function to help import the data:

In [3]:
import os

def relative_path(path):
    """
    Returns a path relative from this code file
    """
    dirname = os.path.dirname(os.path.realpath('__file__'))
    path = os.path.join(dirname, path)
    return os.path.normpath(path)

We create another function to load the movie information:

In [4]:
def load_movies(path, **kwargs):
    options = {
        'fieldnames': ('movieid', 'title', 'release', 'video', 'url'),
        'delimiter': '|',
        'restkey': 'genre',
    }
    options.update(kwargs)

    parse_int = lambda r, k: int(r[k])
    parse_date = lambda r, k: datetime.strptime(r[k], '%d-%b-%Y') if r[k] else None

    with open(path, 'rb') as movies:
        reader = csv.DictReader(movies, **options)
        for row in reader:
            row['movieid'] = parse_int(row, 'movieid')
            row['release'] = parse_date(row, 'release')
            row['video'] = parse_date(row, 'video')
            yield row

Finally, we start creating a MovieLens class that will be augmented later:

In [5]:
from collections import defaultdict

In [6]:
class MovieLens(object):
    """
    Data structure to build our recommender model on.
    """

    def __init__(self, udata, uitem):
        """
        Instantiate with a path to u.data and u.item
        """
        self.udata = udata
        self.uitem = uitem
        self.movies = {}
        self.reviews = defaultdict(dict)
        self.load_dataset()

    def load_dataset(self):
        """
        Loads the two datasets into memory, indexed on the ID.
        """
        for movie in load_movies(self.uitem):
            self.movies[movie['movieid']] = movie

        for review in load_reviews(self.udata):
            self.reviews[review['userid']][review['movieid']] = review

Ensure that the functions have been imported into your REPL or the IPython workspace, and type the following, making sure that the path to the data files is appropriate for your system:

In [7]:
data = relative_path('../data/ml-100k/u.data')
item = relative_path('../data/ml-100k/u.item')
model = MovieLens(data, item)

How it works…
The methodology that we use for the two data-loading functions (load_reviews and load_movies) is simple, but it takes care of the details of parsing the data from the disk. We created a function that takes a path to our dataset and then any optional keywords. We know that we have specific ways in which we need to interact with the csv module, so we create default options, passing in the field names of the rows along with the delimiter, which is a tab. The options.update(kwargs) line means that we'll accept whatever users pass to this function.

We then created internal parsing functions using lambda functions in Python. These simple parsers take a row and a key as input and return the converted input. This is an example of using lambdas as internal, reusable code blocks and is a common technique in Python. Finally, we open our file and create a csv.DictReader function with our options. Iterating through the rows in the reader, we parse the fields that we want to be int and datetime, respectively, and then yield the row.

Note that as we are unsure about the actual size of the input file, we are doing this in a memory-safe manner using Python generators. Using yield instead of return ensures that Python creates a generator under the hood and does not load the entire dataset into memory. We'll use each of these methodologies to load the datasets at various times through our computation that uses this dataset. We'll need to know where these files are at all times, which can be a pain, especially in larger code bases; in the There's more… section, we'll discuss a Python pro-tip to alleviate this concern.

Finally, we created a data structure, which is the MovieLens class, with which we can hold our reviews' data. This structure takes the udata and uitem paths, and then it loads the movies and reviews into two Python dictionaries that are indexed by movieid and userid, respectively. To instantiate this object, you will execute something as follows:

In [7]:
data = relative_path('../data/ml-100k/u.data')
item = relative_path('../data/ml-100k/u.item')
model = MovieLens(data, item)

Note that the preceding commands assume that you have your data in a folder called data. We can now load the whole dataset into memory, indexed on the various IDs specified in the dataset.

Did you notice the use of the relative_path function? When dealing with fixtures such as these to build models, the data is often included with the code. When you specify a path in Python, such as data/ml-100k/u.data, it is looked up relative to the current working directory where you ran the script.
To help ease this trouble, you can specify paths that are relative to the code itself:

import os

def relative_path(path):
    """
    Returns a path relative from this code file
    """
    dirname = os.path.dirname(os.path.realpath('__file__'))
    path = os.path.join(dirname, path)
    return os.path.normpath(path)

Keep in mind that this holds the entire data structure in memory; in the case of the 100k dataset, this will require 54.1 MB, which isn't too bad for modern machines. However, we should also keep in mind that we'll generally build recommenders using far more than just 100,000 reviews. This is why we have configured the data structure the way we have, very similar to a database. To grow the system, you will replace the reviews and movies properties with database access functions or properties, which will yield data types expected by our methods.

Finding the highest-scoring movies
If you're looking for a good movie, you'll often want to see the most popular or best-rated movies overall. Initially, we'll take a naïve approach to computing a movie's aggregate rating by averaging the user reviews for each movie. This technique will also demonstrate how to access the data in our MovieLens class.

Getting ready
These recipes are sequential in nature. Thus, you should have completed the previous recipes in the article before starting with this one.

How to do it…
Follow these steps to output numeric scores for all movies in the dataset and compute a top-10 list:

Augment the MovieLens class with a new method to get all reviews for a particular movie:

In [8]:
class MovieLens(object):
    ...
    def reviews_for_movie(self, movieid):
        """
        Yields the reviews for a given movie
        """
        for review in self.reviews.values():
            if movieid in review:
                yield review[movieid]

Then, add an additional method to compute the top 10 movies reviewed by users:

In [9]:
import heapq
from operator import itemgetter

class MovieLens(object):
    ...
    def average_reviews(self):
        """
        Averages the star rating for all movies. Yields a tuple of movieid,
        the average rating, and the number of reviews.
        """
        for movieid in self.movies:
            reviews = list(r['rating'] for r in self.reviews_for_movie(movieid))
            average = sum(reviews) / float(len(reviews))
            yield (movieid, average, len(reviews))

    def top_rated(self, n=10):
        """
        Yields the n top rated movies
        """
        return heapq.nlargest(n, self.average_reviews(), key=itemgetter(1))

Note that the ... notation just below class MovieLens(object): signifies that we will be appending the average_reviews method to the existing MovieLens class.

Now, let's print the top-rated results:

In [10]:
for mid, avg, num in model.top_rated(10):
    title = model.movies[mid]['title']
    print "[%0.3f average rating (%i reviews)] %s" % (avg, num, title)

Executing the preceding commands in your REPL should produce the following output:

Out [10]:
[5.000 average rating (1 reviews)] Entertaining Angels: The Dorothy Day Story (1996)
[5.000 average rating (2 reviews)] Santa with Muscles (1996)
[5.000 average rating (1 reviews)] Great Day in Harlem, A (1994)
[5.000 average rating (1 reviews)] They Made Me a Criminal (1939)
[5.000 average rating (1 reviews)] Aiqingwansui (1994)
[5.000 average rating (1 reviews)] Someone Else's America (1995)
[5.000 average rating (2 reviews)] Saint of Fort Washington, The (1993)
[5.000 average rating (3 reviews)] Prefontaine (1997)
[5.000 average rating (3 reviews)] Star Kid (1997)
[5.000 average rating (1 reviews)] Marlene Dietrich: Shadow and Light (1996)

How it works…
The new reviews_for_movie() method that is added to the MovieLens class iterates through our review dictionary values (which are indexed by the userid parameter), checks whether the movieid value has been reviewed by that user, and then yields that review dictionary. We will need such functionality for the next method.

With the average_reviews() method, we have created another generator function that goes through all of our movies and all of their reviews and yields the movie ID, the average rating, and the number of reviews. The top_rated function uses the heapq module to quickly sort the reviews based on the average.

The heapq data structure, also known as the priority queue algorithm, is the Python implementation of an abstract data structure with interesting and useful properties. Heaps are binary trees that are built so that every parent node has a value that is either less than or equal to any of its child nodes. Thus, the smallest element is the root of the tree, which can be accessed in constant time, which is a very desirable property. With heapq, Python developers have an efficient means to insert new values into an ordered data structure and also return sorted values.

There's more…
Here, we run into our first problem: some of the top-rated movies only have one review (and conversely, so do the worst-rated movies). How do you compare Casablanca, which has a 4.457 average rating (243 reviews), with Santa with Muscles, which has a 5.000 average rating (2 reviews)? We are sure that those two reviewers really liked Santa with Muscles, but the high rating for Casablanca is probably more meaningful because more people liked it. Most recommenders with star ratings will simply output the average rating along with the number of reviewers, allowing the user to determine their quality; however, as data scientists, we can do better in the next recipe.

See also
The heapq documentation available at https://docs.python.org/2/library/heapq.html

We have thus pointed out that companies such as Amazon track purchases and page views to make recommendations, Goodreads and Yelp use 5-star ratings and text reviews, and sites such as Reddit or Stack Overflow use simple up/down voting. You can see that preference can be expressed in the data in different ways, from Boolean flags to voting to ratings. However, these preferences are expressed by attempting to find groups of similarities in preference expressions in which you are leveraging the core assumption of collaborative filtering.
More formally, we understand that two people, Bob and Alice, share a preference for a specific item or widget. If Alice also has a preference for a different item, say, a sprocket, then Bob has a better than random chance of also sharing a preference for the sprocket. We believe that Bob and Alice's taste similarities can be expressed in aggregate via a large number of preferences, and by leveraging the collaborative nature of groups, we can filter the world of products.

Summary
In these recipes we learned various ways of understanding data and finding the highest-scoring reviews using IPython.

Resources for Article:
Further resources on this subject:
The Data Science Venn Diagram [article]
Python Data Science Up and Running [article]
Data Science with R [article]

Android UIs with Custom Views

Packt
10 Jul 2017
11 min read
In this article by Raimon Rafols Montane, the author of the book Building Android UIs with Custom Views, we will see the very basic steps we'll need to get started building Android custom views, where we should use them, and where we should simply rely on the standard Android widgets.

(For more resources related to this topic, see here.)

Why do we need custom views
There are lovely Android applications on Google Play and in other markets, such as Amazon, built only using the standard Android UI widgets and layouts. There are also many other applications that have that small additional feature that makes our interaction with them easier or simply more pleasing. There is no magic formula, but maybe by just adding something different, something that makes the user feel "hey, it's not just another app", we might increase our user retention. It might not be the deal breaker, but it can definitely make the difference sometimes.

A few years ago, the author was working in the pre-sales department of a mobile app development agency, and when the app Path was launched to the market, we got many, many requests from our customers to build menus like the Path app menu. It wasn't a critical feature for the applications they were building, but at that point it was a cool thing and many other applications and products wanted to have something similar. Even if it was a simple detail, it made the Path application special at that time. It wasn't the case at that time, but nowadays there are many implementations of that menu published on GitHub as open source, although they're not really maintained anymore:

https://github.com/daCapricorn/ArcMenu
https://github.com/oguzbilgener/CircularFloatingActionMenu

One of the main reasons to create our own custom views for our mobile application is, precisely, to have something special. It might be a menu, a component, a screen, something that might be really needed or even the main functionality of our application, or just an additional feature. In addition, by creating our own custom view we can actually optimize the performance of our application. We can create a specific way of laying out widgets that would otherwise need many hierarchy layers when using only standard Android layouts, or a custom view that simplifies rendering or user interaction.

On the other hand, we can easily fall into the error of trying to build everything custom. Android provides an awesome set of widget and layout components that manage a lot of things for us. If we ignore the basic Android framework and try to build everything by ourselves, it would be a lot of work. We would potentially struggle with a lot of issues and errors that the Android OS developers already faced, or at least very similar ones; to put it in one sentence, we would be reinventing the wheel.

Examples in the market
We all probably use great apps that are built only using the standard Android UI widgets and layouts, but there are many others that have some custom views that we don't know about or haven't really noticed. The custom views or layouts can sometimes be very subtle and hard to spot. We would not be the first to have a custom view or layout in our application. In fact, many popular apps have some custom elements in them. For example, the Etsy application had a custom layout called StaggeredGridView. It was even published as open source on GitHub. It has been deprecated since 2015 in favor of Google's own StaggeredGridLayoutManager used together with RecyclerView.
For more information refer to:

https://github.com/etsy/AndroidStaggeredGrid
https://developer.android.com/reference/android/support/v7/widget/StaggeredGridLayoutManager.html

Parameterizing our custom view
We have a custom view that adapts to multiple sizes now; that's good, but what happens if we need another custom view that paints the background blue instead of red? And yellow? We shouldn't have to copy the custom view class for each customization. Luckily, we can set parameters on the XML layout and read them from our custom view.

First, we need to define the type of parameters we will use on our custom view. We have to create a file called attrs.xml:

<?xml version="1.0" encoding="utf-8"?>
<resources>
    <declare-styleable name="OwnCustomView">
        <attr name="fillColor" format="color"/>
    </declare-styleable>
</resources>

Then, we have to add a new namespace to the layout file where we want to use the new parameter:

<?xml version="1.0" encoding="utf-8"?>
<ScrollView
    xmlns:android="http://schemas.android.com/apk/res/android"
    xmlns:app="http://schemas.android.com/apk/res-auto"
    android:orientation="vertical"
    android:layout_width="match_parent"
    android:layout_height="match_parent">

    <LinearLayout
        android:layout_width="match_parent"
        android:layout_height="wrap_content"
        android:orientation="vertical"
        android:padding="@dimen/activity_vertical_margin">

        <com.packt.rrafols.customview.OwnCustomView
            android:layout_width="match_parent"
            android:layout_height="wrap_content"
            app:fillColor="@android:color/holo_blue_dark"/>
    </LinearLayout>
</ScrollView>

Now that we have this defined, let's see how we can read it from our custom view class:

int fillColor;
TypedArray ta = context.getTheme().obtainStyledAttributes(attributeSet, R.styleable.OwnCustomView, 0, 0);
try {
    fillColor = ta.getColor(R.styleable.OwnCustomView_fillColor, DEFAULT_FILL_COLOR);
} finally {
    ta.recycle();
}

By getting a TypedArray using the styled attribute ID that the Android tools created for us after saving the attrs.xml file, we'll be able to query for the values of those parameters set in the XML layout file.

More information about how to obtain a TypedArray can be found in the Android documentation: https://developer.android.com/reference/android/content/res/Resources.Theme.html#obtainStyledAttributes(android.util.AttributeSet, int[], int, int). For more information about TypedArray refer to: https://developer.android.com/reference/android/content/res/TypedArray.html.

In this example, we created an attribute named fillColor, which will be formatted as a color. This format, or basically the type of the attribute, is very important both to limit the kind of values we can set and to determine how these values can be retrieved afterwards from our custom view. Also, for each parameter we define, we'll get an R.styleable.<name>_<parameter_name> index in the TypedArray. In the code above, we're querying for the fillColor using the R.styleable.OwnCustomView_fillColor index. We shouldn't forget to recycle the TypedArray after using it so it can be reused, but once recycled, we can't use it again.
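The attribute only covers values set from XML. As a rough sketch of how the same color could also be changed at runtime (this class is a simplified stand-in, not code from the book, and the names FillableView and setFillColor are assumptions), the important point is that any change to the Paint must be followed by invalidate() so that onDraw() runs again with the new value:

import android.content.Context;
import android.graphics.Canvas;
import android.graphics.Paint;
import android.util.AttributeSet;
import android.view.View;

// Hypothetical minimal view; the attrs.xml handling shown above is omitted for brevity.
public class FillableView extends View {

    private final Paint backgroundPaint = new Paint();

    public FillableView(Context context, AttributeSet attrs) {
        super(context, attrs);
        backgroundPaint.setStyle(Paint.Style.FILL);
        backgroundPaint.setColor(0xffff0000); // default: red
    }

    // Changing the color at runtime: update the Paint and request a redraw.
    public void setFillColor(int color) {
        backgroundPaint.setColor(color);
        invalidate();
    }

    @Override
    protected void onDraw(Canvas canvas) {
        canvas.drawRect(0, 0, getWidth(), getHeight(), backgroundPaint);
    }
}

From an activity or fragment, calling something like fillableView.setFillColor(Color.YELLOW) would then repaint the view without touching the layout XML.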
Many custom views will only need to draw something in a special way; that's the reason we created them as custom views. Many others, however, will need to react to user events. For example, how will our custom view behave when the user clicks or drags on top of it?

Basic event handling

Let's start by adding some basic event handling to our custom views.

Reacting to touches

One of the first things we'd like to implement is reacting to touch events. Android provides us with the onTouchEvent() method that we can override in our custom view. By overriding this method, we'll get any touch event happening on top of it. To see how it works, let's add it to the custom view:

@Override
public boolean onTouchEvent(MotionEvent event) {
    Log.d(TAG, "touch: " + event);
    return super.onTouchEvent(event);
}

Let's also add a log call to see what events we receive. If we run this code and touch on top of our view, we'll get the following:

D/com.packt.rrafols.customview.CircularActivityIndicator: touch: MotionEvent { action=ACTION_DOWN, actionButton=0, id[0]=0, x[0]=644.3645, y[0]=596.55804, toolType[0]=TOOL_TYPE_FINGER, buttonState=0, metaState=0, flags=0x0, edgeFlags=0x0, pointerCount=1, historySize=0, eventTime=30656461, downTime=30656461, deviceId=9, source=0x1002 }

As we can see, there is a lot of information in the event: the coordinates, the action type, the time. But even if we perform more actions on the view, we'll only get ACTION_DOWN events. That's because the default implementation of View is not clickable. If we don't set the clickable flag on the view, onTouchEvent() will return false and ignore further events.

The onTouchEvent() method has to return true if the event has been processed, or false if it hasn't. If we receive an event in our custom view and we don't know what to do with it, or we're not interested in that kind of event, we should return false so it can be processed by our view's parent, by any other component, or by the system.

To receive more types of events, we can do two things:

Set the view as clickable by using setClickable(true)
Implement our own logic and process the events ourselves in our custom class

As we'll implement more complex event handling, we'll go for the second option. Let's do a quick test: change the method to simply return true instead of calling the parent method:

@Override
public boolean onTouchEvent(MotionEvent event) {
    Log.d(TAG, "touch: " + event);
    return true;
}

Now we should receive many other types of events, as follows:

...CircularActivityIndicator: touch: MotionEvent { action=ACTION_DOWN,
...CircularActivityIndicator: touch: MotionEvent { action=ACTION_UP,
...CircularActivityIndicator: touch: MotionEvent { action=ACTION_DOWN,
...CircularActivityIndicator: touch: MotionEvent { action=ACTION_MOVE,
...CircularActivityIndicator: touch: MotionEvent { action=ACTION_MOVE,
...CircularActivityIndicator: touch: MotionEvent { action=ACTION_MOVE,
...CircularActivityIndicator: touch: MotionEvent { action=ACTION_UP,
...CircularActivityIndicator: touch: MotionEvent { action=ACTION_DOWN,

Using the Paint class

We've been drawing some primitives until now, but Canvas provides us with many more primitive rendering methods. We'll briefly cover some of them, but before that, let's first talk about the Paint class, as we haven't introduced it properly. According to the official definition, the Paint class holds the style and color information about how to draw primitives, text, and bitmaps.

If we check the examples we've been building, we created a Paint object in our class constructor, or in the onCreate method, and we used it later to draw primitives in our onDraw method. For instance, as we set the style of our background Paint instance to Paint.Style.FILL, it will fill the primitive; we can change it to Paint.Style.STROKE if we only want to draw the border or the strokes of the silhouette, or draw both using Paint.Style.FILL_AND_STROKE.
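To see the difference between the fill and stroke styles at a glance, here is a small sketch; it assumes we are inside the onDraw() method of a custom view and that fillPaint and strokePaint are two Paint objects created beforehand (these names are used for illustration only and are not part of the example project):

// The same rectangle drawn twice: once filled, once as an outline only.
fillPaint.setStyle(Paint.Style.FILL);
fillPaint.setColor(Color.RED);
canvas.drawRect(50, 50, 250, 150, fillPaint);

strokePaint.setStyle(Paint.Style.STROKE);
strokePaint.setStrokeWidth(8f);
strokePaint.setColor(Color.BLACK);
// Only the 8 pixel border of this rectangle is painted.
canvas.drawRect(300, 50, 500, 150, strokePaint);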
To see Paint.Style.STROKE in action, we'll draw a black border on top of the selected colored bar in our custom view. Let's start by defining a new Paint object called indicatorBorderPaint and initializing it in our class constructor:

indicatorBorderPaint = new Paint();
indicatorBorderPaint.setAntiAlias(false);
indicatorBorderPaint.setColor(BLACK_COLOR);
indicatorBorderPaint.setStyle(Paint.Style.STROKE);
indicatorBorderPaint.setStrokeWidth(BORDER_SIZE);
indicatorBorderPaint.setStrokeCap(Paint.Cap.BUTT);

We also defined a constant with the size of the border line and set the stroke width to this size. If we set the width to 0, Android guarantees it will use a single pixel to draw the line. As we want to draw a thick black border, this is not our case right now.

In addition, we set the stroke cap to Paint.Cap.BUTT to avoid the stroke overflowing its path. There are two more caps we can use, Paint.Cap.ROUND and Paint.Cap.SQUARE. These will end the stroke with, respectively, a circle that rounds the stroke, or a square.

Let's see the differences between the three caps and also introduce the drawLine primitive. First of all, we create an array with all three caps, so we can easily iterate over them and write more compact code:

private Paint.Cap[] CAPS = new Paint.Cap[] {
    Paint.Cap.BUTT,
    Paint.Cap.ROUND,
    Paint.Cap.SQUARE
};

Now, in our onDraw method, let's draw a line using each of the caps with the drawLine(float startX, float startY, float stopX, float stopY, Paint paint) method:

int xPos = (getWidth() - 100) / 2;
int yPos = getHeight() / 2 - BORDER_SIZE * CAPS.length / 2;
for (int i = 0; i < CAPS.length; i++) {
    indicatorBorderPaint.setStrokeCap(CAPS[i]);
    canvas.drawLine(xPos, yPos, xPos + 100, yPos, indicatorBorderPaint);

    yPos += BORDER_SIZE * 2;
}
indicatorBorderPaint.setStrokeCap(Paint.Cap.BUTT);

Summary

In this article, we have seen the reasoning behind why we might want to build custom views and layouts. Android provides a great basic framework for creating UIs, and not using it would be a mistake. Not every component, button, or widget has to be developed completely custom but, by doing so in the right spot, we can add an extra feature that might make our application memorable. We have also seen the basics of event handling.

Resources for Article:

Further resources on this subject:

Offloading work from the UI Thread on Android [article]
Getting started with Android Development [article]
Practical How-To Recipes for Android [article]

article-image-spark-streaming
Packt
06 Jul 2017
11 min read
Save for later

Spark Streaming

Packt
06 Jul 2017
11 min read
In this article by Romeo Kienzler, the author of the book Mastering Apache Spark 2.x - Second Edition, we will see that Spark Streaming is a stream processing-based module within Apache Spark. It uses the Spark cluster to offer the ability to scale to a high degree. Being based on Spark, it is also highly fault tolerant, having the ability to rerun failed tasks by checkpointing the data stream that is being processed.

The following areas will be covered in this article after an initial section, which will provide a practical overview of how Apache Spark processes stream-based data:

Error recovery and checkpointing
TCP-based stream processing
File streams
Kafka stream source

For each topic, we will provide a worked example in Scala, and will show how the stream-based architecture can be set up and tested. (For more resources related to this topic, see here.)

Overview

The following diagram shows potential data sources for Apache Spark Streaming, such as Kafka, Flume, and HDFS:

These feed into the Spark Streaming module and are processed as Discrete Streams. The diagram also shows that other Spark module functionality, such as machine learning, can be used to process the stream-based data. The fully processed data can then be an output for HDFS, databases, or dashboards. This diagram is based on the one at the Spark Streaming website, but we wanted to extend it to express the Spark module functionality:

When discussing Spark Discrete Streams, the previous figure, again taken from the Spark website at http://spark.apache.org/, is the diagram we like to use.

The green boxes in the previous figure show the continuous data stream sent to Spark being broken down into a Discrete Stream (DStream). The size of each element in the stream is then based on a batch time, which might be two seconds. It is also possible to create a window, expressed as the previous red box, over the DStream. For instance, when carrying out trend analysis in real time, it might be necessary to determine the top ten Twitter-based hashtags over a ten-minute window.

So, given that Spark can be used for stream processing, how is a stream created? The following Scala-based code shows how a Twitter stream can be created. This example is simplified because Twitter authorization has not been included, but you get the idea. The Spark Streaming Context (SSC) is created using the Spark context sc. A batch time is specified when it is created; in this case, 5 seconds. A Twitter-based DStream, called stream, is then created from the StreamingContext using a window of 60 seconds:

val ssc = new StreamingContext(sc, Seconds(5))
val stream = TwitterUtils.createStream(ssc, None).window(Seconds(60))

The stream processing can be started with the stream context start method (shown next), and the awaitTermination method indicates that it should process until stopped. So, if this code is embedded in a library-based application, it will run until the session is terminated, perhaps with a Ctrl + C:

ssc.start()
ssc.awaitTermination()
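Before calling start, we would normally attach some processing to the stream; the hashtag trend mentioned earlier is a natural candidate. The following is only a hedged sketch of what that could look like, assuming the twitter4j Status objects returned by TwitterUtils expose the tweet text through getText, and counting over the 60-second window just defined rather than ten minutes:

// Keep only the hashtags from each tweet in the current window
val hashTags = stream.flatMap(status => status.getText.split(" "))
                     .filter(word => word.startsWith("#"))

// Count each hashtag within the window
val counts = hashTags.map(tag => (tag, 1)).reduceByKey(_ + _)

// For each batch, sort by count and print the current top ten
counts.foreachRDD { rdd =>
  rdd.map(_.swap).sortByKey(ascending = false).take(10).foreach(println)
}

This is the same map, reduce, swap, and sort pattern that the TCP-based word count example later in this article uses.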
This explains what Spark Streaming is and what it does, but it does not explain error handling, or what to do if your stream-based application fails. The next section will examine Spark Streaming error management and recovery.

Errors and recovery

Generally, the question that needs to be asked for your application is: is it critical that you receive and process all the data? If not, then on failure you might just be able to restart the application and discard the missing or lost data. If this is not the case, then you will need to use checkpointing, which will be described in the next section.

It is also worth noting that your application's error management should be robust and self-sufficient. What we mean by this is that if an exception is non-critical, then manage the exception, perhaps log it, and continue processing. For instance, when a task reaches the maximum number of failures (specified by spark.task.maxFailures), it will terminate processing.

Checkpointing

It is possible to set up an HDFS-based checkpoint directory to store Apache Spark-based streaming information. In this Scala example, data will be stored in HDFS under /data/spark/checkpoint. The following HDFS filesystem ls command shows that, before starting, the directory does not exist:

[hadoop@hc2nn stream]$ hdfs dfs -ls /data/spark/checkpoint
ls: `/data/spark/checkpoint': No such file or directory

The Twitter-based Scala code sample given next starts by defining a package name for the application, and by importing Spark, the streaming context, and Twitter-based functionality. It then defines an application object named stream1:

package nz.co.semtechsolutions

import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.streaming._
import org.apache.spark.streaming.twitter._
import org.apache.spark.streaming.StreamingContext._

object stream1 {

Next, a method called createContext is defined, which will be used to create both the Spark and streaming contexts. It will also checkpoint the stream to the HDFS-based directory using the streaming context checkpoint method, which takes a directory path as a parameter. The directory path is the value (cpDir) that was passed into the createContext method:

  def createContext(cpDir : String) : StreamingContext = {

    val appName = "Stream example 1"
    val conf = new SparkConf()
    conf.setAppName(appName)

    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(5))

    ssc.checkpoint(cpDir)

    ssc
  }

Now, the main method is defined, as is the HDFS directory, as well as the Twitter access authority and parameters. The Spark streaming context ssc is either retrieved or created using the HDFS checkpoint directory via the StreamingContext method getOrCreate. If the directory doesn't exist, then the previous method called createContext is called, which will create the context and checkpoint. Obviously, we have truncated our own Twitter auth keys in this example for security reasons:

  def main(args: Array[String]) {

    val hdfsDir = "/data/spark/checkpoint"

    val consumerKey = "QQpxx"
    val consumerSecret = "0HFzxx"
    val accessToken = "323xx"
    val accessTokenSecret = "IlQxx"

    System.setProperty("twitter4j.oauth.consumerKey", consumerKey)
    System.setProperty("twitter4j.oauth.consumerSecret", consumerSecret)
    System.setProperty("twitter4j.oauth.accessToken", accessToken)
    System.setProperty("twitter4j.oauth.accessTokenSecret", accessTokenSecret)

    val ssc = StreamingContext.getOrCreate(hdfsDir,
      () => { createContext(hdfsDir) })

    val stream = TwitterUtils.createStream(ssc, None).window(Seconds(60))

    // do some processing

    ssc.start()
    ssc.awaitTermination()

  } // end main

Having run this code, which has no actual processing, the HDFS checkpoint directory can be checked again.
This time it is apparent that the checkpoint directory has been created and the data has been stored:

[hadoop@hc2nn stream]$ hdfs dfs -ls /data/spark/checkpoint
Found 1 items
drwxr-xr-x - hadoop supergroup 0 2015-07-02 13:41 /data/spark/checkpoint/0fc3d94e-6f53-40fb-910d-1eef044b12e9

This example, taken from the Apache Spark website, shows how checkpoint storage can be set up and used. But how often is checkpointing carried out? The metadata is stored during each stream batch. The actual data is stored with a period which is the maximum of the batch interval or ten seconds. This might not be ideal for you, so you can reset the value using the following method:

DStream.checkpoint(newRequiredInterval)

Here, newRequiredInterval is the new checkpoint interval value that you require; generally, you should aim for a value that is five to ten times your batch interval.

Checkpointing saves both the stream batch and the metadata (data about the data). If the application fails, then when it restarts, the checkpointed data is used when processing is started. The batch data that was being processed at the time of failure is reprocessed, along with the batched data since the failure. Remember to monitor the HDFS disk space being used for checkpointing.

In the next section, we will begin to examine the streaming sources, and will provide some examples of each type.

Streaming sources

We will not be able to cover all the stream types with practical examples in this section, but where this article is too small to include code, we will at least provide a description. In this article, we will cover the TCP and file streams, and the Flume, Kafka, and Twitter streams. We will start with a practical TCP-based example.

This article examines stream processing architecture. For instance, what happens in cases where the stream data delivery rate exceeds the potential data processing rate? Systems like Kafka provide the possibility of solving this issue by providing the ability to use multiple data topics and consumers.

TCP stream

It is possible to use the Spark Streaming Context method called socketTextStream to stream data via TCP/IP by specifying a hostname and a port number. The Scala-based code example in this section will receive data on port 10777 that was supplied using the Netcat Linux command.

The code sample starts by defining the package name and importing Spark, the context, and the streaming classes. The object class named stream2 is defined, as is the main method with arguments:

package nz.co.semtechsolutions

import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._

object stream2 {

  def main(args: Array[String]) {

The number of arguments passed to the class is checked to ensure that it is the hostname and the port number. A Spark configuration object is created with an application name defined. The Spark and streaming contexts are then created. Then, a streaming batch time of 10 seconds is set:

    if (args.length < 2) {
      System.err.println("Usage: stream2 <host> <port>")
      System.exit(1)
    }

    val hostname = args(0).trim
    val portnum = args(1).toInt

    val appName = "Stream example 2"
    val conf = new SparkConf()
    conf.setAppName(appName)

    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(10))

A DStream called rawDstream is created by calling the socketTextStream method of the streaming context using the hostname and port number parameters:
    val rawDstream = ssc.socketTextStream(hostname, portnum)

A top-ten word count is created from the raw stream data by splitting words on spaces. Then a (key, value) pair is created as (word, 1), which is reduced by the key value, this being the word. So now there is a list of words and their associated counts. Next, the key and value are swapped, so the list becomes (count, word). Then, a sort is done on the key, which is now the count. Finally, the top ten items in the RDD, within the DStream, are taken and printed out:

    val wordCount = rawDstream
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .map(item => item.swap)
      .transform(rdd => rdd.sortByKey(false))
      .foreachRDD(rdd => {
        rdd.take(10).foreach(x => println("List : " + x))
      })

The code closes with the Spark Streaming start and awaitTermination methods being called to start the stream processing and await process termination:

    ssc.start()
    ssc.awaitTermination()

  } // end main

} // end stream2

The data for this application is provided, as we stated previously, by the Linux Netcat (nc) command. The Linux cat command dumps the contents of a log file, which is piped to nc. The -lk options force Netcat to listen for connections and keep on listening if the connection is lost. This example shows that the port being used is 10777:

[root@hc2nn log]# pwd
/var/log
[root@hc2nn log]# cat ./anaconda.storage.log | nc -lk 10777

The output from this TCP-based stream processing is shown here. The actual output is not as important as the method demonstrated. However, the data shows, as expected, a list of ten log file words in descending count order. Note that the top word is empty because the stream was not filtered for empty words:

List : (17104,)
List : (2333,=)
List : (1656,:)
List : (1603,;)
List : (1557,DEBUG)
List : (564,True)
List : (495,False)
List : (411,None)
List : (356,at)
List : (335,object)

This is interesting if you want to stream data using Apache Spark Streaming, based upon TCP/IP from a host and port. But what about more exotic methods? What if you wish to stream data from a messaging system, or via memory-based channels? What if you want to use some of the big data tools available today, like Flume and Kafka? The next sections will examine these options, but first I will demonstrate how streams can be based upon files.

Summary

We could have provided streaming examples for systems like Kinesis, as well as queuing systems, but there was not room in this article. This article has provided practical examples of data recovery via checkpointing in Spark Streaming. It has also touched on the performance limitations of checkpointing and shown that the checkpointing interval should be set at five to ten times the Spark stream batch interval.

Resources for Article:

Further resources on this subject:

Understanding Spark RDD [article]
Spark for Beginners [article]
Setting up Spark [article]
article-image-ruby-strings
Packt
06 Jul 2017
9 min read
Save for later

Ruby Strings

Packt
06 Jul 2017
9 min read
In this article by Jordan Hudgens, the author of the book Comprehensive Ruby Programming, you'll learn about the Ruby String data type and walk through how to integrate string data into a Ruby program. Working with words, sentences, and paragraphs is a common requirement in many applications. Additionally, you will learn how to:

Employ string manipulation techniques using core Ruby methods
Demonstrate how to work with the string data type in Ruby

(For more resources related to this topic, see here.)

Using strings in Ruby

A string is a data type in Ruby and contains a set of characters, typically normal English text (or whatever natural language you're building your program for), that you would write. A key point for the syntax of strings is that they have to be enclosed in single or double quotes if you want to use them in a program. The program will throw an error if they are not wrapped inside quotation marks. Let's walk through three scenarios.

Missing quotation marks

In this code I tried to simply declare a string without wrapping it in quotation marks. As you can see, this results in an error. This error occurs because Ruby thinks that the values are classes and methods.

Printing strings

In this code snippet we're printing out a string that we have properly wrapped in quotation marks. Please note that both single and double quotation marks work properly. It's also important that you do not mix the quotation mark types. For example, if you attempted to run the code:

puts "Name an animal'

You would get an error, because you need to ensure that every quotation mark is matched with a closing (and matching) quotation mark. If you start a string with double quotation marks, the Ruby parser requires that you end the string with the matching double quotation marks.

Storing strings in variables

Lastly, in this code snippet we're storing a string inside of a variable and then printing the value out to the console. We'll talk more about strings and string interpolation in subsequent sections.

String interpolation guide for Ruby

In this section, we are going to talk about string interpolation in Ruby.

What is string interpolation?

So what exactly is string interpolation? Good question. String interpolation is the process of being able to seamlessly integrate dynamic values into a string.

Let's assume we want to slip dynamic words into a string. We can get input from the console and store that input into variables. From there we can call the variables inside of a pre-existing string. For example, let's give a sentence the ability to change based on a user's input:

puts "Name an animal"
animal = gets.chomp
puts "Name a noun"
noun = gets.chomp

p "The quick brown #{animal} jumped over the lazy #{noun}"

Note the way I insert variables inside the string? They are enclosed in curly brackets and are preceded by a # sign. If I run this code, this is what my output will look like:

So, this is how you insert values dynamically into your sentences. If you look at sites like Twitter, they sometimes display personalized messages such as Good morning Jordan or Good evening Tiffany. This type of behavior is made possible by inserting a dynamic value into a fixed part of a string, and it leverages string interpolation.

Now, let's use single quotes instead of double quotes to see what happens. As you'll see, the string is printed as it is, without inserting the values for animal and noun. This is exactly what happens when you try using single quotes: the entire string is printed as it is, without any interpolation.
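To see the two behaviors side by side, here is a small illustrative snippet (the name used is just a placeholder):

name = "Jordan"

puts "Good morning #{name}"   # double quotes interpolate => Good morning Jordan
puts 'Good morning #{name}'   # single quotes do not      => Good morning #{name}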
Therefore, it's important to remember the difference. Another interesting aspect is that anything inside the curly brackets can be a Ruby script. So, technically, you can type your entire algorithm inside these curly brackets, and Ruby will run it perfectly for you. However, this is not recommended for practical programming purposes. For example, I can insert a math equation, and as you'll see, it prints the value out.

String manipulation guide

In this section we are going to learn about string manipulation, along with a number of examples of how to integrate string manipulation methods into a Ruby program.

What is string manipulation?

So what exactly is string manipulation? It's the process of altering the format or value of a string, usually by leveraging string methods.

String manipulation code examples

Let's start with an example. Let's say I want my application to always display the word Astros in capital letters. To do that, I simply write:

"Astros".upcase

Now if I always want a string to be in lowercase letters, I can use the downcase method, like so:

"Astros".downcase

Those are both methods I use quite often. However, there are other string methods that we also have at our disposal. For the rare times when you want to literally swap the case of the letters, you can leverage the swapcase method:

"Astros".swapcase

And lastly, if you want to reverse the order of the letters in the string, we can call the reverse method:

"Astros".reverse

These methods are built into the String data class and we can call them on any string values in Ruby.

Method chaining

Another neat thing we can do is join different methods together to get custom output. For example, I can run:

"Astros".reverse.upcase

The preceding code displays the value SORTSA. This practice of combining different methods with a dot is called method chaining.

Split, strip, and join guides for strings

In this section, we are going to walk through how to use the split and strip methods in Ruby. These methods will help us clean up strings and convert a string to an array so we can access each word as its own value.

Using the strip method

Let's start off by analyzing the strip method. Imagine that the input you get from the user or from the database is poorly formatted and contains white space before and after the value. To clean the data up we can use the strip method. For example:

str = " The quick brown fox jumped over the quick dog "
p str.strip

When you run this code, the output is just the sentence without the white space before and after the words.

Using the split method

Now let's walk through the split method. The split method is a powerful tool that allows you to split a sentence into an array of words or characters. For example, when you type the following code:

str = "The quick brown fox jumped over the quick dog"
p str.split

You'll see that it converts the sentence into an array of words. This method can be particularly useful for long paragraphs, especially when you want to know the number of words in the paragraph. Since the split method converts the string into an array, you can use all the array methods, like size, to see how many words were in the string. We can leverage method chaining to find out how many words are in the string, like so:

str = "The quick brown fox jumped over the quick dog"
p str.split.size

This should return a value of 9, which is the number of words in the sentence.
To know the number of letters, we can pass an optional argument to the split method and use the following format:

str = "The quick brown fox jumped over the quick dog"
p str.split(//).size

And if you want to see all of the individual letters, we can remove the size method call, like this:

p str.split(//)

And your output should look like this:

Notice that it also included spaces as individual characters, which may or may not be what you want a program to return. This method can be quite handy while developing real-world applications. A good practical example of this method is Twitter. Since this social media site restricts users to 140 characters, this method is sure to be a part of the validation code that counts the number of characters in a Tweet.

Using the join method

We've walked through the split method, which allows you to convert a string into a collection of characters. Thankfully, Ruby also has a method that does the opposite, which is to allow you to convert an array of characters into a single string, and that method is called join.

Let's imagine a situation where we're asked to reverse the words in a string. This is a common Ruby coding interview question, so it's an important concept to understand, since it tests your knowledge of how strings work in Ruby. Let's imagine that we have a string, such as:

str = "backwards am I"

And we're asked to reverse the words in the string. The pseudocode for the algorithm would be:

Split the string into words
Reverse the order of the words
Merge all of the split words back into a single string

We can actually accomplish each of these requirements in a single line of Ruby code. The following code snippet will perform the task:

str.split.reverse.join(' ')

This code will convert the single string into an array of strings; for the example, it will equal ["backwards", "am", "I"]. From there it will reverse the order of the array elements, so the array will equal ["I", "am", "backwards"]. With the words reversed, we now simply need to merge the words into a single string, which is where the join method comes in. Running the join method will convert all of the words in the array into one string.

Summary

In this article, we were introduced to the string data type and how it can be utilized in Ruby. We analyzed how to pass strings into Ruby processes by leveraging string interpolation. We also learned the methods of basic string manipulation and how to find and replace string data. We analyzed how to break strings into smaller components, along with how to clean up string-based data. We even introduced the Array class in this article.

Resources for Article:

Further resources on this subject:

Ruby and Metasploit Modules [article]
Find closest mashup plugin with Ruby on Rails [article]
Building tiny Web-applications in Ruby using Sinatra [article]

article-image-command-line-tools
Packt
06 Jul 2017
9 min read
Save for later

Command-Line Tools

Packt
06 Jul 2017
9 min read
In this article by Aaron Torres, author of the book Go Cookbook, we will cover the following recipes:

Using command-line arguments
Working with Unix pipes
An ANSI coloring application

(For more resources related to this topic, see here.)

Using command-line arguments

This article will expand on other uses for these arguments by constructing a command that supports nested subcommands. This will demonstrate Flagsets and also using positional arguments passed into your application. This recipe requires a main function to run. There are a number of third-party packages for dealing with complex nested arguments and flags, but we'll again investigate doing so using only the standard library.

Getting ready

You need to perform the following steps for the installation:

Download and install Go on your operating system at https://golang.org/doc/install and configure your GOPATH.
Open a terminal/console application.
Navigate to your GOPATH/src and create a project directory, for example, $GOPATH/src/github.com/yourusername/customrepo. All code will be run and modified from this directory.
Optionally, install the latest tested version of the code using the go get github.com/agtorre/go-cookbook/ command.

How to do it...

From your terminal/console application, create and navigate to the chapter2/cmdargs directory.
Copy tests from https://github.com/agtorre/go-cookbook/tree/master/chapter2/cmdargs or use this as an exercise to write some of your own.
Create a file called cmdargs.go with the following content:

package main

import (
    "flag"
    "fmt"
    "os"
)

const version = "1.0.0"
const usage = `Usage: %s [command]

Commands:
  Greet
  Version
`
const greetUsage = `Usage: %s greet name [flag]

Positional Arguments:
  name
        the name to greet

Flags:
`

// MenuConf holds all the levels
// for a nested cmd line argument
type MenuConf struct {
    Goodbye bool
}

// SetupMenu initializes the base flags
func (m *MenuConf) SetupMenu() *flag.FlagSet {
    menu := flag.NewFlagSet("menu", flag.ExitOnError)
    menu.Usage = func() {
        fmt.Printf(usage, os.Args[0])
        menu.PrintDefaults()
    }
    return menu
}

// GetSubMenu returns a flag set for a submenu
func (m *MenuConf) GetSubMenu() *flag.FlagSet {
    submenu := flag.NewFlagSet("submenu", flag.ExitOnError)
    submenu.BoolVar(&m.Goodbye, "goodbye", false, "Say goodbye instead of hello")
    submenu.Usage = func() {
        fmt.Printf(greetUsage, os.Args[0])
        submenu.PrintDefaults()
    }
    return submenu
}

// Greet will be invoked by the greet command
func (m *MenuConf) Greet(name string) {
    if m.Goodbye {
        fmt.Println("Goodbye " + name + "!")
    } else {
        fmt.Println("Hello " + name + "!")
    }
}

// Version prints the current version that is
// stored as a const
func (m *MenuConf) Version() {
    fmt.Println("Version: " + version)
}

Create a file called main.go with the following content:

package main

import (
    "fmt"
    "os"
    "strings"
)

func main() {
    c := MenuConf{}
    menu := c.SetupMenu()
    menu.Parse(os.Args[1:])

    // we use arguments to switch between commands
    // flags are also an argument
    if len(os.Args) > 1 {
        // we don't care about case
        switch strings.ToLower(os.Args[1]) {
        case "version":
            c.Version()
        case "greet":
            f := c.GetSubMenu()
            if len(os.Args) < 3 {
                f.Usage()
                return
            }
            if len(os.Args) > 3 {
                f.Parse(os.Args[3:])
            }
            c.Greet(os.Args[2])
        default:
            fmt.Println("Invalid command")
            menu.Usage()
            return
        }
    } else {
        menu.Usage()
        return
    }
}

Run the go build command.
Run the following commands and try a few other combinations of arguments:

$ ./cmdargs -h
Usage: ./cmdargs [command]

Commands:
  Greet
  Version

$ ./cmdargs version
Version: 1.0.0

$ ./cmdargs greet
Usage: ./cmdargs greet name [flag]

Positional Arguments:
  name
        the name to greet

Flags:
  -goodbye
        Say goodbye instead of hello

$ ./cmdargs greet reader
Hello reader!

$ ./cmdargs greet reader -goodbye
Goodbye reader!

If you copied or wrote your own tests, go up one directory and run go test, and ensure all tests pass.

How it works...

Flagsets can be used to set up independent lists of expected arguments, usage strings, and more. The developer is required to do validation on a number of arguments, parse the right subset of arguments into commands, and define usage strings. This can be error prone and requires a lot of iteration to get it completely correct. The flag package makes parsing arguments much easier and includes convenience methods to get the number of flags, arguments, and more. This recipe demonstrates basic ways to construct a complex command-line application using arguments, including a package-level config, required positional arguments, multi-leveled command usage, and how to split these things into multiple files or packages if needed.

Working with Unix pipes

Unix pipes are useful when passing the output of one program to the input of another. Consider the following example:

$ echo "test case" | wc -l
1

In a Go application, the left-hand side of the pipe can be read in using os.Stdin, and it acts like a file descriptor. To demonstrate this, this recipe will take an input on the left-hand side of a pipe and return a list of words and their number of occurrences. These words will be tokenized on white space.

Getting ready

Refer to the Getting ready section of the Using command-line arguments recipe.

How to do it...

From your terminal/console application, create a new directory, chapter2/pipes.
Navigate to that directory and copy tests from https://github.com/agtorre/go-cookbook/tree/master/chapter2/pipes or use this as an exercise to write some of your own.
Create a file called pipes.go with the following content:

package main

import (
    "bufio"
    "fmt"
    "os"
)

// WordCount takes a file and returns a map
// with each word as a key and its number of
// appearances as a value
func WordCount(f *os.File) map[string]int {
    result := make(map[string]int)

    // make a scanner to work on the file
    // io.Reader interface
    scanner := bufio.NewScanner(f)
    scanner.Split(bufio.ScanWords)

    for scanner.Scan() {
        result[scanner.Text()]++
    }

    if err := scanner.Err(); err != nil {
        fmt.Fprintln(os.Stderr, "reading input:", err)
    }

    return result
}

func main() {
    fmt.Printf("string: number_of_occurrences\n\n")
    for key, value := range WordCount(os.Stdin) {
        fmt.Printf("%s: %d\n", key, value)
    }
}

Run echo "some string" | go run pipes.go. You may also run:

go build
echo "some string" | ./pipes

You should see the following output:

$ echo "test case" | go run pipes.go
string: number_of_occurrences

test: 1
case: 1

$ echo "test case test" | go run pipes.go
string: number_of_occurrences

test: 2
case: 1

If you copied or wrote your own tests, go up one directory and run go test, and ensure that all tests pass.

How it works...

Working with pipes in Go is pretty simple, especially if you're familiar with working with files. This recipe uses a scanner to tokenize the io.Reader interface of the os.Stdin file object. You can see how you must check for errors after completing all of the reads.
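One detail worth mentioning: when the program is run without anything piped into it, os.Stdin is attached to the terminal and the scanner will simply block waiting for input. If we wanted to detect that case and print a usage hint instead, a small sketch (not part of the recipe's code) could look like this:

package main

import (
    "fmt"
    "os"
)

func main() {
    // Stat the standard input to see how it is connected.
    info, err := os.Stdin.Stat()
    if err != nil {
        fmt.Fprintln(os.Stderr, "stat stdin:", err)
        os.Exit(1)
    }

    // A character device means an interactive terminal rather than a pipe.
    if info.Mode()&os.ModeCharDevice != 0 {
        fmt.Fprintln(os.Stderr, "usage: pipe some text into this program")
        os.Exit(1)
    }

    fmt.Println("reading from a pipe or redirected file")
}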
An ANSI coloring application

Coloring text in an ANSI terminal application is handled by a variety of codes emitted before and after the section of text that you want colored. This recipe will explore a basic coloring mechanism to color the text red or keep it plain. For a more complete application, take a look at https://github.com/agtorre/gocolorize, which supports many more colors and text types, and implements the fmt.Formatter interface for ease of printing.

Getting ready

Refer to the Getting ready section of the Using command-line arguments recipe.

How to do it...

From your terminal/console application, create and navigate to the chapter2/ansicolor directory.
Copy tests from https://github.com/agtorre/go-cookbook/tree/master/chapter2/ansicolor or use this as an exercise to write some of your own.
Create a file called color.go with the following content:

package ansicolor

import "fmt"

//Color of text
type Color int

const (
    // ColorNone is default
    ColorNone = iota
    // Red colored text
    Red
    // Green colored text
    Green
    // Yellow colored text
    Yellow
    // Blue colored text
    Blue
    // Magenta colored text
    Magenta
    // Cyan colored text
    Cyan
    // White colored text
    White
    // Black colored text
    Black Color = -1
)

// ColorText holds a string and its color
type ColorText struct {
    TextColor Color
    Text      string
}

func (r *ColorText) String() string {
    if r.TextColor == ColorNone {
        return r.Text
    }

    value := 30
    if r.TextColor != Black {
        value += int(r.TextColor)
    }

    return fmt.Sprintf("\033[0;%dm%s\033[0m", value, r.Text)
}

Create a new directory named example.
Navigate to example and then create a file named main.go with the following content. Ensure that you modify the ansicolor import to use the path you set up in step 1:

package main

import (
    "fmt"

    "github.com/agtorre/go-cookbook/chapter2/ansicolor"
)

func main() {
    r := ansicolor.ColorText{ansicolor.Red, "I'm red!"}
    fmt.Println(r.String())

    r.TextColor = ansicolor.Green
    r.Text = "Now I'm green!"
    fmt.Println(r.String())

    r.TextColor = ansicolor.ColorNone
    r.Text = "Back to normal..."
    fmt.Println(r.String())
}

Run go run main.go. Alternatively, you may also run the following:

go build
./example

You should see the following, with the text colored if your terminal supports the ANSI coloring format:

$ go run main.go
I'm red!
Now I'm green!
Back to normal...

If you copied or wrote your own tests, go up one directory and run go test, and ensure that all the tests pass.

How it works...

This application makes use of a struct to maintain the state of the colored text. In this case, it stores the color of the text and the value of the text. The final string is rendered when you call the String() method, which will either return colored text or plain text, depending on the values stored in the struct. By default, the text will be plain.

Summary

In this article, we demonstrated basic ways to construct a complex command-line application using arguments, including a package-level config, required positional arguments, multi-leveled command usage, and how to split these things into multiple files or packages if needed. We saw how to work with Unix pipes and explored a basic coloring mechanism to color text red or keep it plain.

Resources for Article:

Further resources on this subject:

Building a Command-line Tool [article]
A Command-line Companion Called Artisan [article]
Scaffolding with the command-line tool [article]