
How-To Tutorials - Data


FAT Conference 2018 Session 4: Fair Classification

Sugandha Lahoti
23 Feb 2018
7 min read
As algorithms are increasingly used to make decisions of social consequence, the social values encoded in these decision-making procedures are the subject of increasing study, with fairness being a chief concern. The Conference on Fairness, Accountability, and Transparency (FAT), scheduled for Feb 23 and 24 this year in New York, is an annual conference dedicated to bringing together the theory and practice of fair and interpretable machine learning, information retrieval, NLP, computer vision, recommender systems, and other technical disciplines. This year's program includes 17 peer-reviewed papers and 6 tutorials from leading experts in the field. The program is organized into several sessions; Session 4 of the two-day conference, on Saturday, February 24, is devoted to fair classification. In this article, we give our readers a peek into the four papers that have been selected for presentation in Session 4. You can also check out our Session 1, Session 2, and Session 3 summaries in case you've missed them.

The cost of fairness in binary classification

What is the paper about?
This paper provides a simple approach to the fairness-aware learning problem, which involves suitably thresholding class-probability estimates. It was awarded Best Paper in the Technical Contribution category. The authors study the inherent tradeoffs in learning classifiers with a fairness constraint in the form of two questions: What is the best accuracy we can expect for a given level of fairness? What is the nature of these optimal fairness-aware classifiers? The authors show that for cost-sensitive approximate fairness measures, the optimal classifier is an instance-dependent thresholding of the class-probability function. They quantify the degradation in performance by a measure of alignment of the target and sensitive variable. This analysis is then used to derive a simple plugin approach for the fairness problem.

Key takeaways
For fairness-aware learning, the authors design an algorithm targeting a particular measure of fairness. They reduce two popular fairness measures (disparate impact and mean difference) to cost-sensitive risks. They show that for cost-sensitive fairness measures, the optimal fairness-aware classifier is an instance-dependent thresholding of the class-probability function. They quantify the intrinsic, method-independent impact of the fairness requirement on accuracy via a notion of alignment between the target and sensitive feature. The ability to theoretically compute the tradeoffs between fairness and utility is perhaps the most interesting aspect of their technical results. They stress that the tradeoff is intrinsic to the underlying data: any fairness or unfairness is a property of the data, not of any particular technique. They theoretically compute what price one has to pay (in utility) in order to achieve a desired degree of fairness; in other words, they compute the cost of fairness.

Decoupled Classifiers for Group-Fair and Efficient Machine Learning

What is the paper about?
This paper considers how to use a sensitive attribute such as gender or race to maximize fairness and accuracy, assuming that doing so is legal and ethical. Simple approaches may use the raw data, upweight or oversample data from minority groups, or employ advanced methods for fitting linear classifiers that aim to be accurate and fair. However, an inherent tradeoff between accuracy on one group and accuracy on another still prevails. This paper defines and explores decoupled classification systems, in which a separate classifier is trained on each group. The authors present experiments on 47 datasets. The experiments are "semi-synthetic" in the sense that the first binary feature was used as a substitute sensitive feature. The authors found that on many datasets the decoupling algorithm improves performance, while it less often decreases performance.

Key takeaways
The paper describes a simple technical approach for a practitioner using ML to incorporate sensitive attributes. This approach avoids unnecessary accuracy tradeoffs between groups and can accommodate an application-specific objective, generalizing the standard ML notion of loss. For a certain family of "weakly monotonic" fairness objectives, the authors provide a black-box reduction that can use any off-the-shelf classifier to efficiently optimize the objective. This work requires the application designer to pin down a specific loss function that trades off accuracy for fairness. Experiments demonstrate that decoupling can reduce the loss on some datasets for some potentially sensitive features.

A case study of algorithm-assisted decision making in child maltreatment hotline screening decisions

What is the paper about?
The work is based on the use of predictive analytics in the area of child welfare. It won the Best Paper award in the Technical and Interdisciplinary Contribution category. The authors have worked on developing, validating, fairness auditing, and deploying a risk prediction model in Allegheny County, PA, USA. They describe competing models that are being developed in Allegheny County as part of an ongoing redesign process, in comparison to the previous models. Next, they investigate the predictive bias properties of the current tool and a random forest model that has emerged as one of the best-performing competing models. Their predictive bias assessment is motivated both by considerations of human bias and by recent work on fairness criteria. They then discuss some of the challenges in incorporating algorithms into human decision-making processes and reflect on the predictive bias analysis in the context of how the model is actually being used. They also propose an "oracle test" as a tool for clarifying whether particular concerns pertain to the statistical properties of a model or whether these concerns are targeted at other potential deficiencies.

Key takeaways
The goal in Allegheny County is to improve both the accuracy and equity of screening decisions by taking a fairness-aware approach to incorporating prediction models into the decision-making pipeline. The paper reports on the lessons learned so far by the authors, their approaches to predictive bias assessment, and several outstanding challenges in the child maltreatment hotline context. This report contributes to the ongoing conversation concerning the use of algorithms in supporting critical decisions in government, and the importance of considering fairness and discrimination in data-driven decision making. The discussion and general analytic approach are also broadly applicable to other domains where predictive risk modeling may be used.

Fairness in Machine Learning: Lessons from Political Philosophy

What is the paper about?
Moral and political philosophers have expended significant effort in formalizing and defending the central concepts of discrimination, egalitarianism, and justice. It is therefore unsurprising that attempts to formalize 'fairness' in machine learning contain echoes of these old philosophical debates. This paper draws on existing work in moral and political philosophy in order to elucidate emerging debates about fair machine learning. It answers the following questions: What does it mean for a machine learning model to be 'fair', in terms which can be operationalized? Should fairness consist of ensuring everyone has an equal probability of obtaining some benefit, or should we aim instead to minimize the harms to the least advantaged? Can the relevant ideal be determined by reference to some alternative state of affairs in which a particular social pattern of discrimination does not exist?

Key takeaways
This paper provides an overview of some of the relevant philosophical literature on discrimination, fairness, and egalitarianism in order to clarify and situate the emerging debate within the fair machine learning literature. The author addresses the conceptual distinctions drawn between terms frequently used in the fair ML literature, including 'discrimination' and 'fairness', and the use of related terms in the philosophical literature. He suggests that 'fairness' as used in the fair machine learning community is best understood as a placeholder term for a variety of normative egalitarian considerations. He also provides an overview of implications for the incorporation of 'fairness' into algorithmic decision-making systems.

We hope you like the coverage of Session 4. Don't miss our coverage of Session 5 on FAT recommenders and more.


Working with pandas DataFrames

Sugandha Lahoti
23 Feb 2018
15 min read
This article is an excerpt from the book Python Data Analysis - Second Edition written by Armando Fandango. From this book, you will learn how to process and manipulate data with Python for complex data analysis and modeling. The code bundle for this article is hosted on GitHub.

The popular open source Python library, pandas, is named after panel data (an econometric term) and Python data analysis. We shall learn about basic pandas functionalities, data structures, and operations in this article. The official pandas documentation insists on naming the project pandas in all lowercase letters. The other convention the pandas project insists on is the import pandas as pd import statement. We will follow these conventions in this text. In this tutorial, we will install and explore pandas. We will also acquaint ourselves with a central pandas data structure, the DataFrame.

Installing and exploring pandas
The minimal dependency requirements for pandas are as follows: NumPy: This is the fundamental numerical array package that we installed and covered extensively in the preceding chapters. python-dateutil: This is a date handling library. pytz: This handles time zone definitions. This list is the bare minimum; a longer list of optional dependencies can be located at http://pandas.pydata.org/pandas-docs/stable/install.html. We can install pandas via PyPI with pip or easy_install, using a binary installer, with the aid of our operating system package manager, or from the source by checking out the code. The binary installers can be downloaded from http://pandas.pydata.org/getpandas.html. The command to install pandas with pip is as follows: $ pip3 install pandas rpy2 (rpy2 is an interface to R and is required because rpy is being deprecated). You may have to prepend the preceding command with sudo if your user account doesn't have sufficient rights.

The pandas DataFrames
A pandas DataFrame is a labeled two-dimensional data structure and is similar in spirit to a worksheet in Google Sheets or Microsoft Excel, or a relational database table. The columns in a pandas DataFrame can be of different types. A similar concept, by the way, was invented originally in the R programming language. (For more information, refer to http://www.r-tutor.com/r-introduction/data-frame). A DataFrame can be created in the following ways: using another DataFrame; using a NumPy array or a composite of arrays that has a two-dimensional shape; from another pandas data structure called Series (we will learn about Series in the following section); from a file, such as a CSV file; or from a dictionary of one-dimensional structures, such as one-dimensional NumPy arrays, lists, dicts, or pandas Series. As an example, we will use data that can be retrieved from http://www.exploredata.net/Downloads/WHO-Data-Set. The original data file is quite large and has many columns, so we will use an edited file instead, which only contains the first nine columns and is called WHO_first9cols.csv; the file is in the code bundle of this book.
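Before turning to that file, here is a minimal, hypothetical sketch of two of the creation routes listed above: building a DataFrame from a dictionary of lists and from a two-dimensional NumPy array. The column and row values are made up purely for illustration and are not part of the WHO dataset.

```python
import numpy as np
import pandas as pd

# From a dictionary of one-dimensional structures (lists, arrays, or Series).
df_from_dict = pd.DataFrame({
    "Country": ["Afghanistan", "Albania", "Algeria"],
    "Continent": [1, 2, 3],
})

# From a two-dimensional NumPy array, with explicit column labels.
df_from_array = pd.DataFrame(np.arange(6).reshape(2, 3),
                             columns=["a", "b", "c"])

print(df_from_dict)
print(df_from_array)
```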
These are the first two lines of WHO_first9cols.csv, including the header:

Country,CountryID,Continent,Adolescent fertility rate (%),Adult literacy rate (%),Gross national income per capita (PPP international $),Net primary school enrolment ratio female (%),Net primary school enrolment ratio male (%),Population (in thousands) total
Afghanistan,1,1,151,28,,,,26088

In the next steps, we will take a look at a pandas DataFrame and its attributes. To kick off, load the data file into a DataFrame and print it on the screen: from pandas.io.parsers import read_csv df = read_csv("WHO_first9cols.csv") print("Dataframe", df) The printout is a summary of the DataFrame. It is too long to be displayed entirely, so we will just grab the last few lines: 199 21732.0 200 11696.0 201 13228.0 [202 rows x 9 columns] The DataFrame has an attribute that holds its shape as a tuple, similar to ndarray. Query the number of rows of a DataFrame as follows: print("Shape", df.shape) print("Length", len(df)) The values we obtain comply with the printout of the preceding step: Shape (202, 9) Length 202 Check the column headers and data types with the other attributes: print("Column Headers", df.columns) print("Data types", df.dtypes) We receive the column headers in a special data structure: Column Headers Index([u'Country', u'CountryID', u'Continent', u'Adolescent fertility rate (%)', u'Adult literacy rate (%)', u'Gross national income per capita (PPP international $)', u'Net primary school enrolment ratio female (%)', u'Net primary school enrolment ratio male (%)', u'Population (in thousands) total'], dtype='object') The data types are printed in a similar fashion. The pandas DataFrame has an index, which is like the primary key of relational database tables. We can either specify the index or have pandas create it automatically. The index can be accessed with a corresponding property, as follows: print("Index", df.index) An index helps us search for items quickly, just like the index in this book. In our case, the index is a wrapper around an array starting at 0, with an increment of one for each row. Sometimes, we wish to iterate over the underlying data of a DataFrame. Iterating over column values can be inefficient if we utilize the pandas iterators. It's much better to extract the underlying NumPy arrays and work with those. The pandas DataFrame has an attribute that can aid with this as well: print("Values", df.values) Please note that some values are designated nan in the output, for 'not a number'. These values come from empty fields in the input data file. The preceding code is available in the Python Notebook ch-03.ipynb, available in the code bundle of this book.

Querying data in pandas
Since a pandas DataFrame is structured in a similar way to a relational database, we can view operations that read data from a DataFrame as a query. In this example, we will retrieve the annual sunspot data from Quandl. We can either use the Quandl API or download the data manually as a CSV file from http://www.quandl.com/SIDC/SUNSPOTS_A-Sunspot-Numbers-Annual. If you want to install the API, you can do so by downloading installers from https://pypi.python.org/pypi/Quandl or by running the following command: $ pip3 install Quandl Using the API is free, but is limited to 50 API calls per day. If you require more API calls, you will have to request an authentication key. The code in this tutorial is not using a key. It should be simple to change the code to either use a key or read a downloaded CSV file.
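If you prefer the downloaded-CSV route, a small sketch is shown below. The file name and column layout are assumptions based on the description above (they are not fixed by this article), so adjust them to match the file you actually download.

```python
import pandas as pd

# Hypothetical file name for the manually downloaded annual sunspot data.
# The first column is assumed to hold the year-end dates, which we parse
# and use as the DataFrame index, mirroring what the Quandl API returns.
sunspots = pd.read_csv("SIDC-SUNSPOTS_A.csv",
                       index_col=0, parse_dates=True)

print(sunspots.head(2))
print(sunspots.tail(2))
```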
If you have difficulties, search through the Python docs at https://docs.python.org/2/. Without further preamble, let's take a look at how to query data in a pandas DataFrame. As a first step, we obviously have to download the data. After importing the Quandl API, get the data as follows: import quandl # Data from http://www.quandl.com/SIDC/SUNSPOTS_A-Sunspot-Numbers-Annual # PyPi url https://pypi.python.org/pypi/Quandl sunspots = quandl.get("SIDC/SUNSPOTS_A") The head() and tail() methods have a purpose similar to that of the Unix commands with the same name. Select the first n and last n records of a DataFrame, where n is an integer parameter: print("Head 2", sunspots.head(2)) print("Tail 2", sunspots.tail(2)) This gives us the first two and last two rows of the sunspot data (for the sake of brevity we have not shown all the columns here; your output will have all the columns from the dataset): Head 2 Number Year 1700-12-31 5 1701-12-31 11 [2 rows x 1 columns] Tail 2 Number Year 2012-12-31 57.7 2013-12-31 64.9 [2 rows x 1 columns] Please note that we only have one column holding the number of sunspots per year. The dates are a part of the DataFrame index. The following is the query for the last value using the last date: last_date = sunspots.index[-1] print("Last value", sunspots.loc[last_date]) You can check the following output with the result from the previous step: Last value Number 64.9 Name: 2013-12-31 00:00:00, dtype: float64 Query the date with date strings in the YYYYMMDD format as follows: print("Values slice by date:\n", sunspots["20020101": "20131231"]) This gives the records from 2002 through to 2013: Values slice by date Number Year 2002-12-31 104.0 [TRUNCATED] 2013-12-31 64.9 [12 rows x 1 columns] A list of indices can be used to query as well: print("Slice from a list of indices:\n", sunspots.iloc[[2, 4, -4, -2]]) The preceding code selects the following rows: Slice from a list of indices Number Year 1702-12-31 16.0 1704-12-31 36.0 2010-12-31 16.0 2012-12-31 57.7 [4 rows x 1 columns] To select scalar values, we have two options. The second option given here should be faster. Two integers are required, the first for the row and the second for the column: print("Scalar with iloc:", sunspots.iloc[0, 0]) print("Scalar with iat", sunspots.iat[1, 0]) This gives us the first and second values of the dataset as scalars: Scalar with iloc 5.0 Scalar with iat 11.0 Querying with Booleans works much like the WHERE clause of SQL. The following code queries for values larger than the arithmetic mean. Note that there is a difference between when we perform the query on the whole DataFrame and when we perform it on a single column: print("Boolean selection", sunspots[sunspots > sunspots.mean()]) print("Boolean selection with column label:\n", sunspots[sunspots['Number'] > sunspots['Number'].mean()]) The notable difference is that the first query yields all the rows, with the rows not conforming to the condition holding a value of NaN. The second query returns only the rows where the value is larger than the mean: Boolean selection Number Year 1700-12-31 NaN [TRUNCATED] 1759-12-31 54.0 ... [314 rows x 1 columns] Boolean selection with column label Number Year 1705-12-31 58.0 [TRUNCATED] 1870-12-31 139.1 ...
[127 rows x 1 columns] The preceding example code is in the ch_03.ipynb file of this book's code bundle.

Data aggregation with pandas DataFrames
Data aggregation is a term used in the field of relational databases. In a database query, we can group data by the value in a column or columns. We can then perform various operations on each of these groups. The pandas DataFrame has similar capabilities. We will generate data held in a Python dict and then use this data to create a pandas DataFrame. We will then practice the pandas aggregation features. Seed the NumPy random generator to make sure that the generated data will not differ between repeated program runs. The data will have four columns: Weather (a string), Food (also a string), Price (a random float), and Number (a random integer between one and nine). The use case is that we have the results of some sort of consumer-purchase research, combined with weather and market pricing, where we calculate the average of prices and keep track of the sample size and parameters: import pandas as pd from numpy.random import seed from numpy.random import rand from numpy.random import randint import numpy as np seed(42) df = pd.DataFrame({'Weather' : ['cold', 'hot', 'cold', 'hot', 'cold', 'hot', 'cold'], 'Food' : ['soup', 'soup', 'icecream', 'chocolate', 'icecream', 'icecream', 'soup'], 'Price' : 10 * rand(7), 'Number' : randint(1, 9, 7)}) print(df) You should get an output with seven rows. Please note that the column labels come from the lexically ordered keys of the Python dict. Lexical or lexicographical order is based on the alphabetic order of characters in a string. Group the data by the Weather column and then iterate through the groups as follows: weather_group = df.groupby('Weather') i = 0 for name, group in weather_group: i = i + 1 print("Group", i, name) print(group) We have two types of weather, hot and cold, so we get two groups. The weather_group variable is a special pandas object that we get as a result of the groupby() method. This object has aggregation methods, which are demonstrated as follows: print("Weather group first\n", weather_group.first()) print("Weather group last\n", weather_group.last()) print("Weather group mean\n", weather_group.mean()) The preceding code snippet prints the first row, last row, and mean of each group. Just as in a database query, we are allowed to group on multiple columns. The groups attribute will then tell us the groups that are formed, as well as the rows in each group: wf_group = df.groupby(['Weather', 'Food']) print("WF Groups", wf_group.groups) For each possible combination of weather and food values, a new group is created. The membership of each row is indicated by their index values as follows: WF Groups {('hot', 'chocolate'): [3], ('cold', 'icecream'): [2, 4], ('hot', 'icecream'): [5], ('hot', 'soup'): [1], ('cold', 'soup'): [0, 6]} Apply a list of NumPy functions on groups with the agg() method: print("WF Aggregated\n", wf_group.agg([np.mean, np.median])) Obviously, we could apply even more functions, but the output would look messier.

Concatenating and appending DataFrames
The pandas DataFrame allows operations that are similar to the inner and outer joins of database tables. We can append and concatenate rows as well. To practice appending and concatenating of rows, we will reuse the DataFrame from the previous section.
Let's select the first three rows: print("df :3\n", df[:3]) Check that these are indeed the first three rows: df :3 Food Number Price Weather 0 soup 8 3.745401 cold 1 soup 5 9.507143 hot 2 icecream 4 7.319939 cold The concat() function concatenates DataFrames. For example, we can concatenate a DataFrame that consists of three rows to the rest of the rows, in order to recreate the original DataFrame: print("Concat Back together\n", pd.concat([df[:3], df[3:]])) The concatenation output appears as follows: Concat Back together Food Number Price Weather 0 soup 8 3.745401 cold 1 soup 5 9.507143 hot 2 icecream 4 7.319939 cold 3 chocolate 8 5.986585 hot 4 icecream 8 1.560186 cold 5 icecream 3 1.559945 hot 6 soup 6 0.580836 cold [7 rows x 4 columns] To append rows, use the append() function: print("Appending rows\n", df[:3].append(df[5:])) The result is a DataFrame with the first three rows of the original DataFrame and the last two rows appended to it: Appending rows Food Number Price Weather 0 soup 8 3.745401 cold 1 soup 5 9.507143 hot 2 icecream 4 7.319939 cold 5 icecream 3 1.559945 hot 6 soup 6 0.580836 cold [5 rows x 4 columns]

Joining DataFrames
To demonstrate joining, we will use two CSV files, dest.csv and tips.csv. The use case behind it is that we are running a taxi company. Every time a passenger is dropped off at his or her destination, we add a row to the dest.csv file with the employee number of the driver and the destination:

EmpNr,Dest
5,The Hague
3,Amsterdam
9,Rotterdam

Sometimes drivers get a tip, so we want that registered in the tips.csv file (if this doesn't seem realistic, please feel free to come up with your own story):

EmpNr,Amount
5,10
9,5
7,2.5

Database-like joins in pandas can be done with either the merge() function or the join() DataFrame method. The join() method joins onto indices by default, which might not be what you want. In SQL, a relational database query language, we have the inner join, left outer join, right outer join, and full outer join. An inner join selects rows from two tables, if and only if values match, for columns specified in the join condition. Outer joins do not require a match, and can potentially return more rows. More information on joins can be found at http://en.wikipedia.org/wiki/Join_%28SQL%29.
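The two CSV files ship with the book's code bundle; if they are not at hand, a minimal sketch like the following builds equivalent DataFrames in memory. The variable names dests and tips simply match the ones used in the join steps that follow.

```python
import pandas as pd

# In-memory stand-ins for dest.csv and tips.csv.
dests = pd.DataFrame({"EmpNr": [5, 3, 9],
                      "Dest": ["The Hague", "Amsterdam", "Rotterdam"]})
tips = pd.DataFrame({"EmpNr": [5, 9, 7],
                     "Amount": [10, 5, 2.5]})

print(dests)
print(tips)
```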
All these join types are supported by pandas, but we will only take a look at inner joins and full outer joins. A join on the employee number with the merge() function is performed as follows: print("Merge() on key\n", pd.merge(dests, tips, on='EmpNr')) This gives an inner join as the outcome: Merge() on key EmpNr Dest Amount 0 5 The Hague 10 1 9 Rotterdam 5 [2 rows x 3 columns] Joining with the join() method requires providing suffixes for the left and right operands: print("Dests join() tips\n", dests.join(tips, lsuffix='Dest', rsuffix='Tips')) This method call joins index values so that the result is different from an SQL inner join: Dests join() tips EmpNrDest Dest EmpNrTips Amount 0 5 The Hague 5 10.0 1 3 Amsterdam 9 5.0 2 9 Rotterdam 7 2.5 [3 rows x 4 columns] An even more explicit way to execute an inner join with merge() is as follows: print("Inner join with merge()\n", pd.merge(dests, tips, how='inner')) The output is as follows: Inner join with merge() EmpNr Dest Amount 0 5 The Hague 10 1 9 Rotterdam 5 [2 rows x 3 columns] To make this a full outer join requires only a small change: print("Outer join\n", pd.merge(dests, tips, how='outer')) The outer join adds rows with NaN values: Outer join EmpNr Dest Amount 0 5 The Hague 10.0 1 3 Amsterdam NaN 2 9 Rotterdam 5.0 3 7 NaN 2.5 [4 rows x 3 columns] In a relational database query, these values would have been set to NULL. The demo code is in the ch-03.ipynb file of this book's code bundle. We learnt how to perform various data manipulation techniques such as aggregating, concatenating, appending, cleaning, and handling missing values, with pandas. If you found this post useful, check out the book Python Data Analysis - Second Edition to learn advanced topics such as signal processing, textual data analysis, machine learning, and more.


How to Configure Metricbeat for Application and Server infrastructure

Pravin Dhandre
23 Feb 2018
8 min read
This article is an excerpt from a book written by Pranav Shukla and Sharath Kumar M N titled Learning Elastic Stack 6.0. This book provides a detailed understanding of how you can employ the Elastic Stack to perform distributed analytics and resolve various data processing challenges.

In today's tutorial, we will show the step-by-step configuration of Metricbeat, a Beats platform for monitoring server and application infrastructure.

Configuring Metricbeat
The configurations related to Metricbeat are stored in a configuration file named metricbeat.yml, and it uses YAML syntax. The metricbeat.yml file contains the following: module configuration, general settings, output configuration, processor configuration, path configuration, dashboard configuration, and logging configuration. Let's explore some of these sections.

Module configuration
Metricbeat comes bundled with various modules to collect metrics from the system and applications such as Apache, MongoDB, Redis, MySQL, and so on. Metricbeat provides two ways of enabling modules and metricsets: enabling module configs in the modules.d directory, and enabling module configs in the metricbeat.yml file.

Enabling module configs in the modules.d directory
The modules.d directory contains default configurations for all the modules available in Metricbeat. The configuration specific to a module is stored in a .yml file, with the name of the file being the name of the module. For example, the configuration related to the MySQL module would be stored in the mysql.yml file. By default, all modules except the system module are disabled. To list the modules that are available in Metricbeat, execute the following command: Windows: D:\packt\metricbeat-6.0.0-windows-x86_64>metricbeat.exe modules list Linux: [locationOfMetricBeat]$ ./metricbeat modules list The modules list command displays all the available modules and also lists which modules are currently enabled/disabled. As each module comes with default configurations, make the appropriate changes in the module configuration file. The basic configuration for the mongodb module will look as follows: - module: mongodb metricsets: ["dbstats", "status"] period: 10s hosts: ["localhost:27017"] username: user password: pass To enable modules, execute the modules enable command, passing one or more module names. For example: Windows: D:\packt\metricbeat-6.0.0-windows-x86_64>metricbeat.exe modules enable redis mongodb Linux: [locationOfMetricBeat]$ ./metricbeat modules enable redis mongodb Similarly, to disable modules, execute the modules disable command, passing one or more module names to it. For example: Windows: D:\packt\metricbeat-6.0.0-windows-x86_64>metricbeat.exe modules disable redis mongodb Linux: [locationOfMetricBeat]$ ./metricbeat modules disable redis mongodb To enable dynamic config reloading, set reload.enabled to true; to specify the frequency at which to look for config file changes, set the reload.period parameter under the metricbeat.config.modules property. For example: #metricbeat.yml metricbeat.config.modules: path: ${path.config}/modules.d/*.yml reload.enabled: true reload.period: 20s

Enabling module config in the metricbeat.yml file
If one is used to earlier versions of Metricbeat, one can enable the modules and metricsets in the metricbeat.yml file directly by adding entries to the metricbeat.modules list. Each entry in the list begins with a dash (-) and is followed by the settings for that module.
For example: metricbeat.modules: #------------------ Memcached Module ----------------------------- - module: memcached metricsets: ["stats"] period: 10s hosts: ["localhost:11211"] #------------------- MongoDB Module ------------------------------ - module: mongodb metricsets: ["dbstats", "status"] period: 5s It is possible to specify a module multiple times and specify a different period to use for one or more metricsets. For example: #------- Couchbase Module ----------------------------- - module: couchbase metricsets: ["bucket"] period: 15s hosts: ["localhost:8091"] - module: couchbase metricsets: ["cluster", "node"] period: 30s hosts: ["localhost:8091"]

General settings
This section contains configuration options and some general settings to control the behavior of Metricbeat. Some of the configuration options/settings are: name: The name of the shipper that publishes the network data. By default, the hostname is used for this field: name: "dc1-host1" tags: The list of tags that will be included in the tags field of every event Metricbeat ships. Tags make it easy to group servers by different logical properties and help when filtering events in Kibana and Logstash: tags: ["staging", "web-tier", "dc1"] max_procs: The maximum number of CPUs that can be executing simultaneously. The default is the number of logical CPUs available in the system: max_procs: 2

Output configuration
This section is used to configure outputs where the events need to be shipped. Events can be sent to single or multiple outputs simultaneously. The allowed outputs are Elasticsearch, Logstash, Kafka, Redis, file, and console. Some of the outputs that can be configured are as follows: elasticsearch: It is used to send the events directly to Elasticsearch. A sample Elasticsearch output configuration is shown in the following code snippet: output.elasticsearch: enabled: true hosts: ["localhost:9200"] Using the enabled setting, one can enable or disable the output. hosts accepts one or more Elasticsearch nodes/servers. Multiple hosts can be defined for failover purposes. When multiple hosts are configured, the events are distributed to these nodes in round-robin order. If Elasticsearch is secured, then the credentials can be passed using the username and password settings: output.elasticsearch: enabled: true hosts: ["localhost:9200"] username: "elasticuser" password: "password" To ship the events to an Elasticsearch ingest node pipeline so that they can be pre-processed before being stored in Elasticsearch, the pipeline information can be provided using the pipeline setting: output.elasticsearch: enabled: true hosts: ["localhost:9200"] pipeline: "ngnix_log_pipeline" The default index the data gets written to is of the format metricbeat-%{[beat.version]}-%{+yyyy.MM.dd}. This will create a new index every day. For example, if today is December 2, 2017, then all the events are placed in the metricbeat-6.0.0-2017.12.02 index. One can override the index name or the pattern using the index setting. In the following configuration snippet, a new index is created for every month: output.elasticsearch: hosts: ["http://localhost:9200"] index: "metricbeat-%{[beat.version]}-%{+yyyy.MM}" Using the indices setting, one can conditionally place the events in the appropriate index that matches the specified condition. In the following code snippet, if the message contains the DEBUG string, it will be placed in the debug-%{+yyyy.MM.dd} index. If the message contains the ERR string, it will be placed in the error-%{+yyyy.MM.dd} index.
If the message contains neither of these texts, then those events will be pushed to the logs-%{+yyyy.MM.dd} index, as specified in the index parameter: output.elasticsearch: hosts: ["http://localhost:9200"] index: "logs-%{+yyyy.MM.dd}" indices: - index: "debug-%{+yyyy.MM.dd}" when.contains: message: "DEBUG" - index: "error-%{+yyyy.MM.dd}" when.contains: message: "ERR" When the index parameter is overridden, disable templates and dashboards by adding the following settings: setup.dashboards.enabled: false setup.template.enabled: false Alternatively, provide the values for setup.template.name and setup.template.pattern in the metricbeat.yml configuration file, or else Metricbeat will fail to run. logstash: It is used to send the events to Logstash. To use Logstash as the output, Logstash needs to be configured with the Beats input plugin to receive incoming Beats events. A sample Logstash output configuration is as follows: output.logstash: enabled: true hosts: ["localhost:5044"] Using the enabled setting, one can enable or disable the output. hosts accepts one or more Logstash servers. Multiple hosts can be defined for failover purposes. If the configured host is unresponsive, then the event will be sent to one of the other configured hosts. When multiple hosts are configured, the events are distributed in random order. To enable load balancing of events across the Logstash hosts, use the loadbalance flag, set to true: output.logstash: hosts: ["localhost:5045", "localhost:5046"] loadbalance: true console: It is used to send the events to stdout. The events are written in JSON format. It is useful during debugging or testing. A sample console configuration is as follows: output.console: enabled: true pretty: true

Logging
This section contains the options for configuring the Metricbeat logging output. The logging system can write logs to syslog or rotate log files. If logging is not explicitly configured, file output is used on Windows systems, and syslog output is used on Linux and OS X. A sample configuration is as follows: logging.level: debug logging.to_files: true logging.files: path: C:\logs\metricbeat name: metricbeat.log keepfiles: 10 Some of the configuration options are: level: To specify the logging level. to_files: To write all logging output to files. The files are subject to file rotation. This is the default value. to_syslog: To write the logging output to syslog if this setting is set to true. files.path, files.name, and files.keepfiles: These are used to specify the location of the log files, the name of the files, and the number of most recently rotated log files to keep.

We successfully configured the Beats library Metricbeat and established transmission of operational metrics to Elasticsearch, making it easy to monitor systems and services on servers. If you found this tutorial useful, do check out the book Learning Elastic Stack 6.0 to examine the fundamentals of the Elastic Stack in detail and start developing solutions for problems like logging, site search, app search, metrics, and more.
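As a quick sanity check that the Elasticsearch output configured above is actually receiving events, one option (an assumption on our part, not part of the book excerpt) is to hit Elasticsearch's REST API from a small Python script and count the documents in the metricbeat-* indices:

```python
import requests

# Assumes the Elasticsearch output shown above (http://localhost:9200)
# and that the third-party `requests` package is installed.
resp = requests.get("http://localhost:9200/metricbeat-*/_count")
resp.raise_for_status()

# The _count API returns JSON such as {"count": 1234, ...}.
print("Metricbeat documents indexed:", resp.json()["count"])
```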


How to implement Dynamic SQL in PostgreSQL 10

Amey Varangaonkar
23 Feb 2018
7 min read
In this PostgreSQL tutorial, we'll take a close look at the concept of dynamic SQL, and how it can make the life of database programmers easy by allowing efficient querying of data. This tutorial has been taken from the second edition of Learning PostgreSQL 10. You can read more here. Dynamic SQL is used to reduce repetitive tasks when it comes to querying. For example, one could use dynamic SQL to create table partitioning for a certain table on a daily basis, to add missing indexes on all foreign keys, or to add data auditing capabilities to a certain table without major coding effort. Another important use of dynamic SQL is to overcome the side effects of PL/pgSQL caching, as queries executed using the EXECUTE statement are not cached. Dynamic SQL is achieved via the EXECUTE statement. The EXECUTE statement accepts a string and simply evaluates it. The synopsis for executing a statement is given as follows: EXECUTE command-string [ INTO [STRICT] target ] [ USING expression [, ...] ];

Executing DDL statements in dynamic SQL
In some cases, one needs to perform operations at the database object level, such as tables, indexes, columns, roles, and so on. For example, a database developer might like to vacuum and analyze a specific schema's objects, which is a common task after deployment in order to update the statistics. For example, to analyze the car_portal_app schema tables, one could write the following script: DO $$ DECLARE table_name text; BEGIN FOR table_name IN SELECT tablename FROM pg_tables WHERE schemaname ='car_portal_app' LOOP RAISE NOTICE 'Analyzing %', table_name; EXECUTE 'ANALYZE car_portal_app.' || table_name; END LOOP; END; $$;

Executing DML statements in dynamic SQL
Some applications might interact with data in an interactive manner. For example, one might have billing data generated on a monthly basis. Also, some applications filter data on different criteria defined by the user. In such cases, dynamic SQL is very convenient. For example, in the car portal application, the search functionality is needed to get accounts using a dynamic predicate, as follows: CREATE OR REPLACE FUNCTION car_portal_app.get_account (predicate TEXT) RETURNS SETOF car_portal_app.account AS $$ BEGIN RETURN QUERY EXECUTE 'SELECT * FROM car_portal_app.account WHERE ' || predicate; END; $$ LANGUAGE plpgsql; To test the previous function: car_portal=> SELECT * FROM car_portal_app.get_account ('true') limit 1; account_id | first_name | last_name | email | password ------------+------------+-----------+-----------------+---------------------------------- 1 | James | Butt | jbutt@gmail.com | 1b9ef408e82e38346e6ebebf2dcc5ece (1 row) car_portal=> SELECT * FROM car_portal_app.get_account (E'first_name=\'James\''); account_id | first_name | last_name | email | password ------------+------------+-----------+-----------------+---------------------------------- 1 | James | Butt | jbutt@gmail.com | 1b9ef408e82e38346e6ebebf2dcc5ece (1 row)

Dynamic SQL and the caching effect
As mentioned earlier, PL/pgSQL caches execution plans. This is quite good if the generated plan is expected to be static. For example, the following statement is expected to use an index scan because of selectivity. In this case, caching the plan saves some time and thus increases performance: SELECT * FROM account WHERE account_id =<INT> In other scenarios, however, this is not true.
For example, let's assume we have an index on the advertisement_date column and we would like to get the number of advertisements since a certain date, as follows: SELECT count (*) FROM car_portal_app.advertisement WHERE advertisement_date >= <certain_date>; In the preceding query, the entries from the advertisement table can be fetched from the hard disk either by using the index scan or by using the sequential scan, based on selectivity, which depends on the provided certain_date value. Caching the execution plan of such a query will cause serious problems; thus, writing the function as follows is not a good idea: CREATE OR REPLACE FUNCTION car_portal_app.get_advertisement_count (some_date timestamptz ) RETURNS BIGINT AS $$ BEGIN RETURN (SELECT count (*) FROM car_portal_app.advertisement WHERE advertisement_date >=some_date)::bigint; END; $$ LANGUAGE plpgsql; To solve the caching issue, one could rewrite the previous function either using an SQL language function or by using the PL/pgSQL EXECUTE command, as follows: CREATE OR REPLACE FUNCTION car_portal_app.get_advertisement_count (some_date timestamptz ) RETURNS BIGINT AS $$ DECLARE count BIGINT; BEGIN EXECUTE 'SELECT count (*) FROM car_portal_app.advertisement WHERE advertisement_date >= $1' USING some_date INTO count; RETURN count; END; $$ LANGUAGE plpgsql;

Recommended practices for dynamic SQL usage
Dynamic SQL can cause security issues if not handled carefully; dynamic SQL is vulnerable to the SQL injection technique. SQL injection is used to execute SQL statements that reveal secure information, or even to destroy data in a database. A very simple example of a PL/pgSQL function vulnerable to SQL injection is as follows: CREATE OR REPLACE FUNCTION car_portal_app.can_login (email text, pass text) RETURNS BOOLEAN AS $$ DECLARE stmt TEXT; result bool; BEGIN stmt = E'SELECT COALESCE (count(*)=1, false) FROM car_portal_app.account WHERE email = \''|| $1 || E'\' and password = \''||$2||E'\''; RAISE NOTICE '%' , stmt; EXECUTE stmt INTO result; RETURN result; END; $$ LANGUAGE plpgsql; The preceding function returns true if the email and the password match. To test this function, let's insert a row and try to inject some code, as follows: car_portal=> SELECT car_portal_app.can_login('jbutt@gmail.com', md5('jbutt@gmail.com')); NOTICE: SELECT COALESCE (count(*)=1, false) FROM account WHERE email = 'jbutt@gmail.com' and password = '1b9ef408e82e38346e6ebebf2dcc5ece' can_login ----------- t (1 row) car_portal=> SELECT car_portal_app.can_login('jbutt@gmail.com', md5('jbutt@yahoo.com')); NOTICE: SELECT COALESCE (count(*)=1, false) FROM account WHERE email = 'jbutt@gmail.com' and password = '37eb43e4d439589d274b6f921b1e4a0d' can_login ----------- f (1 row) car_portal=> SELECT car_portal_app.can_login(E'jbutt@gmail.com\'--', 'Do not know password'); NOTICE: SELECT COALESCE (count(*)=1, false) FROM account WHERE email = 'jbutt@gmail.com'--' and password = 'Do not know password' can_login ----------- t (1 row) Notice that the function returns true even when the password does not match the password stored in the table. This is simply because the predicate was commented out, as shown by the raised notice: SELECT COALESCE (count(*)=1, false) FROM account WHERE email = 'jbutt@gmail.com'--' and password = 'Do not know password' To protect code against this technique, one could follow these practices: For parameterized dynamic SQL statements, use the USING clause. Use the format function with appropriate interpolation to construct your queries.
Note that %I escapes the argument as an identifier and %L as a literal. Use quote_ident(), quote_literal(), and quote_nullable() to properly format your identifiers and literals. One way to write the preceding function is as follows: CREATE OR REPLACE FUNCTION car_portal_app.can_login (email text, pass text) RETURNS BOOLEAN AS $$ DECLARE stmt TEXT; result bool; BEGIN stmt = format('SELECT COALESCE (count(*)=1, false) FROM car_portal_app.account WHERE email = %L and password = %L', $1,$2); RAISE NOTICE '%' , stmt; EXECUTE stmt INTO result; RETURN result; END; $$ LANGUAGE plpgsql; We saw how dynamic SQL is used to build and execute queries on the fly. Unlike a static SQL statement, a dynamic SQL statement's full text is unknown and can change between successive executions. These queries can be DDL, DCL, and/or DML statements. If you found this article useful, make sure to check out the book Learning PostgreSQL 10, to learn the fundamentals of PostgreSQL 10.
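The same injection caution applies to application code that builds SQL strings before sending them to PostgreSQL. As a hedged illustration, not taken from the book, the sketch below assumes the third-party psycopg2 driver and the car_portal_app.account table used above, and shows parameters being passed separately from the SQL text rather than concatenated into it:

```python
import psycopg2

# Connection details are placeholders; adjust them to your environment.
conn = psycopg2.connect(dbname="car_portal", user="car_portal_app")

email = "jbutt@gmail.com'--"   # a malicious-looking input
password = "Do not know password"

with conn, conn.cursor() as cur:
    # The driver sends the values separately from the SQL text, so the
    # quote and comment characters cannot alter the query structure.
    cur.execute(
        "SELECT count(*) = 1 FROM car_portal_app.account "
        "WHERE email = %s AND password = %s",
        (email, password),
    )
    print(cur.fetchone()[0])   # False for the injected input
```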


Getting Started with Apache Kafka Clusters

Amarabha Banerjee
23 Feb 2018
10 min read
This article is a book excerpt from Apache Kafka 1.0 Cookbook written by Raúl Estrada. This book contains easy-to-follow recipes to help you set up, configure, and use Apache Kafka in the best possible manner.

Here in this article, we are going to talk about how you can get started with Apache Kafka clusters and implement them seamlessly. In Apache Kafka there are three types of clusters: single-node single-broker, single-node multiple-broker, and multiple-node multiple-broker. The following four recipes show how to run Apache Kafka in these clusters.

Configuring a single-node single-broker cluster – SNSB
The first cluster configuration is single-node single-broker (SNSB). This cluster is very useful when a single point of entry is needed. Yes, its architecture resembles the singleton design pattern. An SNSB cluster usually satisfies three requirements: it controls concurrent access to a unique shared broker; access to the broker is requested from multiple, disparate producers; and there can be only one broker. If the proposed design has only one or two of these requirements, a redesign is almost always the correct option. Sometimes, the single broker could become a bottleneck or a single point of failure. But it is useful when a single point of communication is needed.

Getting ready
Go to the Kafka installation directory (/usr/local/kafka/ for macOS users and /opt/kafka/ for Linux users): > cd /usr/local/kafka

How to do it...
The diagram shows an example of an SNSB cluster.

Starting ZooKeeper
Kafka provides a simple ZooKeeper configuration file to launch a single ZooKeeper instance. To install the ZooKeeper instance, use this command: > bin/zookeeper-server-start.sh config/zookeeper.properties The main properties specified in the zookeeper.properties file are: clientPort: This is the listening port for client requests. By default, ZooKeeper listens on TCP port 2181: clientPort=2181 dataDir: This is the directory where ZooKeeper data is stored: dataDir=/tmp/zookeeper maxClientCnxns: This is the maximum number of client connections (0 means unbounded): maxClientCnxns=0 For more information about Apache ZooKeeper visit the project home page at: http://zookeeper.apache.org/.

Starting the broker
After ZooKeeper is started, start the Kafka broker with this command: > bin/kafka-server-start.sh config/server.properties The main properties specified in the server.properties file are: broker.id: The unique positive integer identifier for each broker: broker.id=0 log.dir: Directory to store log files: log.dir=/tmp/kafka10-logs num.partitions: The number of log partitions per topic: num.partitions=2 port: The port that the socket server listens on: port=9092 zookeeper.connect: The ZooKeeper URL connection: zookeeper.connect=localhost:2181

How it works
Kafka uses ZooKeeper for storing metadata information about the brokers, topics, and partitions. Writes to ZooKeeper are performed only on changes of consumer group membership or on changes to the Kafka cluster itself. This amount of traffic is minimal, and there is no need for a dedicated ZooKeeper ensemble for a single Kafka cluster. Actually, many deployments use a single ZooKeeper ensemble to control multiple Kafka clusters (using a chroot ZooKeeper path for each cluster).

SNSB – creating a topic, producer, and consumer
The SNSB Kafka cluster is running; now let's create topics, producer, and consumer.
Getting ready
We need the previous recipe executed: Kafka already installed, ZooKeeper up and running, and a Kafka server up and running. Now, go to the Kafka installation directory (/usr/local/kafka/ for macOS users and /opt/kafka/ for Linux users): > cd /usr/local/kafka

How to do it
The following steps will show you how to create an SNSB topic, producer, and consumer.

Creating a topic
As we know, Kafka has a command to create topics. Here we create a topic called SNSBTopic with one partition and one replica: > bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic SNSBTopic We obtain the following output: Created topic "SNSBTopic". The command parameters are: --replication-factor 1: This indicates just one replica --partitions 1: This indicates just one partition --zookeeper localhost:2181: This indicates the ZooKeeper URL As we know, to get the list of topics on a Kafka server we use the following command: > bin/kafka-topics.sh --list --zookeeper localhost:2181 We obtain the following output: SNSBTopic

Starting the producer
Kafka has a command to start producers that accepts input from the command line and publishes each input line as a message. By default, each new line is considered a message: > bin/kafka-console-producer.sh --broker-list localhost:9092 --topic SNSBTopic This command requires two parameters: broker-list: The broker URL to connect to topic: The topic name (to send a message to the topic subscribers) Now, type the following in the command line: The best thing about a boolean is [Enter] even if you are wrong [Enter] you are only off by a bit. [Enter] This output is obtained (as expected): The best thing about a boolean is even if you are wrong you are only off by a bit. The producer.properties file has the producer configuration. Some important properties defined in the producer.properties file are: metadata.broker.list: The list of brokers used for bootstrapping information on the rest of the cluster in the format host1:port1, host2:port2: metadata.broker.list=localhost:9092 compression.codec: The compression codec used. For example, none, gzip, and snappy: compression.codec=none

Starting the consumer
Kafka has a command to start a message consumer client. It shows the output in the command line as soon as it has subscribed to the topic: > bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic SNSBTopic --from-beginning Note that the parameter from-beginning is to show the entire log: The best thing about a boolean is even if you are wrong you are only off by a bit. One important property defined in the consumer.properties file is: group.id: This string identifies the consumers in the same group: group.id=test-consumer-group

There's more
It is time to play with this technology. Open a new command-line window for ZooKeeper, a broker, two producers, and two consumers. Type some messages in the producers and watch them get displayed in the consumers. If you don't know or don't remember how to run the commands, run them with no arguments to display the possible values for the parameters.

Configuring a single-node multiple-broker cluster – SNMB
The second cluster configuration is single-node multiple-broker (SNMB). This cluster is used when there is just one node but inner redundancy is needed. When a topic is created in Kafka, the system determines how each replica of a partition is mapped to each broker. In general, Kafka tries to spread the replicas across all available brokers.
The messages are first sent to the first replica of a partition (to the current broker leader of that partition) before they are replicated to the remaining brokers. The producers may choose from different strategies for sending messages (synchronous or asynchronous mode). Producers discover the available brokers in a cluster and the partitions on each (all this by registering watchers in ZooKeeper). In practice, some of the high-volume topics are configured with more than one partition per broker. Remember that having more partitions increases the I/O parallelism for writes and this increases the degree of parallelism for consumers (the partition is the unit for distributing data to consumers). On the other hand, increasing the number of partitions increases the overhead because: there are more files, so more open file handlers; and there are more offsets to be checked by consumers, so the ZooKeeper load is increased. The art of this is to balance these tradeoffs.

Getting ready
Go to the Kafka installation directory (/usr/local/kafka/ for macOS users and /opt/kafka/ for Linux users): > cd /usr/local/kafka The following diagram shows an example of an SNMB cluster.

How to do it
Begin by starting the ZooKeeper server as follows: > bin/zookeeper-server-start.sh config/zookeeper.properties A different server.properties file is needed for each broker. Let's call them: server-1.properties, server-2.properties, server-3.properties, and so on (original, isn't it?). Each file is a copy of the original server.properties file. In the server-1.properties file set the following properties: broker.id=1 port=9093 log.dir=/tmp/kafka-logs-1 Similarly, in the server-2.properties file set the following properties: broker.id=2 port=9094 log.dir=/tmp/kafka-logs-2 Finally, in the server-3.properties file set the following properties: broker.id=3 port=9095 log.dir=/tmp/kafka-logs-3 With ZooKeeper running, start the Kafka brokers with these commands: > bin/kafka-server-start.sh config/server-1.properties > bin/kafka-server-start.sh config/server-2.properties > bin/kafka-server-start.sh config/server-3.properties

How it works
Now the SNMB cluster is running. The brokers are running on the same Kafka node, on ports 9093, 9094, and 9095.

SNMB – creating a topic, producer, and consumer
The SNMB Kafka cluster is running; now let's create topics, producer, and consumer.

Getting ready
We need the previous recipe executed: Kafka already installed, ZooKeeper up and running, and a Kafka server up and running. Now, go to the Kafka installation directory (/usr/local/kafka/ for macOS users and /opt/kafka/ for Linux users): > cd /usr/local/kafka

How to do it
The following steps will show you how to create an SNMB topic, producer, and consumer.

Creating a topic
Using the command to create topics, let's create a topic called SNMBTopic with three partitions and two replicas: > bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 2 --partitions 3 --topic SNMBTopic The following output is displayed: Created topic "SNMBTopic" This command has the following effects: Kafka will create three logical partitions for the topic. Kafka will create two replicas (copies) per partition. This means, for each partition, it will pick two brokers that will host those replicas. For each partition, Kafka will randomly choose a broker leader. Now ask Kafka for the list of available topics.
The list now includes the new SNMBTopic: > bin/kafka-topics.sh --zookeeper localhost:2181 --list SNMBTopic

Starting a producer
Now, start the producers; indicating more brokers in the broker-list is easy: > bin/kafka-console-producer.sh --broker-list localhost:9093,localhost:9094,localhost:9095 --topic SNMBTopic If it's necessary to run multiple producers connecting to different brokers, specify a different broker list for each producer.

Starting a consumer
To start a consumer, use the following command: > bin/kafka-console-consumer.sh --zookeeper localhost:2181 --from-beginning --topic SNMBTopic

How it works
The first important fact is the two parameters: replication-factor and partitions. The replication-factor is the number of replicas each partition will have in the topic created. The partitions parameter is the number of partitions for the topic created.

There's more
If you don't know the cluster configuration or don't remember it, there is a useful option for the kafka-topics command, the describe parameter: > bin/kafka-topics.sh --zookeeper localhost:2181 --describe --topic SNMBTopic The output is something similar to: Topic:SNMBTopic PartitionCount:3 ReplicationFactor:2 Configs: Topic: SNMBTopic Partition: 0 Leader: 2 Replicas: 2,3 Isr: 3,2 Topic: SNMBTopic Partition: 1 Leader: 3 Replicas: 3,1 Isr: 1,3 Topic: SNMBTopic Partition: 2 Leader: 1 Replicas: 1,2 Isr: 1,2 An explanation of the output: the first line gives a summary of all the partitions; each subsequent line gives information about one partition. Since we have three partitions for this topic, there are three such lines: Leader: This node is responsible for all reads and writes for a particular partition. Each node is the leader for a randomly selected portion of the partitions. Replicas: This is the list of nodes that duplicate the log for a particular partition, irrespective of whether it is currently alive. Isr: This is the set of in-sync replicas. It is a subset of the replicas that are currently alive and following the leader. In order to see the options for create, delete, describe, or change a topic, type this command without parameters: > bin/kafka-topics.sh We discussed how to implement Apache Kafka clusters effectively. If you liked this post, be sure to check out Apache Kafka 1.0 Cookbook, which consists of useful recipes to work with Apache Kafka installation.
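As a brief programmatic complement to the console producer and consumer used above, the sketch below sends and reads one message with the third-party kafka-python client. The broker address and the topic name SNSBTopic are taken from the SNSB recipe, but using kafka-python itself is our assumption rather than part of the cookbook excerpt.

```python
from kafka import KafkaProducer, KafkaConsumer

# Produce a single message to the SNSB broker started earlier.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("SNSBTopic", b"The best thing about a boolean is ...")
producer.flush()

# Consume messages from the beginning of the topic.
consumer = KafkaConsumer(
    "SNSBTopic",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating after 5 seconds of silence
)
for message in consumer:
    print(message.value.decode("utf-8"))
```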

FAT* 2018 Conference Session 2 Summary: Interpretability and Explainability

Savia Lobo
22 Feb 2018
5 min read
This session of the FAT* 2018 is about interpretability and explainability in machine learning models. With the advances in Deep learning, machine learning models have become more accurate. However, with accuracy and advancements, it is a tough task to keep the models highly explainable. This means, these models may appear as black boxes to business users, who utilize them without knowing what lies within. Thus, it is equally important to make ML models interpretable and explainable, which can be beneficial and essential for understanding ML models and to have a ‘behind the scenes’ knowledge of what’s happening within them. This understanding can be highly essential for heavily regulated industries like Finance, Medicine, Defence and so on. The Conference on Fairness, Accountability, and Transparency (FAT), which would be held on the 23rd and 24th of February, 2018 is a multi-disciplinary conference that brings together researchers and practitioners interested in fairness, accountability, and transparency in socio-technical systems. The FAT 2018 conference will witness 17 research papers, 6 tutorials, and 2 keynote presentations from leading experts in the field. This article covers research papers pertaining to the 2nd session that is dedicated to Interpretability and Explainability of machine-learned decisions. If you’ve missed our summary of the 1st session on Online Discrimination and Privacy, visit the article link for a catch up. Paper 1: Meaningful Information and the Right to Explanation This paper addresses an active debate in policy, industry, academia, and the media about whether and to what extent Europe’s new General Data Protection Regulation (GDPR) grants individuals a “right to explanation” of automated decisions. The paper explores two major papers, European Union Regulations on Algorithmic Decision Making and a “Right to Explanation” by Goodman and Flaxman (2017) Why a Right to Explanation of Automated Decision-Making Does Not Exist in the General Data Protection Regulation by Wachter et al. (2017) This paper demonstrates that the specified framework is built on incorrect legal and technical assumptions. In addition to responding to the existing scholarly contributions, the article articulates a positive conception of the right to explanation, located in the text and purpose of the GDPR. The authors take a position that the right should be interpreted functionally, flexibly, and should, at a minimum, enable a data subject to exercise his or her rights under the GDPR and human rights law. Key takeaways: The first paper by Goodman and Flaxman states that GDPR creates a "right to explanation" but without any argument. The second paper is in response to the first paper, where Watcher et al. have published an extensive critique, arguing against the existence of such a right. The current paper, on the other hand, is partially concerned with responding to the arguments of Watcher et al. Paper 2: Interpretable Active Learning The paper tries to highlight how due to complex and opaque ML models, the process of active learning has also become opaque. Not much has been known about what specific trends and patterns, the active learning strategy may be exploring. The paper expands on explaining about LIME (Local Interpretable Model-agnostic Explanations framework) to provide explanations for active learning recommendations. The authors, Richard Phillips, Kyu Hyun Chang, and Sorelle A. 
Friedler, demonstrate uses of LIME in generating locally faithful explanations for an active learning strategy. Further, the paper shows how these explanations can be used to understand how different models and datasets explore a problem space over time. Key takeaways: The paper demonstrates how active learning choices can be made more interpretable to non-experts. It also discusses techniques that make active learning interpretable to expert labelers, so that queries and query batches can be explained and the uncertainty bias can be tracked via interpretable clusters. It showcases per-query explanations of uncertainty to develop a system that allows experts to choose whether to label a query. This will allow them to incorporate domain knowledge and their own interests into the labeling process. It introduces a quantified notion of uncertainty bias, the idea that an algorithm may be less certain about its decisions on some data clusters than others. Paper 3: Interventions over Predictions: Reframing the Ethical Debate for Actuarial Risk Assessment Actuarial risk assessments might be unduly perceived as a neutral way to counteract implicit bias and increase the fairness of decisions made within the criminal justice system, from pretrial release to sentencing, parole, and probation. However, recently, these assessments have come under increased scrutiny, as critics claim that the statistical techniques underlying them might reproduce existing patterns of discrimination and historical biases that are reflected in the data. The paper proposes that machine learning should not be used for prediction, but rather to surface covariates that are fed into a causal model for understanding the social, structural and psychological drivers of crime. The authors, Chelsea Barabas, Madars Virza, Karthik Dinakar, Joichi Ito (MIT), Jonathan Zittrain (Harvard),  propose an alternative application of machine learning and causal inference away from predicting risk scores to risk mitigation. Key takeaways: The paper gives a brief overview of how risk assessments have evolved from a tool used solely for prediction to one that is diagnostic at its core. The paper places a debate around risk assessment in a broader context. One can get a fuller understanding of the way these actuarial tools have evolved to achieve a varied set of social and institutional agendas. It argues for a shift away from predictive technologies, towards diagnostic methods that will help in understanding the criminogenic effects of the criminal justice system itself, as well as evaluate the effectiveness of interventions designed to interrupt cycles of crime. It proposes that risk assessments, when viewed as a diagnostic tool, can be used to understand the underlying social, economic and psychological drivers of crime. The authors also posit that causal inference offers the best framework for pursuing the goals to achieve a fair and ethical risk assessment tool.

How to perform Numeric Metric Aggregations with Elasticsearch

Pravin Dhandre
22 Feb 2018
7 min read
[box type="note" align="" class="" width=""]This article is an excerpt from the book Learning Elastic Stack 6.0 written by Pranav Shukla and Sharath Kumar M N . This book provides detailed coverage on fundamentals of each components of Elastic Stack, making it easy to search, analyze and visualize data across different sources in real-time.[/box] Today, we are going to demonstrate how to run numeric and statistical queries such as summation, average, count and various similar metric aggregations on Elastic Stack to serve a better analytics engine on your dataset. Metric aggregations   Metric aggregations work with numeric data, computing one or more aggregate metrics within the given context. The context could be a query, filter, or no query to include the whole index/type. Metric aggregations can also be nested inside other bucket aggregations. In this case, these metrics will be computed for each bucket in the bucket aggregations. We will start with simple metric aggregations without nesting them inside bucket aggregations. When we learn about bucket aggregations later in the chapter, we will also learn how to use metric aggregations inside bucket aggregations. We will learn about the following metric aggregations: Sum, average, min, and max aggregations Stats and extended stats aggregations Cardinality aggregation Let us learn about them one by one. Sum, average, min, and max aggregations Finding the sum of a field, the minimum value for a field, the maximum value for a field, or an average, are very common operations. For the people who are familiar with SQL, the query to find the sum would look like the following: SELECT sum(downloadTotal) FROM usageReport; The preceding query will calculate the sum of the downloadTotal field across all records in the table. This requires going through all records of the table or all records in the given context and adding the values of the given fields. In Elasticsearch, a similar query can be written using the sum aggregation. Let us understand the sum aggregation first. Sum aggregation Here is how to write a simple sum aggregation: GET bigginsight/_search { "aggregations": { 1 "download_sum": { 2 "sum": { 3 "field": "downloadTotal" 4 } } }, "size": 0 5 } The aggs or aggregations element at the top level should wrap any aggregation. Give a name to the aggregation; here we are doing the sum aggregation on the downloadTotal field and hence the name we chose is download_sum. You can name it anything. This field will be useful while looking up this particular aggregation's result in the response. We are doing a sum aggregation, hence the sum element. We want to do term aggregation on the downloadTotal field. Specify size = 0 to prevent raw search results from being returned. We just want aggregation results and not the search results in this case. Since we haven't specified any top level query elements, it matches all documents. We do not want any raw documents (or search hits) in the result. The response should look like the following: { "took": 92, ... "hits": { "total": 242836, 1 "max_score": 0, "hits": [] }, "aggregations": { 2 "download_sum": { 3 "value": 2197438700 4 } } } Let us understand the key aspects of the response. The key parts are numbered 1, 2, 3, and so on, and are explained in the following points: The hits.total element shows the number of documents that were considered or were in the context of the query. If there was no additional query or filter specified, it will include all documents in the type or index. 
Just like the request, this response is wrapped inside aggregations to indicate as Such. The response of the aggregation requested by us was named download_sum, hence we get our response from the sum aggregation inside an element with the same name. The actual value after applying the sum aggregation. The average, min, and max aggregations are very similar. Let's look at them briefly. Average aggregation The average aggregation finds an average across all documents in the querying context: GET bigginsight/_search { "aggregations": { "download_average": { 1 "avg": { 2 "field": "downloadTotal" } } }, "size": 0 } The only notable differences from the sum aggregation are as follows: We chose a different name, download_average, to make it apparent that the aggregation is trying to compute the average. The type of aggregation that we are doing is avg instead of the sum aggregation that we were doing earlier. The response structure is identical but the value field will now represent the average of the requested field. The min and max aggregations are the exactly same. Min aggregation Here is how we will find the minimum value of the downloadTotal field in the entire index/type: GET bigginsight/_search { "aggregations": { "download_min": { "min": { "field": "downloadTotal" } } }, "size": 0 } Let's finally look at max aggregation also. Max aggregation Here is how we will find the maximum value of the downloadTotal field in the entire index/type: GET bigginsight/_search { "aggregations": { "download_max": { "max": { "field": "downloadTotal" } } }, "size": 0 } These aggregations were really simple. Now let's look at some more advanced yet simple stats and extended stats aggregations. Stats and extended stats aggregations These aggregations compute some common statistics in a single request without having to issue multiple requests. This saves resources on the Elasticsearch side as well because the statistics are computed in a single pass rather than being requested multiple times. The client code also becomes simpler if you are interested in more than one of these statistics. Let's look at the stats aggregation first. Stats aggregation The stats aggregation computes the sum, average, min, max, and count of documents in a single pass: GET bigginsight/_search { "aggregations": { "download_stats": { "stats": { "field": "downloadTotal" } } }, "size": 0 } The structure of the stats request is the same as the other metric aggregations we have seen so far, so nothing special is going on here. The response should look like the following: { "took": 4, ..., "hits": { "total": 242836, "max_score": 0, "hits": [] }, "aggregations": { "download_stats": { "count": 242835, "min": 0, "max": 241213, "avg": 9049.102065188297, "sum": 2197438700 } } } As you can see, the response with the download_stats element contains count, min, max, average, and sum; everything is included in the same response. This is very handy as it reduces the overhead of multiple requests and also simplifies the client code. Let us look at the extended stats aggregation. 
Extended stats Aggregation The extended stats aggregation returns a few more statistics in addition to the ones returned by the stats aggregation: GET bigginsight/_search { "aggregations": { "download_estats": { "extended_stats": { "field": "downloadTotal" } } }, "size": 0 } The response looks like the following: { "took": 15, "timed_out": false, ..., "hits": { "total": 242836, "max_score": 0, "hits": [] }, "aggregations": { "download_estats": { "count": 242835, "min": 0, "max": 241213, "avg": 9049.102065188297, "sum": 2197438700, "sum_of_squares": 133545882701698, "variance": 468058704.9782911, "std_deviation": 21634.664429528162, "std_deviation_bounds": { "upper": 52318.43092424462, "lower": -34220.22679386803 } } } } It also returns the sum of squares, variance, standard deviation, and standard deviation Bounds. Cardinality aggregation Finding the count of unique elements can be done with the cardinality aggregation. It is similar to finding the result of a query such as the following: select count(*) from (select distinct username from usageReport) u; Finding the cardinality or the number of unique values for a specific field is a very common requirement. If you have click-stream from the different visitors on your website, you may want to find out how many unique visitors you got in a given day, week, or month. Let us understand how we find out the count of unique users for which we have network traffic data: GET bigginsight/_search { "aggregations": { "unique_visitors": { "cardinality": { "field": "username" } } }, "size": 0 } The cardinality aggregation response is just like the other metric aggregations: { "took": 110, ..., "hits": { "total": 242836, "max_score": 0, "hits": [] }, "aggregations": { "unique_visitors": { "value": 79 } } } To summarize, we learned how to perform numerous metric aggregations on numeric datasets and easily deploy elasticsearch in building powerful analytics application. If you found this tutorial useful, do check out the book Learning Elastic Stack 6.0 to examine the fundamentals of Elastic Stack in detail and start developing solutions for problems like logging, site search, app search, metrics and more.      
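To round this off, here is a rough sketch of issuing the same kind of request from Python with the official elasticsearch client package; the client, the localhost URL, and the body-style search call are assumptions on our part (the body parameter matches older client releases), while the index and field names are the ones used throughout this excerpt.

# Rough sketch with the official Python client (pip install elasticsearch).
# The body= style of es.search() matches older client versions; adjust for yours.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

query = {
    "size": 0,  # we only want aggregation results, not raw search hits
    "aggregations": {
        "download_stats": {"stats": {"field": "downloadTotal"}},
        "unique_visitors": {"cardinality": {"field": "username"}},
    },
}

response = es.search(index="bigginsight", body=query)

stats = response["aggregations"]["download_stats"]
print("count:", stats["count"], "avg:", stats["avg"], "sum:", stats["sum"])
print("unique visitors:", response["aggregations"]["unique_visitors"]["value"])

Note that several aggregations can be sent in a single request, which keeps the number of round trips to the cluster down, just as the stats family of aggregations does.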

Implementing gradient descent algorithm to solve optimization problems

Sunith Shetty
22 Feb 2018
7 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Rajdeep Dua and Manpreet Singh Ghotra titled Neural Network Programming with Tensorflow. In this book, you will learn to leverage the power of TensorFlow to train neural networks of varying complexities, without any hassle.[/box] Today we will focus on the gradient descent algorithm and its different variants. We will take a simple example of linear regression to solve the optimization problem. Gradient descent is the most successful optimization algorithm. As mentioned earlier, it is used to do weights updates in a neural network so that we minimize the loss function. Let's now talk about an important neural network method called backpropagation, in which we firstly propagate forward and calculate the dot product of inputs with their corresponding weights, and then apply an activation function to the sum of products which transforms the input to an output and adds non linearities to the model, which enables the model to learn almost any arbitrary functional mappings. Later, we back propagate in the neural network, carrying error terms and updating weights values using gradient descent, as shown in the following graph: Different variants of gradient descent Standard gradient descent, also known as batch gradient descent, will calculate the gradient of the whole dataset but will perform only one update. Therefore, it can be quite slow and tough to control for datasets which are extremely large and don't fit in the memory. Let's now look at algorithms that can solve this problem. Stochastic gradient descent (SGD) performs parameter updates on each training example, whereas mini batch performs an update with n number of training examples in each batch. The issue with SGD is that, due to the frequent updates and fluctuations, it eventually complicates the convergence to the accurate minimum and will keep exceeding due to regular fluctuations. Mini-batch gradient descent comes to the rescue here, which reduces the variance in the parameter update, leading to a much better and stable convergence. SGD and mini-batch are used interchangeably. Overall problems with gradient descent include choosing a proper learning rate so that we avoid slow convergence at small values, or divergence at larger values and applying the same learning rate to all parameter updates wherein if the data is sparse we might not want to update all of them to the same extent. Lastly, is dealing with saddle points. Algorithms to optimize gradient descent We will now be looking at various methods for optimizing gradient descent in order to calculate different learning rates for each parameter, calculate momentum, and prevent decaying learning rates. To solve the problem of high variance oscillation of the SGD, a method called momentum was discovered; this accelerates the SGD by navigating along the appropriate direction and softening the oscillations in irrelevant directions. Basically, it adds a fraction of the update vector of the past step to the current update vector. Momentum value is usually set to .9. Momentum leads to a faster and stable convergence with reduced oscillations. Nesterov accelerated gradient explains that as we reach the minima, that is, the lowest point on the curve, momentum is quite high and it doesn't know to slow down at that point due to the large momentum which could cause it to miss the minima entirely and continue moving up. 
Nesterov proposed that we first make a long jump based on the previous momentum, then calculate the gradient and then make a correction which results in a parameter update. Now, this update prevents us to go too fast and not miss the minima, and makes it more responsive to changes. Adagrad allows the learning rate to adapt based on the parameters. Therefore, it performs large updates for infrequent parameters and small updates for frequent parameters. Therefore, it is very well-suited for dealing with sparse data. The main flaw is that its learning rate is always decreasing and decaying. Problems with decaying learning rates are solved using AdaDelta. AdaDelta solves the problem of decreasing learning rate in AdaGrad. In AdaGrad, the learning rate is computed as one divided by the sum of square roots. At each stage, we add another square root to the sum, which causes the denominator to decrease constantly. Now, instead of summing all prior square roots, it uses a sliding window which allows the sum to decrease. Adaptive Moment Estimation (Adam) computes adaptive learning rates for each parameter. Like AdaDelta, Adam not only stores the decaying average of past squared gradients but additionally stores the momentum change for each parameter. Adam works well in practice and is one of the most used optimization methods today. The following two images (image credit: Alec Radford) show the optimization behavior of optimization algorithms described earlier. We see their behavior on the contours of a loss surface over time. Adagrad, RMsprop, and Adadelta almost quickly head off in the right direction and converge fast, whereas momentum and NAG are headed off-track. NAG is soon able to correct its course due to its improved responsiveness by looking ahead and going to the minimum. The second image displays the behavior of the algorithms at a saddle point. SGD, Momentum, and NAG find it challenging to break symmetry, but slowly they manage to escape the saddle point, whereas Adagrad, Adadelta, and RMsprop head down the negative slope, as can seen from the following image: Which optimizer to choose In the case that the input data is sparse or if we want fast convergence while training complex neural networks, we get the best results using adaptive learning rate methods. We also don't need to tune the learning rate. For most cases, Adam is usually a good choice. Optimization with an example Let's take an example of linear regression, where we try to find the best fit for a straight line through a number of data points by minimizing the squares of the distance from the line to each data point. This is why we call it least squares regression. Essentially, we are formulating the problem as an optimization problem, where we are trying to minimize a loss function. 
Let's start by importing the required packages (NumPy, TensorFlow, and Matplotlib) and setting up the input data: import numpy as np import tensorflow as tf import matplotlib.pyplot as plt # input data xData = np.arange(100, step=.1) yData = xData + 20 * np.sin(xData/10) Define the data size and batch size: # define the data size and batch size nSamples = 1000 batchSize = 100 We will need to resize the data to meet the TensorFlow input format, as follows: # resize input for tensorflow xData = np.reshape(xData, (nSamples, 1)) yData = np.reshape(yData, (nSamples, 1)) We also need placeholders for the inputs and targets that will be fed in mini-batches: X = tf.placeholder(tf.float32, shape=(None, 1)) y = tf.placeholder(tf.float32, shape=(None, 1)) The following scope initializes the weights and bias, and describes the linear model and loss function: with tf.variable_scope("linear-regression-pipeline"): W = tf.get_variable("weights", (1, 1), initializer=tf.random_normal_initializer()) b = tf.get_variable("bias", (1, ), initializer=tf.constant_initializer(0.0)) # model yPred = tf.matmul(X, W) + b # loss function loss = tf.reduce_sum((y - yPred)**2/nSamples) We then set an optimizer for minimizing the loss (several alternatives are shown commented out): # set the optimizer #optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(loss) #optimizer = tf.train.AdamOptimizer(learning_rate=.001).minimize(loss) #optimizer = tf.train.AdadeltaOptimizer(learning_rate=.001).minimize(loss) #optimizer = tf.train.AdagradOptimizer(learning_rate=.001).minimize(loss) #optimizer = tf.train.MomentumOptimizer(learning_rate=.001, momentum=0.9).minimize(loss) #optimizer = tf.train.FtrlOptimizer(learning_rate=.001).minimize(loss) optimizer = tf.train.RMSPropOptimizer(learning_rate=.001).minimize(loss) We then select the mini-batches and run the optimizer: errors = [] with tf.Session() as sess: # init variables sess.run(tf.global_variables_initializer()) for _ in range(1000): # select mini batch indices = np.random.choice(nSamples, batchSize) xBatch, yBatch = xData[indices], yData[indices] # run optimizer _, lossVal = sess.run([optimizer, loss], feed_dict={X: xBatch, y: yBatch}) errors.append(lossVal) plt.plot([np.mean(errors[i-50:i]) for i in range(len(errors))]) plt.savefig("errors.png") plt.show() The output of the preceding code is a plot of the training error, smoothed over a sliding window of the last 50 batches. We learned that optimization is a complicated subject and a lot depends on the nature and size of our data. Also, optimization depends on weight matrices. A lot of these optimizers are trained and tuned for tasks like image classification or predictions. However, for custom or new use cases, we need to perform trial and error to determine the best solution. To know more about how to build and optimize neural networks using TensorFlow, do check out the book Neural Network Programming with Tensorflow.  
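To make the update rules themselves less abstract, here is a small NumPy-only sketch of mini-batch gradient descent with classical momentum, applied to the same least-squares problem as above. The learning rate and momentum coefficient are illustrative values chosen for this sketch, not tuned settings from the book.

# Mini-batch gradient descent with classical momentum, in plain NumPy.
import numpy as np

xData = np.arange(100, step=.1).reshape(-1, 1)
yData = xData + 20 * np.sin(xData / 10)

nSamples, batchSize = 1000, 100
lr, beta = 0.0001, 0.9            # learning rate and momentum coefficient
w, b = np.zeros((1, 1)), 0.0      # parameters of the line
vW, vB = np.zeros((1, 1)), 0.0    # velocity (accumulated past gradients)

for step in range(1000):
    idx = np.random.choice(nSamples, batchSize)
    xBatch, yBatch = xData[idx], yData[idx]

    err = xBatch @ w + b - yBatch             # prediction error on the batch
    gradW = 2.0 / batchSize * xBatch.T @ err  # gradients of the mean squared error
    gradB = 2.0 / batchSize * err.sum()

    vW = beta * vW + lr * gradW               # momentum: decayed sum of past steps
    vB = beta * vB + lr * gradB
    w -= vW
    b -= vB

print("fitted slope %.3f, intercept %.3f" % (w[0, 0], b))

Swapping the two velocity lines for a plain w -= lr * gradW update turns this back into vanilla mini-batch SGD, which makes it easy to see how the momentum term smooths and speeds up convergence.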

How to build data streams in IBM SPSS Modeler

Amey Varangaonkar
22 Feb 2018
7 min read
[box type="note" align="" class="" width=""]The following excerpt is taken from the book IBM SPSS Modeler Essentials, co-authored by Keith McCormick and Jesus Salcedo. This book gives you a quick overview of the fundamental concepts of data mining and how to put them to practical use with the help of SPSS Modeler.[/box] SPSS Modeler allows users to mine data visually on the stream canvas. This means that you will not be writing code for your data mining projects; instead you will be placing nodes on the stream canvas. Remember that nodes represent operations to be carried out on the data. So once nodes have been placed on the stream canvas, they need to be linked together to form a stream. A stream represents the flow of data going through a number of operations (nodes). The following diagram is an example of nodes on the canvas, as well as a stream: Given that you will spend a lot of time building streams, in this section you will learn the most efficient ways of manipulating nodes to create a stream. Mouse buttons When building streams, mouse buttons are used extensively so that nodes can be brought onto the canvas, connected, edited, and so on. When building streams within Modeler, mouse buttons are used in the following ways: The left button is used for selecting, placing, and positioning nodes on the stream Canvas The right button is used for invoking context (pop-up) menus that allow for editing, connecting, renaming, deleting, and running nodes The middle button (optional) is used for connecting and disconnecting nodes Adding nodes To begin a new stream, a node from the Sources palette needs to be placed on the stream canvas. There are three ways to add nodes to a stream from a palette: Method one: Click on palette and then on stream: Click on the Sources palette. Click on the Var. File node. This will cause the icon to be highlighted. Move the cursor over the stream canvas. Click anywhere in the stream canvas. A copy of the selected node now appears on the stream canvas. This node represents the action of reading data into Modeler from a delimited text data file. If you wish to move the node within the stream canvas, select it by clicking on the node, and while holding the left mouse button down, drag the node to a new position. Method two: Drag and drop: Now go back to the Sources palette. Click on the Statistics File node and drag and drop this node onto the canvas. The Statistics File node represents the action of reading data into Modeler from an IBM SPSS Statistics data file. Method three: Double-click: Go back to the Sources palette one more time. Double click on the Database node. The Database node represents the action of reading data into Modeler from an ODBC compliant database. Editing nodes Once a node has been brought onto the stream canvas, typically at this point you will want to edit the node so that you can specify which fields, cases, or files you want the node to apply to. There are two ways to edit a node. Method one: Right-click on a node: Right-click on the Var. File node: Notice that there are many things you can do within this context menu. You can edit, add comments, copy, delete a node, connect nodes, and so on. Most often you will probably either edit the node or connect nodes. Method two: Double-click on a node: Double-click on the Var. File node. This bypasses the context menu we saw previously, and goes directly into the node itself so we can edit it. 
Deleting nodes There will be times when you will want to delete a node that you have on the stream canvas. There are two ways to delete a node. Method one: Right-click on a node: Right-click on the Database File node. Select Delete. The node is now deleted. Method two: Use the Delete button from the keyboard: Click on the Statistics File node. Click on the Delete button on the keyboard. Building a stream When two or more nodes have been placed on the stream canvas, they need to be connected to produce a stream. This can be thought of as representing the flow of data through the nodes. To demonstrate this, we will place a Table node on the stream canvas next to the Var. File node. The Table node presents data in a table format. Click the Output palette to activate it. Click on the Table node. Place this node to the right of the Var. File node by clicking in the stream canvas: At this point, we now have two nodes on the stream canvas, however, we technically do not have a stream because the nodes are not speaking to each other (that is, they are not connected). Connecting nodes In order for nodes to work together, they must be connected. Connecting nodes allows you to bring data into Modeler, explore the data, manipulate the data (to either clean it up or create additional fields), build a model, evaluate the model, and ultimately score the data. There are three main ways to connect nodes to form a stream that is, double-clicking, using the middle mouse button, or manually: Method one: Double-click. The simplest way to form a stream is to double-click on nodes on a palette. This method automatically connects the new node to the currently selected node on the stream canvas: Select the Var. File node that is on the stream canvas Double-click the Table node from the Output palette This action automatically connects the Table node to the existing Var. File node, and a connecting arrow appears between the nodes. The head of the arrow indicates the direction of the data flow. Method two: Manually. To manually connect two nodes: Bring a Table node onto the canvas. Right-click on the Var. File node. Select Connect from the context menu. Click the Table node. Method three: Middle mouse button. To use the middle mouse button: Bring a Table node onto the canvas. Use the middle mouse button to click on the Var. File node. While holding the middle mouse button down, drag the cursor over to the Table node. Release the middle mouse button. Deleting connections When you know that you are no longer going to use a node, you can delete it. Often, though, you may not want to delete a node; instead you might want to delete a connection. Deleting a node completely gets rid of the node. Deleting a connection allows you to keep a node with all the edits you have done, but for now the unconnected node will not be part of the stream. Nodes can be disconnected in several ways: Method one: Delete the connecting arrow: Right-click on the connecting arrow. Click Delete Connection. Method two: Right-click on a node: Right-click on one of the nodes that has a connection. Select Disconnect from the Context menu. Method three: Double-clicking: Double-click with the middle mouse button on a node that has a connection. All connections to this node will be severed, but the connections to neighboring nodes will be intact. Thus, we saw it’s fairly easy to build and manage data streams in SPSS Modeler. 
If you found the above excerpt useful, make sure to check out our book IBM SPSS Modeler Essentials for more tips and tricks on effective data mining.    
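Everything we just did with the mouse can also be scripted, since Modeler ships with an embedded Python (Jython) scripting facility. The sketch below shows roughly how the stream above might be created from a stream script; the node type names ("variablefile", "table"), the property name full_filename, the file path, and the createAt/link calls are assumptions based on the Modeler scripting API, so check the scripting guide for your Modeler version before relying on them.

# Hypothetical Modeler stream-script sketch (runs inside Modeler's scripting tab).
# Node type and property names below are assumptions; verify them in the
# scripting and automation guide for your Modeler version.
stream = modeler.script.stream()               # the currently open stream

# Place a Var. File source node and point it at an example data file
varfile = stream.createAt("variablefile", "Var. File", 96, 96)
varfile.setPropertyValue("full_filename", "C:/data/customers.csv")

# Place a Table output node to the right of the source node
table = stream.createAt("table", "Table", 288, 96)

# Connecting the nodes is the scripted equivalent of drawing the arrow
stream.link(varfile, table)

The x and y arguments of createAt simply control where the node appears on the canvas, mirroring the drag-and-drop placement described above.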

How to combine data files within IBM SPSS Modeler

Amey Varangaonkar
22 Feb 2018
6 min read
[box type="note" align="" class="" width=""]The following extract is taken from the book IBM SPSS Modeler Essentials, written by Keith McCormick and Jesus Salcedo. SPSS Modeler is one of the popularly used enterprise tools for data mining and predictive analytics. [/box] In this article, we will explore how SPSS Modeler can be effectively used to combine different file types for efficient data modeling. In many organizations, different pieces of information for the same individuals are held in separate locations. To be able to analyze such information within Modeler, the data files must be combined into one single file. The Merge node joins two or more data sources so that information held for an individual in different locations can be analyzed collectively. The following diagram shows how the Merge node can be used to combine two separate data files that contain different types of information: Like the Append node, the Merge node is found in the Record Ops palette. This node takes multiple data sources and creates a single source containing all or some of the input fields. Let's go through an example of how to use the Merge node to combine data files: Open the Merge stream. The Merge stream contains the files we previously appended, as well as the main data file we were working with in earlier chapters. 2. Place a Merge node from the Record Ops palette on the canvas. 3. Connect the last Reclassify node to the Merge node. 4. Connect the Filter node to the Merge node. [box type="info" align="" class="" width=""]Like the Append node, the order in which data sources are connected to the Merge node impacts the order in which the sources are displayed. The fields of the first source connected to the Merge node will appear first, followed by the fields of the second source connected to the Merge node, and so on.[/box] 5. Connect the Merge node to a Table node: 6. Edit the Merge node. Since the Merge node can cope with a variety of different situations, the Merge tab allows you to specify the merging method. There are four methods for merging: Order: It joins the first record in the first dataset with the first record in the second dataset, and so on. If any of the datasets run out of records, no further output records are produced. This method can be dangerous if there happens to be any cases that are missing from a file, or if files have been sorted differently. Keys: It is the most commonly used method, used when records that have the same value in the field(s) defined as the key are merged. If multiple records contain the same value on the key field, all possible merges are returned. Condition: It joins records from files that meet a specified condition. Ranked condition: It specifies whether each row pairing in the primary dataset and all secondary datasets are to be merged; use the ranking expression to sort any multiple matches from low to high order. Let's combine these files. To do this: Set Merge Method to Keys. Fields contained in all input sources appear in the Possible keys list. To identify one of more fields as the key field(s), move the selected field into the Keys for merge list. In our case, there are two fields that appear in both files, ID and Year. 2. Select ID in the Possible keys list and place it into the Keys for merge list: There are five major methods of merging using a key field: Include only matching records (inner join) merges only complete records, that is, records; that are available in all datasets. 
Include matching and non-matching records (full outer join) merges records that appear in any of the datasets; that is, the incomplete records are still retained. The undefined value ($null$) is added to the missing fields and included in the output. Include matching and selected non-matching records (partial outerjoin) performs left and right outer-joins. All records from the specified file are retained, along with only those records from the other file(s) that match records in the specified file on the key field(s). The Select... button allows you to designate which file is to contribute incomplete records. Include records in first dataset not matching any others (anti-join) provides an easy way of identifying records in a dataset that do not have records with the same key values in any of the other datasets involved in the merge. This option only retains records from the dataset that match with no other records. Combine duplicate key fields is the final option in this dialog, and it deals with the problem of duplicate field names (one from each dataset) when key fields are used. This option ensures that there is only one output field with a given name, and this is enabled by default. The Filter tab The Filter tab lists the data sources involved in the merge, and the ordering of the sources determines the field ordering of the merged data. Here, you can rename and remove fields. Earlier, we saw that the field Year appeared in both datasets; here we can remove one version of this field (we could also rename one version of the field to keep both): Click on the arrow next to the second Year field: The second Year field will no longer appear in the combined data file. The Optimization tab The Optimization tab provides two options that allow you to merge data more efficiently when one input dataset is significantly larger than the other datasets, or when the data is already presorted by all or some of the key fields that you are using to merge: Click OK. Run the Table: All of these files have now been combined. The resulting table should have 44 fields and 143,531 records. We saw how the Merge node is used to join data files that contain different information for the same records. If you found this post useful, make sure to check out IBM SPSS Modeler Essentials for more information on leveraging SPSS Modeler to get faster and efficient insights from your data.  
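If you ever want to prototype the same joins outside Modeler, the pandas library offers close equivalents of these key-based merge methods. The sketch below is not part of the Modeler workflow; the tiny example frames and their values are made up, and only the ID and Year field names mirror the data used above.

# Rough pandas equivalents of the Merge node's key-based methods (prototyping only).
import pandas as pd

left = pd.DataFrame({"ID": [1, 2, 3], "Year": [2017, 2017, 2017], "Spend": [10, 20, 30]})
right = pd.DataFrame({"ID": [2, 3, 4], "Year": [2017, 2017, 2017], "Region": ["N", "S", "E"]})

inner = pd.merge(left, right, on="ID", how="inner", suffixes=("", "_right"))   # matching records only
full = pd.merge(left, right, on="ID", how="outer", suffixes=("", "_right"))    # matching and non-matching
partial = pd.merge(left, right, on="ID", how="left", suffixes=("", "_right"))  # keep all records from left

# Anti-join: records in the first dataset that match nothing in the second
marked = pd.merge(left, right, on="ID", how="left", indicator=True, suffixes=("", "_right"))
anti = marked[marked["_merge"] == "left_only"].drop(columns="_merge")

# Dropping the duplicate Year field, like removing it on the Merge node's Filter tab
full = full.drop(columns="Year_right")
print(full)

The how argument maps directly onto the inner, full outer, and partial outer joins described above, and the indicator trick reproduces the anti-join option.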

Working with Kafka Streams

Amarabha Banerjee
22 Feb 2018
6 min read
This article is a book excerpt from Apache Kafka 1.0 Cookbook written by Raúl Estrada. This book will simplify real-time data processing by leveraging Apache Kafka 1.0. In today’s tutorial we are going to discuss how to work with Apache Kafka Streams efficiently. In the data world, a stream is linked to the most important abstractions. A stream depicts a continuously updating and unbounded process. Here, unbounded means unlimited size. By definition, a stream is a fault-tolerant, replayable, and ordered sequence of immutable data records. A data record is defined as a key-value pair. Before we proceed, some concepts need to be defined: Stream processing application: Any program that utilizes the Kafka streams library is known as a stream processing application. Processor topology: This is a topology that defines the computational logic of the data processing that a stream processing application requires to be performed. A topology is a graph of stream processors (nodes) connected by streams (edges).  There are two ways to define a topology: Via the low-level processor API Via the Kafka streams DSL Stream processor: This is a node present in the processor topology. It represents a processing step in a topology and is used to transform data in streams. The standard operations—filter, join, map, and aggregations—are examples of stream processors available in Kafka streams. Windowing: Sometimes, data records are divided into time buckets by a stream processor to window the stream by time. This is usually required for aggregation and join operations. Join: When two or more streams are merged based on the keys of their data records, a new stream is generated. The operation that generates this new stream is called a join. A join over record streams is usually required to be performed on a windowing basis. Aggregation: A new stream is generated by combining multiple input records into a single output record, by taking one input stream. The operation that creates this new stream is known as aggregation. Examples of aggregations are sums and counts. Setting up the project This recipe sets the project to use Kafka streams in the Treu application project. Getting ready The project generated in the first four chapters is needed. How to do it Open the build.gradle file on the Treu project generated in Chapter 4, Message Enrichment, and add these lines: apply plugin: 'java' apply plugin: 'application' sourceCompatibility = '1.8' mainClassName = 'treu.StreamingApp' repositories { mavenCentral() } version = '0.1.0' dependencies { compile 'org.apache.kafka:kafka-clients:1.0.0' compile 'org.apache.kafka:kafka-streams:1.0.0' compile 'org.apache.avro:avro:1.7.7' } jar { manifest { attributes 'Main-Class': mainClassName } from { configurations.compile.collect { it.isDirectory() ? it : zipTree(it) } } { exclude "META-INF/*.SF" exclude "META-INF/*.DSA" exclude "META-INF/*.RSA" } } To rebuild the app, from the project root directory, run this command: $ gradle jar The output is something like: ... 
BUILD SUCCESSFUL Total time: 24.234 secs As the next step, create a file called StreamingApp.java in the src/main/java/treu directory with the following contents: package treu; import org.apache.kafka.streams.StreamsBuilder; import org.apache.kafka.streams.Topology; import org.apache.kafka.streams.KafkaStreams; import org.apache.kafka.streams.StreamsConfig; import org.apache.kafka.streams.kstream.KStream; import java.util.Properties; public class StreamingApp { public static void main(String[] args) throws Exception { Properties props = new Properties(); props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streaming_app_id");// 1 props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); //2 StreamsConfig config = new StreamsConfig(props); // 3 StreamsBuilder builder = new StreamsBuilder(); //4 Topology topology = builder.build(); KafkaStreams streams = new KafkaStreams(topology, config); KStream<String, String> simpleFirstStream = builder.stream("src-topic"); //5 KStream<String, String> upperCasedStream = simpleFirstStream.mapValues(String::toUpperCase); //6 upperCasedStream.to("out-topic"); //7 System.out.println("Streaming App Started"); streams.start(); Thread.sleep(30000); //8 System.out.println("Shutting down the Streaming App"); streams.close(); } } How it works Follow the comments in the code: In line //1, the APPLICATION_ID_CONFIG is an identifier for the app inside the broker In line //2, the BOOTSTRAP_SERVERS_CONFIG specifies the broker to use In line //3, the StreamsConfig object is created, it is built with the properties specified In line //4, the StreamsBuilder object is created, it is used to build a topology In line //5, when KStream is created, the input topic is specified In line //6, another KStream is created with the contents of the src-topic but in uppercase In line //7, the uppercase stream should write the output to out-topic In line //8, the application will run for 30 seconds Running the streaming application In the previous recipe, the first version of the streaming app was coded. Now, in this recipe, everything is compiled and executed. Getting ready The execution of the previous recipe of this chapter is needed. How to do it The streaming app doesn't receive arguments from the command line: To build the project, from the treu directory, run the following command: $ gradle jar If everything is OK, the output should be: ... BUILD SUCCESSFUL Total time: … To run the project, we have four different command-line windows. The following diagram shows what the arrangement of command-line windows should look like: In the first command-line Terminal, run the control center: $ <confluent-path>/bin/confluent start In the second command-line Terminal, create the two topics needed: $ bin/kafka-topics --create --topic src-topic --zookeeper localhost:2181 --partitions 1 --replication-factor 1 $ bin/kafka-topics --create --topic out-topic --zookeeper localhost:2181 --partitions 1 --replication-factor 1 In that command-line Terminal, start the producer: $ bin/kafka-console-producer --broker-list localhost:9092 --topic src-topic This window is where the input messages are typed. In the third command-line Terminal, start a consumer script listening to outtopic: $ bin/kafka-console-consumer --bootstrap-server localhost:9092 -- from-beginning --topic out-topic In the fourth command-line Terminal, start up the processing application. 
Go to the project root directory (where the Gradle jar command was executed) and run: $ java -jar ./build/libs/treu-0.1.0.jar localhost:9092 Go to the second command-line Terminal (console-producer) and send the following three messages (remember to press Enter between messages and execute each one in just one line): $> Hello [Enter] $> Kafka [Enter] $> Streams [Enter] The messages typed in the console-producer window should appear in uppercase in the out-topic console-consumer window: > HELLO > KAFKA > STREAMS We discussed Apache Kafka Streams and how to get up and running with it. If you liked this post, be sure to check out Apache Kafka 1.0 Cookbook, which contains more useful recipes for working with Apache Kafka.  
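As an optional check, the streaming application can also be exercised from Python instead of the two console windows. The sketch below uses the third-party kafka-python package, which is our own choice for illustration and is not used anywhere in the recipe; it writes the three test words to src-topic and prints whatever the Kafka Streams app publishes to out-topic.

# Optional Python check of the streaming app (pip install kafka-python).
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
for word in ["Hello", "Kafka", "Streams"]:
    producer.send("src-topic", word.encode("utf-8"))
producer.flush()

consumer = KafkaConsumer(
    "out-topic",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=15000,   # give the streaming app a moment to process
)
for record in consumer:
    print(record.value.decode("utf-8"))   # expect HELLO, KAFKA, STREAMS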

How to handle missing data in IBM SPSS Modeler

Amey Varangaonkar
21 Feb 2018
8 min read
[box type="note" align="" class="" width=""]The following excerpt is taken from the book IBM SPSS Modeler Essentials written by Keith McCormick and Jesus Salcedo. This book gets you up and running with the fundamentals of SPSS Modeler, a premium tool for data mining and predictive analytics.[/box] In today’s tutorial we will demonstrate how easy it is to work with missing values in a dataset using the SPSS Modeler. Missing data is different than other topics in data modeling that you cannot choose to ignore . This is because failing to make a choice just means you are using the default option for a procedure, which most of the time is not optimal. In fact, it is important to remember that every model deals with missing data in a certain way, and some modeling techniques handle missing data better than others. In SPSS Modeler, there are four types of missing data: Type of missing data Definition $Null$ value Applies only to numeric fields. This is a cell that is empty or has an illegal value White space Applies only to string fields. This is a cell that is empty or has spaces. Empty string Applies only to string fields. This is a cell that is empty. Empty string is a subset of white space Blank value This is predefined code, and it applies to any type of field The first step in dealing with missing data is to assess the type and amount of missing data for each field. Consider whether there is a pattern as to why data might be missing. This can help determine if missing values could have affected responses. Only then can we decide how to handle it. There are two problems associated with missing data, and these affect the quantity and quality of the data: Missing data reduces sample size (quantity) Responders may be different from non-responders (quality—there could be biased results) Ways to address missing data There are three ways to address missing data: Remove fields Remove cases Impute missing values It can be necessary at times to remove fields with a large proportion of missing values. The easiest way to remove fields is to use a Filter node (discussed later in the book), however you can also use the Data Audit node to do this. [box type="info" align="" class="" width=""]Note that in some cases missing data can be predictive of behavior, so it is important to assess the importance of a variable before removing a field.[/box] In some situations, it may be necessary to remove cases instead of fields. For example, you may be developing a predictive model to predict customers' purchasing behavior and you simply do not have enough information concerning new customers. The easiest way to remove cases would be to use a Select node (discussed in the next chapter); however, you can also use the Data Audit node to do this. Imputing missing values implies replacing values for fields. However, some people do not estimate values for categorical fields because it does not seem right. In general, it is easier to estimate missing values for numeric fields, such as age, where often analysts will use the mean, median, or mode. [box type="info" align="" class="" width=""]Note that it is not a good idea to estimate missing data if you are missing a large percentage of information for that field, because estimates will not be accurate. 
Typically, we try not to impute more than 5% of values.[/box] To close out of the Data Audit node: Click OK to return to the stream canvas Defining missing values in the Type node When working with missing data, the first thing you need to do is define the missing data so that Modeler knows there is missing data, otherwise Modeler will think that the missing data is another value for a field (which, in some situations, it is, as in our dataset, but quite often this is not the case). Although the Data Audit node provides a report of missing values, blank values need to be defined within a Type node (or Type tab of a source node) in order for these to be identified by the Data Audit node. The Type tab (or node) is the only place where users can define missing values (Missing column). [box type="info" align="" class="" width=""]Note that in the Type node, blank values and $null$ values are not shown; however, empty strings and white space are depicted by "" or " ".[/box] To define blank values: Edit the Var.File node. Click on the Types tab. Click on the Missing cell for the field Region. Select Specify in the Missing column. Click Define blanks. Selecting Define blanks chooses Null and White space (remember, Empty String is a subset of White space, so it is also selected), and in this way these types of missing data are specified. To specify a predefined code, or a blank value, you can add each individual value to a separate cell in the Missing values area, or you can enter a range of numeric values if they are consecutive. 6. Type "Not applicable" in the first Missing values cell. 7. Hit Enter: We have now specified that "Not applicable" is a code for missing data for the field Region. 8. Click OK. In our dataset, we will only define one field as having missing data. 9. Click on the Clear Values button. 10. Click on the Read Values button: The asterisk indicates that missing values have been defined for the field Region. Now Not applicable is no longer considered a valid value for the field Region, but it will still be shown in graphs and other output. However, models will now treat the category Not applicable as a missing value. 11. Click OK. Imputing missing values with the Data Audit node As we have seen, the Data Audit node allows you to identify missing values so that you can get a sense of how much missing data you have. However, the Data Audit node also allows you to remove fields or cases that have missing data, as well as providing several options for data imputation: Rerun the Data Audit node. Note that the field Region only has 15,774 valid cases now, because we have correctly identified that the Not applicable category was a predefined code for missing data. 2. Click on the Quality tab. We are not going to impute any missing values in this example because it is not necessary, but we are going to show you some of the options, since these will be useful in other situations. To impute missing values you first need to specify when you want to impute missing values. For example: 3. Click in the Impute when cell for the field Region. 4. Select the Blank & Null Values. Now you need to specify how the missing values will be imputed. 5. Click in the Impute Method cell for the field Region. 6. Select Specify. In this dialog box, you can specify which imputation method you want to use, and once you have chosen a method, you can then further specify details about the imputation. There are several imputation methods: Fixed uses the same value for all cases. 
This fixed value can be a constant, the mode, the mean, or a midpoint of the range (the options will vary depending on the measurement level of the field). Random uses a random (different) value based on a normal or uniform distribution. This allows for there to be variation for the field with imputed values. Expression allows you to create your own equation to specify missing values. Algorithm uses a value predicted by a C&R Tree model. We are not going to impute any values now so click Cancel. If we had selected an imputation method, we would then: Click on the field Region to select it. Click on the Generate menu. The Generate menu of the Data Audit node allows you to remove fields, remove cases, or impute missing values, explained as follows: Missing Values Filter Node: This removes fields with too much missing data, or keeps fields with missing data so that you can investigate them further Missing Values Select Node: This removes cases with missing data, or keeps cases with missing data so that you can further investigate Missing Values SuperNode: This imputes missing values: If we were going to impute values, we would then click Missing Values SuperNode. In this way you can impute missing values using SPSS Modeler, and it makes your analysis a lot more easier. If you found our post useful, make sure to check out our book IBM SPSS Modeler Essentials, for more information on data mining and generating hidden insights using the popular SPSS Modeler tool.  
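For readers who like to prototype outside Modeler, the same missing-data decisions map neatly onto pandas. The sketch below is only an analogy, not Modeler itself; the DataFrame and the column values are invented for illustration, with the Region field, the Not applicable code, and the imputation choices echoing the example above.

# A rough pandas analogue of the missing-data handling discussed above.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Region": ["North", "Not applicable", "South", None, "North"],
    "Age": [34, np.nan, 51, 28, np.nan],
})

# Define the blank value: treat the predefined code as missing, like the Type node
df["Region"] = df["Region"].replace("Not applicable", np.nan)

# Assess the proportion of missing data per field before deciding what to do
print(df.isna().mean())

# Impute with a fixed value: the mode for a categorical field, the mean for a numeric one
df["Region"] = df["Region"].fillna(df["Region"].mode()[0])
df["Age"] = df["Age"].fillna(df["Age"].mean())

# Alternatively, remove cases (rows) or fields (columns) with missing values
# df = df.dropna()             # remove cases
# df = df.drop(columns="Age")  # remove a field

As in Modeler, the important step is the first one: until the predefined code is declared as missing, every summary and model will treat "Not applicable" as just another valid category.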

Installing TensorFlow in Windows, Ubuntu and Mac OS

Amarabha Banerjee
21 Feb 2018
7 min read
[box type="note" align="" class="" width=""]This article is taken from the book Machine Learning with Tensorflow 1.x, written by Quan Hua, Shams Ul Azeem and Saif Ahmed. This book will help tackle common commercial machine learning problems with Google’s TensorFlow 1.x library.[/box] Today, we shall explore the basics of getting started with TensorFlow, its installation and configuration process. The proliferation of large public datasets, inexpensive GPUs, and open-minded developer culture has revolutionized machine learning efforts in recent years. Training data, the lifeblood of machine learning, has become widely available and easily consumable in recent years. Computing power has made the required horsepower available to small businesses and even individuals. The current decade is incredibly exciting for data scientists. Some of the top platforms used in the industry include Caffe, Theano, and Torch. While the underlying platforms are actively developed and openly shared, usage is limited largely to machine learning practitioners due to difficult installations, non-obvious configurations, and difficulty with productionizing solutions. TensorFlow has one of the easiest installations of any platform, bringing machine learning capabilities squarely into the realm of casual tinkerers and novice programmers. Meanwhile, high-performance features, such as—multiGPU support, make the platform exciting for experienced data scientists and industrial use as well. TensorFlow also provides a reimagined process and multiple user-friendly utilities, such as TensorBoard, to manage machine learning efforts. Finally, the platform has significant backing and community support from the world's largest machine learning powerhouse--Google. All this is before even considering the compelling underlying technical advantages, which we'll dive into later. Installing TensorFlow TensorFlow conveniently offers several types of installation and operates on multiple operating systems. The basic installation is CPU-only, while more advanced installations unleash serious horsepower by pushing calculations onto the graphics card, or even to multiple graphics cards. We recommend starting with a basic CPU installation at first. More complex GPU and CUDA installations will be discussed in Appendix, Advanced Installation. Even with just a basic CPU installation, TensorFlow offers multiple options, which are as follows: A basic Python pip installation A segregated Python installation via Virtualenv A fully segregated container-based installation via Docker Ubuntu installation Ubuntu is one of the best Linux distributions for working with Tensorflow. We highly recommend that you use an Ubuntu machine, especially if you want to work with GPU. We will do most of our work on the Ubuntu terminal. We will begin with installing pythonpip and python-dev via the following command: sudo apt-get install python-pip python-dev A successful installation will appear as follows: If you find missing packages, you can correct them via the following command: sudo apt-get update --fix-missing Then, you can continue the python and pip installation. We are now ready to install TensorFlow. The CPU installation is initiated via the following command: sudo pip install tensorflow A successful installation will appear as follows: macOS installation If you use Python, you will probably already have the Python package installer, pip. However, if not, you can easily install it using the easy_install pip command. 
You'll note that we actually executed sudo easy_install pip; the sudo prefix is required because the installation needs administrative rights. We will make the fair assumption that you already have the basic package installer, easy_install, available; if not, you can install it from https://pypi.python.org/pypi/setuptools.

Next, we will install the six package:

sudo easy_install --upgrade six

Surprisingly, those are the only two prerequisites for TensorFlow, and we can now install the core platform. We will use the pip package installer mentioned earlier to install TensorFlow. The most recent version at the time of writing this book is v1.3; pip will install the latest release available unless you pin a specific version:

sudo pip install tensorflow

The pip installer will automatically gather all the other required dependencies. You will see each individual download and installation until the software is fully installed.

That's it! If you were able to get to this point, you can start to train and run your first model. Skip to Chapter 2, Your First Classifier, to train your first model. macOS users wishing to completely segregate their installation can use a VM instead, as described in the Windows installation.

Windows installation

As we mentioned earlier, TensorFlow with Python 2.7 does not function natively on Windows. In this section, we will guide you through installing TensorFlow with Python 3.5, and through setting up a Linux VM if you want to use TensorFlow with Python 2.7.

First, we need to install Python 3.5.x or 3.6.x 64-bit from the following links:

https://www.python.org/downloads/release/python-352/
https://www.python.org/downloads/release/python-362/

Make sure that you download the 64-bit version of Python, where the name of the installer contains amd64, such as python-3.6.2-amd64.exe. Select Add Python 3.6 to PATH and click Install Now. When the installation completes, click Disable path length limit and then click Close to finish the Python installation.

Now, let's open the Windows PowerShell application under the Windows menu. We will install the CPU-only version of TensorFlow with the following command:

pip3 install tensorflow

Congratulations, you can now use TensorFlow on Windows with Python 3.5.x or 3.6.x support. In the next section, we will show you how to set up a VM to use TensorFlow with Python 2.7. However, you can skip to the Test installation section of Chapter 2, Your First Classifier, if you don't need Python 2.7.

Now, we will show you how to set up a VM with Linux to use TensorFlow with Python 2.7. We recommend the free VirtualBox system, available at https://www.virtualbox.org/wiki/Downloads. The installer used at the time of writing is available at the following URL:

http://download.virtualbox.org/virtualbox/5.1.28/VirtualBox-5.1.28-117968-Win.exe

A successful installation will allow you to run the Oracle VM VirtualBox Manager dashboard.

Testing the installation

In this section, we will use TensorFlow to compute a simple math operation. First, open your terminal on Linux/macOS, or Windows PowerShell on Windows.
Now, we need to run python to use TensorFlow:

python

Enter the following program in the Python shell:

import tensorflow as tf
a = tf.constant(1.0)
b = tf.constant(2.0)
c = a + b
sess = tf.Session()
print(sess.run(c))

The result is 3.0, printed at the end of the output.

We covered TensorFlow installation on the three major operating systems, so that you are up and running with the platform. Windows users faced an extra challenge, as TensorFlow on Windows only supports the Python 3.5.x and Python 3.6.x 64-bit versions. However, even Windows users should now be up and running. If you liked this article, be sure to check out Machine Learning with Tensorflow 1.x, which will help you take up any challenge you may face while implementing TensorFlow 1.x in your machine learning environment.
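If you want to experiment a little further before moving on, the short sketch below extends the constant-only test above by feeding a value into the graph at run time with a placeholder. It is an illustrative example of the TensorFlow 1.x graph-and-session workflow rather than anything from the book, and the variable names are our own.

import tensorflow as tf

# Build a small graph: the placeholder lets us supply the multiplier at run time.
x = tf.placeholder(tf.float32, shape=(), name="x")
y = tf.constant(4.0, name="y")
product = tf.multiply(x, y, name="product")

# Launch a session and run the same graph with different inputs.
with tf.Session() as sess:
    print(sess.run(product, feed_dict={x: 2.5}))   # prints 10.0
    print(sess.run(product, feed_dict={x: 0.5}))   # prints 2.0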

How to create a conversational assistant or chatbot using Python

Savia Lobo
21 Feb 2018
5 min read
[box type="note" align="" class="" width=""]This article is an excerpt taken from a book Natural Language Processing with Python Cookbook written by Krishna Bhavsar, Naresh Kumar, and Pratap Dangeti. This book includes unique recipes to teach various aspects of performing Natural Language Processing with NLTK—the leading Python platform for the task.[/box] Today we will learn to create a conversational assistant or chatbot using Python programming language. Conversational assistants or chatbots are not very new. One of the foremost of this kind is ELIZA, which was created in the early 1960s and is worth exploring. In order to successfully build a conversational engine, it should take care of the following things: 1. Understand the target audience 2. Understand the natural language in which communication happens.  3. Understand the intent of the user 4. Come up with responses that can answer the user and give further clues NLTK has a module, nltk.chat, which simplifies building these engines by providing a generic framework. Let's see the available engines in NLTK: Engines Modules Eliza nltk.chat.eliza Python module Iesha nltk.chat.iesha Python module Rude nltk.chat.rudep ython module Suntsu Suntsu nltk.chat.suntsu module Zen nltk.chat.zen module In order to interact with these engines we can just load these modules in our Python program and invoke the demo() function. This recipe will show us how to use built-in engines and also write our own simple conversational engine using the framework provided by the nltk.chat module. Getting ready You should have Python installed, along with the nltk library. Having an understanding of regular expressions also helps. How to do it...    Open atom editor (or your favorite programming editor).    Create a new file called Conversational.py.    Type the following source code:    Save the file.    Run the program using the Python interpreter.    You will see the following output: How it works... Let's try to understand what we are trying to achieve here. import nltk This instruction imports the nltk library into the current program. def builtinEngines(whichOne): This instruction defines a new function called builtinEngines that takes a string parameter, whichOne: if whichOne == 'eliza': nltk.chat.eliza.demo() elif whichOne == 'iesha': nltk.chat.iesha.demo() elif whichOne == 'rude': nltk.chat.rude.demo() elif whichOne == 'suntsu': nltk.chat.suntsu.demo() elif whichOne == 'zen': nltk.chat.zen.demo() else: print("unknown built-in chat engine {}".format(whichOne)) These if, elif, else instructions are typical branching instructions that decide which chat engine's demo() function is to be invoked depending on the argument that is present in the whichOne variable. When the user passes an unknown engine name, it displays a message to the user that it's not aware of this engine. It's a good practice to handle all known and unknown cases also; it makes our programs more robust in handling unknown situations def myEngine():. This instruction defines a new function called myEngine(); this function does not take any parameters. chatpairs = ( (r"(.*?)Stock price(.*)", ("Today stock price is 100", "I am unable to find out the stock price.")), (r"(.*?)not well(.*)", ("Oh, take care. May be you should visit a doctor", "Did you take some medicine ?")), (r"(.*?)raining(.*)", ("Its monsoon season, what more do you expect ?", "Yes, its good for farmers")), (r"How(.*?)health(.*)", ("I am always healthy.", "I am a program, super healthy!")), (r".*", ("I am good. 
How are you today ?", "What brings you here ?")) ) This is a single instruction where we are defining a nested tuple data structure and assigning it to chat pairs. Let's pay close attention to the data structure: We are defining a tuple of tuples Each subtuple consists of two elements: The first member is a regular expression (this is the user's question in regex format) The second member of the tuple is another set of tuples (these are the answers) def chat(): print("!"*80) print(" >> my Engine << ") print("Talk to the program using normal english") print("="*80) print("Enter 'quit' when done") chatbot = nltk.chat.util.Chat(chatpairs, nltk.chat.util.reflections) chatbot.converse() We are defining a subfunction called chat()inside the myEngine() function. This is permitted in Python. This chat() function displays some information to the user on the screen and calls the nltk built-in nltk.chat.util.Chat() class with the chatpairs variable. It passes nltk.chat.util.reflections as the second argument. Finally we call the chatbot.converse() function on the object that's created using the chat() class. chat() This instruction calls the chat() function, which shows a prompt on the screen and accepts the user's requests. It shows responses according to the regular expressions that we have built before: if   name    == '  main  ': for engine in ['eliza', 'iesha', 'rude', 'suntsu', 'zen']: print("=== demo of {} ===".format(engine)) builtinEngines(engine) print() myEngine() These instructions will be called when the program is invoked as a standalone program (not using import). They do these two things: Invoke the built-in engines one after another (so that we can experience them) Once all the five built-in engines are excited, they call our myEngine(), where our customer engine comes into play We have learned to create a chatbot of our own using the easiest programming language ‘Python’. To know more about how to efficiently use NLTK and implement text classification, identify parts of speech, tag words, etc check out Natural Language Processing with Python Cookbook.


How to use Standard Macro in Workflows

Sunith Shetty
21 Feb 2018
6 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Renato Baruti titled Learning Alteryx. In this book, you will learn how to perform self-analytics and create interactive dashboards using various tools in Alteryx.[/box] Today we will learn Standard Macro that will provide you with a foundation for building enhanced workflows. The csv file required for this tutorial is available to download here. Standard Macro Before getting into Standard Macro, let's define what a macro is. A macro is a collection of workflow tools that are grouped together into one tool. Using a range of different interface tools, a macro can be developed and used within a workflow. Any workflow can be turned into a macro and a repeatable element of a workflow can commonly be converted into a macro. There are a couple of ways you can turn your workflow into a Standard Macro. The first is to go to the canvas configuration pane and navigate to the Workflow tab. This is where you select what type of workflow you want. If you select Macro you should then have Standard Macro automatically selected. Now, when you save this workflow it will save as a macro. You’ll then be able to add it to another workflow and run the process created within the macro itself. The second method is just to add a Macro Input tool from the Interface tool section onto the canvas; the workflow will then automatically change to a Standard Macro. The following screenshot shows the selection of a Standard Macro, under the Workflow tab: Let's go through an example of creating and deploying a standard macro. Standard Macro Example #1: Create a macro that allows the user to input a number used as a multiplier. Use the multiplier for the DataValueAlt field. The following steps demonstrate this process: Step 1: Select the Macro Input tool from the Interface tool palette and add the tool onto the canvas. The workflow will automatically change to a Standard Macro. Step 2: Select Text Input and Edit Data option within the Macro Input tool configuration. Step 3: Create a field called Number and enter the values: 155, 243, 128, 352, and 357 in each row, as shown in the following image: Step 4: Rename the Input Name Input and set the Anchor Abbreviation as I as shown in the following image: Step 5: Select the Formula tool from the Preparation tool palette. Connect the Formula tool to the Macro Input tool. Step 6: Select the + Add Column option in the Select Column drop down within the Formula tool configuration. Name the field Result. Step 7: Add the following expression to the expression window: [Number]*0.50 Step 8: Select the Macro Output tool from the Interface tool palette and add the tool onto the canvas. Connect the Macro Output tool to the Formula tool. Step 9: Rename the Output Name Output and set the Anchor Abbreviation as O: The Standard Macro has now been created. It can be saved to use as multiplier, to calculate the five numbers added within the Macro Input tool to multiply 0.50. This is great; however, let's take it a step further to make it dynamic and flexible by allowing the user to enter a multiplier. For instance, currently the multiplier is set to 0.50, but what if a user wants to change that to 0.25 or 0.10 to determine the 25% or 10% value of a field. Let's continue building out the Standard Macro to make this possible. Step 1: Select the Text Box tool from the Interface tool palette and drag it onto the canvas. 
Connect the Text Box tool to the Formula tool on the lightning bolt (the macro indicator). The Action tool will automatically be added to the canvas; it updates the configuration of a workflow with the values provided by interface questions when the workflow is run as an app or macro.

Step 2: Configure the Action tool so that it automatically updates the expression with a specific value. Select Formula | FormulaFields | FormulaField | @expression - value="[Number]*0.50". Select the Replace a specific string: option and enter 0.50. This is where the automation happens, updating the 0.50 to whatever number the user enters. You will see how this happens in the following steps.
Step 3: In the Enter the text or question to be displayed text box, within the Text Box tool configuration, enter: Please enter a number:
Step 4: Save the workflow as Standard Macro.yxmc. The .yxmc file type indicates that it is a macro workflow.
Step 5: Open a new workflow.
Step 6: Select the Input Data tool from the In/Out tool palette and connect to the U.S. Chronic Disease Indicators.csv file.
Step 7: Select the Select tool from the Preparation tool palette and drag it onto the canvas. Connect the Select tool to the Input Data tool.
Step 8: Change the Data Type for the DataValueAlt field to Double.
Step 9: Right-click on the canvas and select Insert | Macro | Standard Macro.
Step 10: Connect the Standard Macro to the Select tool.
Step 11: There will be Questions to configure within the Standard Macro tool configuration. Select DataValueAlt (Double) as the Choose Field option and enter 0.25 in the Please enter a number text box.
Step 12: Add a Browse tool to the Standard Macro tool.
Step 13: Run the workflow.

The goal of creating this Standard Macro was to let the user choose the multiplier rather than rely on a static number. Let's recap what has been created and deployed. First, Standard Macro.yxmc was developed using Interface tools. The Macro Input (I) was used to enter sample text data for the Number field. This Number field is what gets multiplied by the given multiplier, which starts out as the static value 0.50. The Formula tool was used to create the expression stating that the Number field will be multiplied by 0.50. The Macro Output (O) was used to output the macro so that it can be used in another workflow. The Text Box tool is where the question Please enter a number is displayed, and the Action tool updates the specific value being replaced. The static multiplier, 0.50, is replaced by 0.25, as entered in Step 11 above, through a dynamic input that lets the user set the multiplier.

Notice that, in the Browse tool output, the Result field has been added, multiplying the values of the DataValueAlt field by the multiplier 0.25. Change the value in the macro to 0.10 and run the workflow again: the Result field is updated to multiply the values of the DataValueAlt field by 0.10. This is a great use case for a Standard Macro and demonstrates how versatile the Interface tools are.

We learned about macros and their dynamic use within workflows. We saw how a Standard Macro was developed to allow the end user to specify the multiplier they want. This is a great way to implement interactivity within a workflow. To know more about high-quality interactive dashboards and efficient self-service data analytics, do check out the book Learning Alteryx.
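Alteryx macros are built in the designer rather than in code, but for readers who think in code, here is a rough Python/pandas equivalent of what the finished macro does: read the dataset, cast DataValueAlt to a numeric (Double) type, and multiply it by a user-supplied factor. The file name and column name come from the tutorial; everything else is an illustrative assumption, not Alteryx functionality.

import pandas as pd

def apply_multiplier(df, field, multiplier):
    # Mirror the macro: add a Result column equal to the chosen field times the multiplier.
    out = df.copy()
    out["Result"] = out[field] * multiplier
    return out

# Load the tutorial dataset and cast DataValueAlt to a numeric type.
data = pd.read_csv("U.S. Chronic Disease Indicators.csv")
data["DataValueAlt"] = pd.to_numeric(data["DataValueAlt"], errors="coerce")

# The "Please enter a number" question becomes a simple function argument here.
print(apply_multiplier(data, "DataValueAlt", 0.25)[["DataValueAlt", "Result"]].head())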