How-To Tutorials

article-image-session-3-fairness-in-computer-vision-and-nlp

23 Feb 2018

6 min read

FAT Conference 2018 Session 3: Fairness in Computer Vision and NLP

23 Feb 2018

Machine learning has emerged with a vast new ecosystem of techniques and infrastructure and we are just beginning to learn their full capabilities. But with the exciting innovations happening, there are also some really concerning problems arising. Forms of bias, stereotyping and unfair determination are being found in computer vision systems, object recognition models, and in natural language processing and word embeddings. The Conference on Fairness, Accountability, and Transparency (FAT) scheduled on Feb 23 and 24 this year in New York is an annual conference dedicating to bringing theory and practice of fair and interpretable Machine Learning, Information Retrieval, NLP, Computer Vision, Recommender systems, and other technical disciplines. This year's program includes 17 peer-reviewed papers and 6 tutorials from leading experts in the field. The conference will have three sessions. Session 3 of the two-day conference on Saturday, February 24, is in the field of fairness in computer vision and NLP. In this article, we give our readers a peek into the three papers that have been selected for presentation in Session 3. You can also check out Session 1 and Session 2, in case you’ve missed them. Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification What is the paper about The paper talks about substantial disparities in the accuracy of classifying darker and lighter females and males in gender classification systems. The authors have evaluated bias present in automated facial analysis algorithms and datasets with respect to phenotypic subgroups. Using the dermatologist approved Fitzpatrick Skin Type classification system, they have characterized the gender and skin type distribution of two facial analysis benchmarks, IJB-A and Adience. They have also evaluated 3 commercial gender classification systems using this dataset. Key takeaways The paper measures accuracy of 3 commercial gender classification algorithms by Microsoft, IBM, and Face++ on the new Pilot Parliaments Benchmark which is balanced by gender and skin type. On annotating the dataset with the Fitzpatrick skin classification system and testing gender classification performance on 4 subgroups, they found : All classifiers perform better on male faces than on female faces (8.1% − 20.6% difference in error rate) All classifiers perform better on lighter faces than darker faces (11.8% − 19.2% difference in error rate) All classifiers perform worst on darker female faces (20.8% − 34.7% error rate) Microsoft and IBM classifiers perform best on lighter male faces (error rates of 0.0% and 0.3% respectively) Face++ classifiers perform best on darker male faces (0.7% error rate) The maximum difference in error rate between the best and worst classified groups is 34.4% They encourage further work to see if the substantial error rate gaps on the basis of gender, skin type and intersectional subgroup revealed in this study of gender classification persist in other human-based computer vision tasks as well. Analyze, Detect and Remove Gender Stereotyping from Bollywood Movies What is the paper about The paper studies gender stereotypes and cases of bias in the Hindi movie industry (Bollywood) and propose an algorithm to remove these stereotypes from text. The authors have analyzed movie plots and posters for all movies released since 1970. The gender bias is detected by semantic modeling of plots at sentence and intra-sentence level. Different features like occupation, introductions, associated actions and descriptions are captured to show the pervasiveness of gender bias and stereotype in movies. Next, they have developed an algorithm to generate debiased stories. The proposed debiasing algorithm extracts gender biased graphs from unstructured piece of text in stories from movies and de-bias these graphs to generate plausible unbiased stories. Key takeaways The analysis is performed at sentence at multi-sentence level and uses word embeddings by adding context vector and studying the bias in data. Data observation showed that while analyzing occupations for males and females, higher level roles are designated to males while lower level roles are designated to females. A similar trend has been observed for centrality where females were less central in the plot vs their male counterparts. Also, while predicting gender using context word vectors, with very small training data, a very high accuracy was observed in gender prediction for test data reflecting a substantial amount of bias present in the data. The authors have also presented an algorithm to remove such bias present in text. They show that by interchanging the gender of high centrality male character with a high centrality female character in the plot text, leaves no change in the story but de-biases it completely. Mixed Messages? The Limits of Automated Social Media Content Analysis What is the paper about This paper broadcasts that a knowledge gap exists between data scientists studying NLP and policymakers advocating for the wide adoption of automated social media analysis and moderation. It urges policymakers to understand the capabilities and limits of NLP before endorsing or adopting automated content analysis tools, particularly for making decisions that affect fundamental rights or access to government benefits. It draws on existing research to explain the capabilities and limitations of text classifiers for social media posts and other online content. This paper is aimed at helping researchers and technical experts address the gaps in policymakers knowledge about what is possible with automated text analysis. Key takeaways The authors have provided an overview of how NLP classifiers work and identified five key limitations of these tools that must be communicated to policymakers: NLP classifiers require domain-specific training and cannot be applied with the same reliability across different domains. NLP tools can amplify social bias reflected in language and are likely to have lower accuracy for minority groups. Accurate text classification requires clear, consistent definitions of the type of speech to be identified. Policy debates around content moderation and social media mining tend to lack such precise definitions. The accuracy achieved in NLP studies does not warrant widespread application of these tools to social media content analysis and moderation. Text filters remain easy to evade and fall far short of humans ability to parse meaning from text. The paper concludes with recommendations for NLP researchers to bridge the knowledge gap between technical experts and policymakers, including Clearly describe the domain limitations of NLP tools. Increase development of non-English training resources. Provide more detail and context for accuracy measures. Publish more information about definitions and instructions provided to annotators. Don’t miss our coverage on Session 4 and Session 5 on Fair Classification, Fat recommenders, etc.

0
0
16281

article-image-how-to-implement-dynamic-sql-in-postgresql-10

Amey Varangaonkar

23 Feb 2018

7 min read

How to implement Dynamic SQL in PostgreSQL 10

Amey Varangaonkar

23 Feb 2018

7 min read

In this PostgreSQL tutorial, we'll take a close look at the concept of dynamic SQL, and how it can make the life of database programmers easy by allowing efficient querying of data. This tutorial has been taken from the second edition of Learning PostgreSQL 10. You can read more here. Dynamic SQL is used to reduce repetitive tasks when it comes to querying. For example, one could use dynamic SQL to create table partitioning for a certain table on a daily basis, to add missing indexes on all foreign keys, or add data auditing capabilities to a certain table without major coding effects. Another important use of dynamic SQL is to overcome the side effects of PL/pgSQL caching, as queries executed using the EXECUTE statement are not cached. Dynamic SQL is achieved via the EXECUTE statement. The EXECUTE statement accepts a string and simply evaluates it. The synopsis to execute a statement is given as follows: EXECUTE command-string [ INTO [STRICT] target ] [ USING expression [, ...] ]; Executing DDL statements in dynamic SQL In some cases, one needs to perform operations at the database object level, such as tables, indexes, columns, roles, and so on. For example, a database developer would like to vacuum and analyze a specific schema object, which is a common task after the deployment in order to update the statistics. For example, to analyze the car_portal_app schema tables, one could write the following script: DO $$ DECLARE table_name text; BEGIN FOR table_name IN SELECT tablename FROM pg_tables WHERE schemaname ='car_portal_app' LOOP RAISE NOTICE 'Analyzing %', table_name; EXECUTE 'ANALYZE car_portal_app.' || table_name; END LOOP; END; $$; Executing DML statements in dynamic SQL Some applications might interact with data in an interactive manner. For example, one might have billing data generated on a monthly basis. Also, some applications filter data on different criteria defined by the user. In such cases, dynamic SQL is very convenient. For example, in the car portal application, the search functionality is needed to get accounts using the dynamic predicate, as follows: CREATE OR REPLACE FUNCTION car_portal_app.get_account (predicate TEXT) RETURNS SETOF car_portal_app.account AS $$ BEGIN RETURN QUERY EXECUTE 'SELECT * FROM car_portal_app.account WHERE ' || predicate; END; $$ LANGUAGE plpgsql; To test the previous function: car_portal=> SELECT * FROM car_portal_app.get_account ('true') limit 1; account_id | first_name | last_name | email | password ------------+------------+-----------+-----------------+------------------- --------------- 1 | James | Butt | jbutt@gmail.com | 1b9ef408e82e38346e6ebebf2dcc5ece (1 row) car_portal=> SELECT * FROM car_portal_app.get_account (E'first_name='James''); account_id | first_name | last_name | email | password ------------+------------+-----------+-----------------+------------------- --------------- 1 | James | Butt | jbutt@gmail.com | 1b9ef408e82e38346e6ebebf2dcc5ece (1 row) Dynamic SQL and the caching effect As mentioned earlier, PL/pgSQL caches execution plans. This is quite good if the generated plan is expected to be static. For example, the following statement is expected to use an index scan because of selectivity. In this case, caching the plan saves some time and thus increases performance: SELECT * FROM account WHERE account_id =<INT> In other scenarios, however, this is not true. For example, let's assume we have an index on the advertisement_date column and we would like to get the number of advertisements since a certain date, as follows: SELECT count (*) FROM car_portal_app.advertisement WHERE advertisement_date >= <certain_date>; In the preceding query, the entries from the advertisement table can be fetched from the hard disk either by using the index scan or using the sequential scan based on selectivity, which depends on the provided certain_date value. Caching the execution plan of such a query will cause serious problems; thus, writing the function as follows is not a good idea: CREATE OR REPLACE FUNCTION car_portal_app.get_advertisement_count (some_date timestamptz ) RETURNS BIGINT AS $$ BEGIN RETURN (SELECT count (*) FROM car_portal_app.advertisement WHERE advertisement_date >=some_date)::bigint; END; $$ LANGUAGE plpgsql; To solve the caching issue, one could rewrite the previous function either using the SQL language function or by using the PL/pgSQL execute command, as follows: CREATE OR REPLACE FUNCTION car_portal_app.get_advertisement_count (some_date timestamptz ) RETURNS BIGINT AS $$ DECLARE count BIGINT; BEGIN EXECUTE 'SELECT count (*) FROM car_portal_app.advertisement WHERE advertisement_date >= $1' USING some_date INTO count; RETURN count; END; $$ LANGUAGE plpgsql; Recommended practices for dynamic SQL usage Dynamic SQL can cause security issues if not handled carefully; dynamic SQL is vulnerable to the SQL injection technique. SQL injection is used to execute SQL statements that reveal secure information, or even to destroy data in a database. A very simple example of a PL/pgSQL function vulnerable to SQL injection is as follows: CREATE OR REPLACE FUNCTION car_portal_app.can_login (email text, pass text) RETURNS BOOLEAN AS $$ DECLARE stmt TEXT; result bool; BEGIN stmt = E'SELECT COALESCE (count(*)=1, false) FROM car_portal_app.account WHERE email = ''|| $1 || E'' and password = ''||$2||E'''; RAISE NOTICE '%' , stmt; EXECUTE stmt INTO result; RETURN result; END; $$ LANGUAGE plpgsql; The preceding function returns true if the email and the password match. To test this function, let's insert a row and try to inject some code, as follows: car_portal=> SELECT car_portal_app.can_login('jbutt@gmail.com', md5('jbutt@gmail.com')); NOTICE: SELECT COALESCE (count(*)=1, false) FROM account WHERE email = 'jbutt@gmail.com' and password = '1b9ef408e82e38346e6ebebf2dcc5ece' Can_login ----------- t (1 row) car_portal=> SELECT car_portal_app.can_login('jbutt@gmail.com', md5('jbutt@yahoo.com')); NOTICE: SELECT COALESCE (count(*)=1, false) FROM account WHERE email = 'jbutt@gmail.com' and password = '37eb43e4d439589d274b6f921b1e4a0d' can_login ----------- f (1 row) car_portal=> SELECT car_portal_app.can_login(E'jbutt@gmail.com'--', 'Do not know password'); NOTICE: SELECT COALESCE (count(*)=1, false) FROM account WHERE email = 'jbutt@gmail.com'--' and password = 'Do not know password' can_login ----------- t (1 row) Notice that the function returns true even when the password does not match the password stored in the table. This is simply because the predicate was commented, as shown by the raise notice: SELECT COALESCE (count(*)=1, false) FROM account WHERE email = 'jbutt@gmail.com'--' and password = 'Do not know password' To protect code against this technique, one could follow these practices: For parameterized dynamic SQL statements, use the USING clause. Use the format function with appropriate interpolation to construct your queries. Note that %I escapes the argument as an identifier and %L as a literal. Use quote_ident(), quote_literal(), and quote_nullable() to properly format your identifiers and literal. One way to write the preceding function is as follows: CREATE OR REPLACE FUNCTION car_portal_app.can_login (email text, pass text) RETURNS BOOLEAN AS $$ DECLARE stmt TEXT; result bool; BEGIN stmt = format('SELECT COALESCE (count(*)=1, false) FROM car_portal_app.account WHERE email = %Land password = %L', $1,$2); RAISE NOTICE '%' , stmt; EXECUTE stmt INTO result; RETURN result; END; $$ LANGUAGE plpgsql; We saw how dynamically SQL is used to build and execute queries on the fly. Unlike the static SQL statement, a dynamic SQL statements’ full text is unknown and can change between successive executions. These queries can be DDL, DCL, and/or DML statements. If you found this article useful, make sure to check out the book Learning PostgreSQL 10, to learn the fundamentals of PostgreSQL 10.

0
1
108850

article-image-working-with-pandas-dataframes

Sugandha Lahoti

23 Feb 2018

15 min read

Working with pandas DataFrames

Sugandha Lahoti

23 Feb 2018

15 min read

[box type="note" align="" class="" width=""]This article is an excerpt from the book Python Data Analysis - Second Edition written by Armando Fandango. From this book, you will learn how to process and manipulate data with Python for complex data analysis and modeling. Code bundle for this article is hosted on GitHub.[/box] The popular open source Python library, pandas is named after panel data (an econometric term) and Python data analysis. We shall learn about basic panda functionalities, data structures, and operations in this article. The official pandas documentation insists on naming the project pandas in all lowercase letters. The other convention the pandas project insists on, is the import pandas as pd import statement. We will follow these conventions in this text. In this tutorial, we will install and explore pandas. We will also acquaint ourselves with the a central pandas data structure–DataFrame. Installing and exploring pandas The minimal dependency set requirements for pandas is given as follows: NumPy: This is the fundamental numerical array package that we installed and covered extensively in the preceding chapters python-dateutil: This is a date handling library pytz: This handles time zone definitions This list is the bare minimum; a longer list of optional dependencies can be located at http://pandas.pydata.org/pandas-docs/stable/install.html. We can install pandas via PyPI with pip or easy_install, using a binary installer, with the aid of our operating system package manager, or from the source by checking out the code. The binary installers can be downloaded from http://pandas.pydata.org/getpandas.html. The command to install pandas with pip is as follows: $ pip3 install pandas rpy2 rpy2 is an interface to R and is required because rpy is being deprecated. You may have to prepend the preceding command with sudo if your user account doesn't have sufficient rights. The pandas DataFrames A pandas DataFrame is a labeled two-dimensional data structure and is similar in spirit to a worksheet in Google Sheets or Microsoft Excel, or a relational database table. The columns in pandas DataFrame can be of different types. A similar concept, by the way, was invented originally in the R programming language. (For more information, refer to http://www.r-tutor.com/r-introduction/data-frame). A DataFrame can be created in the following ways: Using another DataFrame. Using a NumPy array or a composite of arrays that has a two-dimensional shape. Likewise, we can create a DataFrame out of another pandas data structure called Series. We will learn about Series in the following section. A DataFrame can also be produced from a file, such as a CSV file. From a dictionary of one-dimensional structures, such as one-dimensional NumPy arrays, lists, dicts, or pandas Series. As an example, we will use data that can be retrieved from http://www.exploredata.net/Downloads/WHO-Data-Set. The original data file is quite large and has many columns, so we will use an edited file instead, which only contains the first nine columns and is called WHO_first9cols.csv; the file is in the code bundle of this book. These are the first two lines, including the header: Country,CountryID,Continent,Adolescent fertility rate (%),Adult literacy rate (%),Gross national income per capita (PPP international $),Net primary school enrolment ratio female (%),Net primary school enrolment ratio male (%),Population (in thousands) totalAfghanistan,1,1,151,28,,,,26088 In the next steps, we will take a look at pandas DataFrames and its attributes: To kick off, load the data file into a DataFrame and print it on the screen: from pandas.io.parsers import read_csv df = read_csv("WHO_first9cols.csv") print("Dataframe", df) The printout is a summary of the DataFrame. It is too long to be displayed entirely, so we will just grab the last few lines: 199 21732.0 200 11696.0 201 13228.0 [202 rows x 9 columns] The DataFrame has an attribute that holds its shape as a tuple, similar to ndarray. Query the number of rows of a DataFrame as follows: print("Shape", df.shape) print("Length", len(df)) The values we obtain comply with the printout of the preceding step: Shape (202, 9) Length 202 Check the column header and data types with the other attributes: print("Column Headers", df.columns) print("Data types", df.dtypes) We receive the column headers in a special data structure: Column Headers Index([u'Country', u'CountryID', u'Continent', u'Adolescent fertility rate (%)', u'Adult literacy rate (%)', u'Gross national income per capita (PPP international $)', u'Net primary school enrolment ratio female (%)', u'Net primary school enrolment ratio male (%)', u'Population (in thousands) total'], dtype='object') The data types are printed as follows: 4. The pandas DataFrame has an index, which is like the primary key of relational database tables. We can either specify the index or have pandas create it automatically. The index can be accessed with a corresponding property, as follows: Print("Index", df.index) An index helps us search for items quickly, just like the index in this book. In our case, the index is a wrapper around an array starting at 0, with an increment of one for each row: Sometimes, we wish to iterate over the underlying data of a DataFrame. Iterating over column values can be inefficient if we utilize the pandas iterators. It's much better to extract the underlying NumPy arrays and work with those. The pandas DataFrame has an attribute that can aid with this as well: print("Values", df.values) Please note that some values are designated nan in the output, for 'not a number'. These values come from empty fields in the input datafile: The preceding code is available in Python Notebook ch-03.ipynb, available in the code bundle of this book. Querying data in pandas Since a pandas DataFrame is structured in a similar way to a relational database, we can view operations that read data from a DataFrame as a query. In this example, we will retrieve the annual sunspot data from Quandl. We can either use the Quandl API or download the data manually as a CSV file from http://www.quandl.com/SIDC/SUNSPOTS_A-Sunspot-Numbers-Annual. If you want to install the API, you can do so by downloading installers from https://pypi.python.org/pypi/Quandl or by running the following command: $ pip3 install Quandl Using the API is free, but is limited to 50 API calls per day. If you require more API calls, you will have to request an authentication key. The code in this tutorial is not using a key. It should be simple to change the code to either use a key or read a downloaded CSV file. If you have difficulties, search through the Python docs at https://docs.python.org/2/. Without further preamble, let's take a look at how to query data in a pandas DataFrame: As a first step, we obviously have to download the data. After importing the Quandl API, get the data as follows: import quandl # Data from http://www.quandl.com/SIDC/SUNSPOTS_A-Sunspot-Numbers-Annual # PyPi url https://pypi.python.org/pypi/Quandl sunspots = quandl.get("SIDC/SUNSPOTS_A") The head() and tail() methods have a purpose similar to that of the Unix commands with the same name. Select the first n and last n records of a DataFrame, where n is an integer parameter: print("Head 2", sunspots.head(2) ) print("Tail 2", sunspots.tail(2)) This gives us the first two and last two rows of the sunspot data (for the sake of brevity we have not shown all the columns here; your output will have all the columns from the dataset): Head 2 Number Year 1700-12-31 5 1701-12-31 11 [2 rows x 1 columns] Tail 2 Number Year 2012-12-31 57.7 2013-12-31 64.9 [2 rows x 1 columns] Please note that we only have one column holding the number of sunspots per year. The dates are a part of the DataFrame index. The following is the query for the last value using the last date: last_date = sunspots.index[-1] print("Last value", sunspots.loc[last_date]) You can check the following output with the result from the previous step: Last value Number 64.9 Name: 2013-12-31 00:00:00, dtype: float64 Query the date with date strings in the YYYYMMDD format as follows: print("Values slice by date:n", sunspots["20020101": "20131231"]) This gives the records from 2002 through to 2013: Values slice by date Number Year 2002-12-31 104.0 [TRUNCATED] 2013-12-31 64.9 [12 rows x 1 columns] A list of indices can be used to query as well: print("Slice from a list of indices:n", sunspots.iloc[[2, 4, -4, -2]]) The preceding code selects the following rows: Slice from a list of indices Number Year 1702-12-31 16.0 1704-12-31 36.0 2010-12-31 16.0 2012-12-31 57.7 [4 rows x 1 columns] To select scalar values, we have two options. The second option given here should be faster. Two integers are required, the first for the row and the second for the column: print("Scalar with Iloc:", sunspots.iloc[0, 0]) print("Scalar with iat", sunspots.iat[1, 0]) This gives us the first and second values of the dataset as scalars: Scalar with Iloc 5.0 Scalar with iat 11.0 Querying with Booleans works much like the Where clause of SQL. The following code queries for values larger than the arithmetic mean. Note that there is a difference between when we perform the query on the whole DataFrame and when we perform it on a single column: print("Boolean selection", sunspots[sunspots > sunspots.mean()]) print("Boolean selection with column label:n", sunspots[sunspots['Number of Observations'] > sunspots['Number of Observations'].mean()]) The notable difference is that the first query yields all the rows, with some rows not conforming to the condition that has a value of NaN. The second query returns only the rows where the value is larger than the mean: Boolean selection Number Year 1700-12-31 NaN [TRUNCATED] 1759-12-31 54.0 ... [314 rows x 1 columns] Boolean selection with column label Number Year 1705-12-31 58.0 [TRUNCATED] 1870-12-31 139.1 ... [127 rows x 1 columns] The preceding example code is in the ch_03.ipynb file of this book's code bundle. Data aggregation with pandas DataFrames Data aggregation is a term used in the field of relational databases. In a database query, we can group data by the value in a column or columns. We can then perform various operations on each of these groups. The pandas DataFrame has similar capabilities. We will generate data held in a Python dict and then use this data to create a pandas DataFrame. We will then practice the pandas aggregation features: Seed the NumPy random generator to make sure that the generated data will not differ between repeated program runs. The data will have four columns: Weather (a string) Food (also a string) Price (a random float) Number (a random integer between one and nine) The use case is that we have the results of some sort of consumer-purchase research, combined with weather and market pricing, where we calculate the average of prices and keep a track of the sample size and parameters: import pandas as pd from numpy.random import seed from numpy.random import rand from numpy.random import rand_int import numpy as np seed(42) df = pd.DataFrame({'Weather' : ['cold', 'hot', 'cold','hot', 'cold', 'hot', 'cold'], 'Food' : ['soup', 'soup', 'icecream', 'chocolate', 'icecream', 'icecream', 'soup'], 'Price' : 10 * rand(7), 'Number' : rand_int(1, 9,)}) print(df) You should get an output similar to the following: Please note that the column labels come from the lexically ordered keys of the Python dict. Lexical or lexicographical order is based on the alphabetic order of characters in a string. Group the data by the Weather column and then iterate through the groups as follows: weather_group = df.groupby('Weather') i = 0 for name, group in weather_group: i = i + 1 print("Group", i, name) print(group) We have two types of weather, hot and cold, so we get two groups: The weather_group variable is a special pandas object that we get as a result of the groupby() method. This object has aggregation methods, which are demonstrated as follows: print("Weather group firstn", weather_group.first()) print("Weather group lastn", weather_group.last()) print("Weather group meann", weather_group.mean()) The preceding code snippet prints the first row, last row, and mean of each group: Just as in a database query, we are allowed to group on multiple columns. The groups attribute will then tell us the groups that are formed, as well as the rows in each group: wf_group = df.groupby(['Weather', 'Food']) print("WF Groups", wf_group.groups) For each possible combination of weather and food values, a new group is created. The membership of each row is indicated by their index values as follows: WF Groups {('hot', 'chocolate'): [3], ('cold', 'icecream'): [2, 4], ('hot', 'icecream'): [5], ('hot', 'soup'): [1], ('cold', 'soup'): [0, 6] 5. Apply a list of NumPy functions on groups with the agg() method: print("WF Aggregatedn", wf_group.agg([np.mean, np.median])) Obviously, we could apply even more functions, but it would look messier than the following output: Concatenating and appending DataFrames The pandas DataFrame allows operations that are similar to the inner and outer joins of database tables. We can append and concatenate rows as well. To practice appending and concatenating of rows, we will reuse the DataFrame from the previous section. Let's select the first three rows: print("df :3n", df[:3]) Check that these are indeed the first three rows: df :3 Food Number Price Weather 0 soup 8 3.745401 cold 1 soup 5 9.507143 hot 2 icecream 4 7.319939 cold The concat() function concatenates DataFrames. For example, we can concatenate a DataFrame that consists of three rows to the rest of the rows, in order to recreate the original DataFrame: print("Concat Back togethern", pd.concat([df[:3], df[3:]])) The concatenation output appears as follows: Concat Back together Food Number Price Weather 0 soup 8 3.745401 cold 1 soup 5 9.507143 hot 2 icecream 4 7.319939 cold 3 chocolate 8 5.986585 hot 4 icecream 8 1.560186 cold 5 icecream 3 1.559945 hot 6 soup 6 0.580836 cold [7 rows x 4 columns] To append rows, use the append() function: print("Appending rowsn", df[:3].append(df[5:])) The result is a DataFrame with the first three rows of the original DataFrame and the last two rows appended to it: Appending rows Food Number Price Weather 0 soup 8 3.745401 cold 1 soup 5 9.507143 hot 2 icecream 4 7.319939 cold 5 icecream 3 1.559945 hot 6 soup 6 0.580836 cold [5 rows x 4 columns] Joining DataFrames To demonstrate joining, we will use two CSV files-dest.csv and tips.csv. The use case behind it is that we are running a taxi company. Every time a passenger is dropped off at his or her destination, we add a row to the dest.csv file with the employee number of the driver and the destination: EmpNr,Dest5,The Hague3,Amsterdam9,Rotterdam Sometimes drivers get a tip, so we want that registered in the tips.csv file (if this doesn't seem realistic, please feel free to come up with your own story): EmpNr,Amount5,109,57,2.5 Database-like joins in pandas can be done with either the merge() function or the join() DataFrame method. The join() method joins onto indices by default, which might not be what you want. In SQL a relational database query language we have the inner join, left outer join, right outer join, and full outer join. An inner join selects rows from two tables, if and only if values match, for columns specified in the join condition. Outer joins do not require a match, and can potentially return more rows. More information on joins can be found at http://en.wikipedia.org/wiki/Join_%28SQL%29. All these join types are supported by pandas, but we will only take a look at inner joins and full outer joins: A join on the employee number with the merge() function is performed as follows: print("Merge() on keyn", pd.merge(dests, tips, on='EmpNr')) This gives an inner join as the outcome: Merge() on key EmpNr Dest Amount 0 5 The Hague 10 1 9 Rotterdam 5 [2 rows x 3 columns] Joining with the join() method requires providing suffixes for the left and right operands: print("Dests join() tipsn", dests.join(tips, lsuffix='Dest', rsuffix='Tips')) This method call joins index values so that the result is different from an SQL inner join: Dests join() tips EmpNrDest Dest EmpNrTips Amount 0 5 The Hague 5 10.0 1 3 Amsterdam 9 5.0 2 9 Rotterdam 7 2.5 [3 rows x 4 columns] An even more explicit way to execute an inner join with merge() is as follows: print("Inner join with merge()n", pd.merge(dests, tips, how='inner')) The output is as follows: Inner join with merge() EmpNr Dest Amount 0 5 The Hague 10 1 9 Rotterdam 5 [2 rows x 3 columns] To make this a full outer join requires only a small change: print("Outer joinn", pd.merge(dests, tips, how='outer')) The outer join adds rows with NaN values: Outer join EmpNr Dest Amount 0 5 The Hague 10.0 1 3 Amsterdam NaN 2 9 Rotterdam 5.0 3 7 NaN 2.5 [4 rows x 3 columns] In a relational database query, these values would have been set to NULL. The demo code is in the ch-03.ipynb file of this book's code bundle. We learnt how to perform various data manipulation techniques such as aggregating, concatenating, appending, cleaning, and handling missing values, with pandas. If you found this post useful, check out the book Python Data Analysis - Second Edition to learn advanced topics such as signal processing, textual data analysis, machine learning, and more.

0
0
16966

article-image-getting-started-with-apache-kafka-clusters

Amarabha Banerjee

23 Feb 2018

10 min read

Getting Started with Apache Kafka Clusters

Amarabha Banerjee

23 Feb 2018

10 min read

[box type="note" align="" class="" width=""]Below given article is a book excerpt from Apache Kafka 1.0 Cookbook written by Raúl Estrada. This book contains easy to follow recipes to help you set-up, configure and use Apache Kafka in the best possible manner.[/box] Here in this article, we are going to talk about how you can get started with Apache Kafka clusters and implement them seamlessly. In Apache Kafka there are three types of clusters: Single-node single-broker Single-node multiple-broker Multiple-node multiple-broker cluster The following four recipes show how to run Apache Kafka in these clusters. Configuring a single-node single-broker cluster – SNSB The first cluster configuration is single-node single-broker (SNSB). This cluster is very useful when a single point of entry is needed. Yes, its architecture resembles the singleton design pattern. A SNSB cluster usually satisfies three requirements: Controls concurrent access to a unique shared broker Access to the broker is requested from multiple, disparate producers There can be only one broker If the proposed design has only one or two of these requirements, a redesign is almost always the correct option. Sometimes, the single broker could become a bottleneck or a single point of failure. But it is useful when a single point of communication is needed. Getting ready Go to the Kafka installation directory (/usr/local/kafka/ for macOS users and /opt/kafka/ for Linux users): > cd /usr/local/kafka How to do it... The diagram shows an example of an SNSB cluster: Starting ZooKeeper Kafka provides a simple ZooKeeper configuration file to launch a single ZooKeeper instance. To install the ZooKeeper instance, use this command: > bin/zookeeper-server-start.sh config/zookeeper.properties The main properties specified in the zookeeper.properties file are: clientPort: This is the listening port for client requests. By default, ZooKeeper listens on TCP port 2181: clientPort=2181 dataDir: This is the directory where ZooKeeper is stored: dataDir=/tmp/zookeeper means unbounded): maxClientCnxns=0 For more information about Apache ZooKeeper visit the project home page at: http://zookeeper.apache.org/. Starting the broker After ZooKeeper is started, start the Kafka broker with this command: > bin/kafka-server-start.sh config/server.properties The main properties specified in the server.properties file are: broker.id: The unique positive integer identifier for each broker: broker.id=0 log.dir: Directory to store log files: log.dir=/tmp/kafka10-logs num.partitions: The number of log partitions per topic: num.partitions=2 port: The port that the socket server listens on: port=9092 zookeeper.connect: The ZooKeeper URL connection: zookeeper.connect=localhost:2181 How it works Kafka uses ZooKeeper for storing metadata information about the brokers, topics, and partitions. Writes to ZooKeeper are performed only on changes of consumer group membership or on changes to the Kafka cluster itself. This amount of traffic is minimal, and there is no need for a dedicated ZooKeeper ensemble for a single Kafka cluster. Actually, many deployments use a single ZooKeeper ensemble to control multiple Kafka clusters (using a chroot ZooKeeper path for each cluster). SNSB – creating a topic, producer, and consumer The SNSB Kafka cluster is running; now let's create topics, producer, and consumer. Getting ready We need the previous recipe executed: Kafka already installed ZooKeeper up and running A Kafka server up and running Now, go to the Kafka installation directory (/usr/local/kafka/ for macOS users and /opt/kafka/ for Linux users): > cd /usr/local/kafka How to do it The following steps will show you how to create an SNSB topic, producer, and consumer. Creating a topic As we know, Kafka has a command to create topics. Here we create a topic called SNSBTopic with one partition and one replica: > bin/kafka-topics.sh --create --zookeeper localhost:2181 -- replication-factor 1 --partitions 1 --topic SNSBTopic We obtain the following output: Created topic "SNSBTopic". The command parameters are: --replication-factor 1: This indicates just one replica --partition 1: This indicates just one partition --zookeeper localhost:2181: This indicates the ZooKeeper URL As we know, to get the list of topics on a Kafka server we use the following command: > bin/kafka-topics.sh --list --zookeeper localhost:2181 We obtain the following output: SNSBTopic Starting the producer Kafka has a command to start producers that accepts inputs from the command line and publishes each input line as a message. By default, each new line is considered a message: > bin/kafka-console-producer.sh --broker-list localhost:9092 -- topic SNSBTopic This command requires two parameters: broker-list: The broker URL to connect to topic: The topic name (to send a message to the topic subscribers) Now, type the following in the command line: The best thing about a boolean is [Enter] even if you are wrong [Enter] you are only off by a bit. [Enter] This output is obtained (as expected): The best thing about a boolean is even if you are wrong you are only off by a bit. The producer.properties file has the producer configuration. Some important properties defined in the producer.properties file are: metadata.broker.list: The list of brokers used for bootstrapping information on the rest of the cluster in the format host1:port1, host2:port2: metadata.broker.list=localhost:9092 compression.codec: The compression codec used. For example, none, gzip, and snappy: compression.codec=none Starting the consumer Kafka has a command to start a message consumer client. It shows the output in the command line as soon as it has subscribed to the topic: > bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic SNSBTopic --from-beginning Note that the parameter from-beginning is to show the entire log: The best thing about a boolean is even if you are wrong you are only off by a bit. One important property defined in the consumer.properties file is: group.id: This string identifies the consumers in the same group: group.id=test-consumer-group There's more It is time to play with this technology. Open a new command-line window for ZooKeeper, a broker, two producers, and two consumers. Type some messages in the producers and watch them get displayed in the consumers. If you don't know or don't remember how to run the commands, run it with no arguments to display the possible values for the Parameters. Configuring a single-node multiple-broker cluster – SNMB The second cluster configuration is single-node multiple-broker (SNMB). This cluster is used when there is just one node but inner redundancy is needed. When a topic is created in Kafka, the system determines how each replica of a partition is mapped to each broker. In general, Kafka tries to spread the replicas across all available brokers. The messages are first sent to the first replica of a partition (to the current broker leader of that partition) before they are replicated to the remaining brokers. The producers may choose from different strategies for sending messages (synchronous or asynchronous mode). Producers discover the available brokers in a cluster and the partitions on each (all this by registering watchers in ZooKeeper). In practice, some of the high volume topics are configured with more than one partition per broker. Remember that having more partitions increases the I/O parallelism for writes and this increases the degree of parallelism for consumers (the partition is the unit for distributing data to consumers). On the other hand, increasing the number of partitions increases the overhead because: There are more files, so more open file handlers There are more offsets to be checked by consumers, so the ZooKeeper load is increased The art of this is to balance these tradeoffs. Getting ready Go to the Kafka installation directory (/usr/local/kafka/ for macOS users and /opt/kafka/ for Linux users): > cd /usr/local/kafka The following diagram shows an example of an SNMB cluster: How to do it Begin starting the ZooKeeper server as follows: > bin/zookeeper-server-start.sh config/zookeeper.properties A different server.properties file is needed for each broker. Let's call them: server-1.properties, server-2.properties, server-3.properties, and so on (original, isn't it?). Each file is a copy of the original server.properties file. In the server-1.properties file set the following properties: broker.id=1 port=9093 log.dir=/tmp/kafka-logs-1 Similarly, in the server-2.properties file set the following properties: broker.id=2 port=9094 log.dir=/tmp/kafka-logs-2 Finally, in the server-3.properties file set the following properties: broker.id=3 port=9095 log.dir=/tmp/kafka-logs-3 With ZooKeeper running, start the Kafka brokers with these commands: > bin/kafka-server-start.sh config/server-1.properties > bin/kafka-server-start.sh config/server-2.properties > bin/kafka-server-start.sh config/server-3.properties How it works Now the SNMB cluster is running. The brokers are running on the same Kafka node, on ports 9093, 9094, and 9095. SNMB – creating a topic, producer, and consumer The SNMB Kafka cluster is running; now let's create topics, producer, and consumer. Getting ready We need the previous recipe executed: Kafka already installed ZooKeeper up and running A Kafka server up and running Now, go to the Kafka installation directory (/usr/local/kafka/ for macOS users and /opt/kafka/ for Linux users): > cd /usr/local/kafka How to do it The following steps will show you how to create an SNMB topic, producer, and consumer Creating a topic Using the command to create topics, let's create a topic called SNMBTopic with two partitions and two replicas: > bin/kafka-topics.sh --create --zookeeper localhost:2181 -- replication-factor 2 --partitions 3 --topic SNMBTopic The following output is displayed: Created topic "SNMBTopic" This command has the following effects: Kafka will create three logical partitions for the topic. Kafka will create two replicas (copies) per partition. This means, for each partition it will pick two brokers that will host those replicas. For each partition, Kafka will randomly choose a broker Leader. Now ask Kafka for the list of available topics. The list now includes the new SNMBTopic: > bin/kafka-topics.sh --zookeeper localhost:2181 --list SNMBTopic Starting a producer Now, start the producers; indicating more brokers in the broker-list is easy: > bin/kafka-console-producer.sh --broker-list localhost:9093, localhost:9094, localhost:9095 --topic SNMBTopic If it's necessary to run multiple producers connecting to different brokers, specify a different broker list for each producer. Starting a consumer To start a consumer, use the following command: > bin/kafka-console-consumer.sh -- zookeeper localhost:2181 --frombeginning --topic SNMBTopic How it works The first important fact is the two parameters: replication-factor and partitions. The replication-factor is the number of replicas each partition will have in the topic created. The partitions parameter is the number of partitions for the topic created. There's more If you don't know the cluster configuration or don't remember it, there is a useful option for the kafka-topics command, the describe parameter: > bin/kafka-topics.sh --zookeeper localhost:2181 --describe --topic SNMBTopic The output is something similar to: Topic:SNMBTopic PartitionCount:3 ReplicationFactor:2 Configs: Topic: SNMBTopic Partition: 0 Leader: 2 Replicas: 2,3 Isr: 3,2 Topic: SNMBTopic Partition: 1 Leader: 3 Replicas: 3,1 Isr: 1,3 Topic: SNMBTopic Partition: 2 Leader: 1 Replicas: 1,2 Isr: 1,2 An explanation of the output: the first line gives a summary of all the partitions; each line gives information about one partition. Since we have three partitions for this topic, there are three lines: Leader: This node is responsible for all reads and writes for a particular partition. For a randomly selected section of the partitions each node is the leader. Replicas: This is the list of nodes that duplicate the log for a particular partition irrespective of whether it is currently alive. Isr: This is the set of in-sync replicas. It is a subset of the replicas currently alive and following the leader. In order to see the options for: create, delete, describe, or change a topic, type this command without parameters: > bin/kafka-topics.sh We discussed how to implement Apache Kafka clusters effectively. If you liked this post, be sure to check out Apache Kafka 1.0 Cookbook which consists of useful recipes to work with Apache Kafka installation.

0
1
7376

article-image-introduction-wordpress-plugin

Packt

22 Feb 2018

13 min read

Introduction to WordPress Plugin

Packt

22 Feb 2018

13 min read

In this article, Yannick Lefebvre, author of Wordpress Plugin Development Cookbook, Second Edition will cover the following recipes: Creating a new shortcode with parameters Managing multiple sets of user settings from a single admin page WordPress shortcodes are a simple, yet powerful tool that can be used to automate the insertion of code into web pages. For example, a shortcode could be used to automate the insertion of videos from a third-party platform that is not supported natively by WordPress, or embed content from a popular web site. By following the two code samples found in this article, you will learn how to create a WordPress plugin that defines your own shortcode to be able to quickly embed Twitter feeds on a web site. You will also learn how to create an administration configuration panel to be able to create a set of configurations that can be referenced when using your newly-created shortcode. Creating a new shortcode with parameters While simple shortcodes already provide a lot of potential to output complex content to a page by entering a few characters in the post editor, shortcodes become even more useful when they are coupled with parameters that will be passed to their associated processing function. Using this technique, it becomes very easy to create a shortcode that accelerates the insertion of external content in WordPress posts or pages by only needing to specify the shortcode and the unique identifier of the source element to be displayed. We will illustrate this concept in this recipe by creating a shortcode that will be used to quickly add Twitter feeds to posts or pages. How to do it... Navigate to the WordPress plugin directory of your development installation. Create a new directory called ch3-twitter-embed. Navigate to this directory and create a new text file called ch3-twitter-embed.php. Open the new file in a code editor and add an appropriate header at the top of the plugin file, naming the plugin Chapter 2 - Twitter Embed. Add the following line of code to declare a new shortcode and specify the name of the function that should be called when the shortcode is found in posts or pages: add_shortcode( 'twitterfeed', 'ch3te_twitter_embed_shortcode' ); Add the following code section to provide an implementation for the ch3te_twitter_embed_shortcode function: function ch3te_twitter_embed_shortcode( $atts ) { extract( shortcode_atts( array( 'user_name' => 'ylefebvre' ), $atts ) ); if ( !empty( $user_name ) ) { $output = '<a class="twitter-timeline" href="'; $output .= esc_url( 'https://twitter.com/' . $user_name ); $output .= '">Tweets by ' . esc_html( $user_name ); $output .= '</a><script async '; $output .= 'src="//platform.twitter.com/widgets.js"'; $output .= ' charset="utf-8"></script>'; } else { $output = ''; } return $output; }. Save and close the plugin file. Log in to the administration page of your development WordPress installation. Click on Plugins in the left-hand navigation menu. Activate your new plugin. Create a new page and use the shortcode [twitterfeed user_name='WordPress'] in the page editor, where WordPress is the Twitter username of the feed to display: Save and view the page to see that the shortcode was replaced by an embedded Twitter feed on your site. Edit the page and remove the user_name parameter and its associated value, only leaving the core [twitterfeed] shortcode in the post and Save. Refresh the page and see that the feed is still being displayed but now shows tweets from another account. How it works... When shortcodes are used with parameters, these extra pieces of data are sent to the associated processing function in the $atts parameter variable. By using a combination of the standard PHP extract and WordPress-specific shortcode_atts functions, our plugin is able to parse the data sent to the shortcode and create an array of identifiers and values that are subsequently transformed into PHP variables that we can use in the rest of our shortcode implementation function. In this specific example, we expect a single variable to be used, called user_name, which will be stored in a PHP variable called $user_name. If the user enters the shortcode without any parameter, a default value of ylefebvre will be assigned to the username variable to ensure that the plugin still works. Since we are going to accept user input in this code, we also verify that the user did not provide an empty string and we use the esc_html and esc_url functions to remove any potentially harmful HTML characters from the input string and make sure that the link destination URL is valid. Once we have access to the twitter username, we can put together the required HTML code that will embed a Twitter feed in our page and display the selected user's tweets. While this example only has one argument, it is possible to define multiple parameters for a shortcode. Managing multiple sets of user settings from a single admin page Throughout this article, you have learned how to create configuration pages to manage single sets of configuration options for our plugins. In some cases, only being able to specify a single set of options will not be enough. For example, looking back at the Twitter embed shortcode plugin that was created, a single configuration panel would only allow users to specify one set of options, such as the desired twitter feed dimensions or the number of tweets to display. A more flexible solution would be to allow users to specify multiple sets of configuration options, which could then be called up by using an extra shortcode parameter (for example, [twitterfeed user_name="WordPress" option_id="2"]). While the first thought that might cross your mind to configure such a plugin is to create a multi-level menu item with submenus to store a number of different settings, this method would produce a very awkward interface for users to navigate. A better way is to use a single panel but give the user a way to select between multiple sets of options to be modified. In this recipe, you will learn how to enhance the previously created Twitter feed shortcode plugin to be able to control the embedded feed size and number of tweets to display from the plugin configuration panel and to give users the ability to specify multiple display sizes. Getting ready You should have already followed the Creating a new shortcode with parameters recipe in the article to have a starting point for this recipe. Alternatively, you can get the resulting code (Chapter 2/ch3-twitter-embed/ch3-twitter-embed.php) from the downloaded code bundle. How to do it... Navigate to the ch3-twitter-embed folder of the WordPress plugin directory of your development installation. Open the ch3-twitter-embed.php file in a text editor. Add the following lines of code to implement an activation callback to initialize plugin options when it is installed or upgraded: register_activation_hook( __FILE__, 'ch3te_set_default_options_array' ); function ch3te_set_default_options_array() { ch3te_get_options(); } Introduction to WordPress Plugin [ 6 ] function ch3te_get_options( $id = 1 ) { $options = get_option( 'ch3te_options_' . $id, array() ); $new_options['setting_name'] = 'Default'; $new_options['width'] = 560; $new_options['number_of_tweets'] = 3; $merged_options = wp_parse_args( $options, $new_options ); $compare_options = array_diff_key( $new_options, $options ); if ( empty( $options ) || !empty( $compare_options ) ) { update_option( 'ch3te_options_' . $id, $merged_options ); } return $merged_options; } Insert the following code segment to register a function to be called when the administration menu is put together. When this happens, the callback function adds an item to the Settings menu and specifies the function to be called to render the configuration page: // Assign function to be called when admin menu is constructed add_action( 'admin_menu', 'ch3te_settings_menu' ); // Function to add item to Settings menu and // specify function to display options page content function ch3te_settings_menu() { add_options_page( 'Twitter Embed Configuration', 'Twitter Embed', 'manage_options', 'ch3te-twitter-embed', 'ch3te_config_page' ); Add the following code to implement the configuration page rendering function: // Function to display options page content function ch3te_config_page() { // Retrieve plugin configuration options from database if ( isset( $_GET['option_id'] ) ) { $option_id = intval( $_GET['option_id'] ); } elseif ( isset( $_POST['option_id'] ) ) { $option_id = intval( $_POST['option_id'] ); } else { Introduction to WordPress Plugin [ 7 ] $option_id = 1; } $options = ch3te_get_options( $option_id ); ?> <div id="ch3te-general" class="wrap"> <h3>Twitter Embed</h3>  <?php if ( isset( $_GET['message'] ) && $_GET['message'] == '1' ) { ?> <div id='message' class='updated fade'> <p><strong>Settings Saved</strong></p></div> <?php } ?>  <div id="icon-themes" class="icon32"><br></div> <h3 class="nav-tab-wrapper"> <?php for ( $counter = 1; $counter <= 5; $counter++ ) { $temp_options = ch3te_get_options( $counter); $class = ( $counter == $option_id ) ? ' nav-tabactive' : ''; ?> <a class="nav-tab<?php echo $class; ?>" href="<?php echo add_query_arg( array( 'page' => 'ch3te-twitterembed', 'option_id' => $counter ), admin_url( 'options-general.php' ) ); ?>"><?php echo $counter; ?><?php if ( $temp_options !== false ) echo ' (' . $temp_options['setting_name'] . ')'; else echo ' (Empty)'; ?></a> <?php } ?> </h3><br />  <form name="ch3te_options_form" method="post" action="admin-post.php"> <input type="hidden" name="action" value="save_ch3te_options" /> <input type="hidden" name="option_id" value="<?php echo $option_id; ?>" /> <?php wp_nonce_field( 'ch3te' ); ?> <table> <tr><td>Setting name</td> <td><input type="text" name="setting_name" value="<?php echo esc_html( $options['setting_name'] ); ?>"/> </td> </tr> <tr><td>Feed width</td> <td><input type="text" name="width" Introduction to WordPress Plugin [ 8 ] value="<?php echo esc_html( $options['width'] ); ?>"/></td> </tr> <tr><td>Number of Tweets to display</td> <td><input type="text" name="number_of_tweets" value="<?php echo esc_html( $options['height'] ); ?>"/></td> </tr> </table><br /> <input type="submit" value="Submit" class="buttonprimary" /> </form> </div> <?php } Add the following block of code to register a function that will process user options when submitted to the site: add_action( 'admin_init', 'ch3te_admin_init' ); function ch3te_admin_init() { add_action( 'admin_post_save_ch3te_options', 'process_ch3te_options' ); Add the following code to implement the process_ch3te_options function, declared in the previous block of code, and to declare a utility function used to clean the redirection path: // Function to process user data submission function process_ch3te_options() { // Check that user has proper security level if ( !current_user_can( 'manage_options' ) ) { wp_die( 'Not allowed' ); } // Check that nonce field is present check_admin_referer( 'ch3te' ); // Check if option_id field was present if ( isset( $_POST['option_id'] ) ) { $option_id = intval( $_POST['option_id'] ); } else { $option_id = 1; } // Build option name and retrieve options $options = ch3te_get_options( $option_id ); // Cycle through all text fields and store their Introduction to WordPress Plugin [ 9 ] values foreach ( array( 'setting_name' ) as $param_name ) { if ( isset( $_POST[$param_name] ) ) { $options[$param_name] = sanitize_text_field( $_POST[$param_name] ); } } // Cycle through all numeric fields, convert to int and store foreach ( array( 'width', 'number_of_tweets' ) as $param_name ) { if ( isset( $_POST[$param_name] ) ) { $options[$param_name] = intval( $_POST[$param_name] ); } } // Store updated options array to database $options_name = 'ch3te_options_' . $option_id; update_option( $options_name, $options ); $cleanaddress = add_query_arg( array( 'message' => 1, 'option_id' => $option_id, 'page' => 'ch3te-twitter-embed' ), admin_url( 'options-general.php' ) ); wp_redirect( $cleanaddress ); exit; } // Function to process user data submission function process_ch3te_options() { // Check that user has proper security level if ( !current_user_can( 'manage_options' ) ) { wp_die( 'Not allowed' ); } // Check that nonce field is present check_admin_referer( 'ch3te' ); // Check if option_id field was present if ( isset( $_POST['option_id'] ) ) { $option_id = intval( $_POST['option_id'] ); } else { $option_id = 1; } // Build option name and retrieve options $options = ch3te_get_options( $option_id ); // Cycle through all text fields and store their values foreach ( array( 'setting_name' ) as $param_name ) { if ( isset( $_POST[$param_name] ) ) { $options[$param_name] = sanitize_text_field( $_POST[$param_name] ); } } Find the ch3te_twitter_embed_shortcode function and modify it as follows to accept the new option_id parameter and load the plugin options to produce the desired output. The changes are identified in bold within the recipe: function ch3te_twitter_embed_shortcode( $atts ) { extract( shortcode_atts( array( 'user_name' => 'ylefebvre', 'option_id' => '1' ), $atts ) ); if ( intval( $option_id ) < 1 || intval( $option_id ) > 5 ) { $option_id = 1; } $options = ch3te_get_options( $option_id ); if ( !empty( $user_name ) ) { $output = '<a class="twitter-timeline" href="'; $output .= esc_url( 'https://twitter.com/' . $user_name ); $output .= '" data-width="' . $options['width'] . Save and close the plugin file. Deactivate and then Activate the Chapter 2 - Twitter Embed plugin from the administration interface to execute its activation function and create default settings. Navigate to the Settings menu and select the Twitter Embed submenu item to see the newly created configuration panel with the first set of options being displayed and more sets of options accessible through the drop-down list shown at the top of the page. To select the set of options to be used, add the parameter option_id to the shortcode used to display a Twitter feed, as follows: [twitterfeed user_name="WordPress" option_id="1"] How it works... This recipe shows how we can leverage options arrays to create multiple sets of options simply by creating the name of the options array on the fly. Instead of having a specific option name in the first parameter of the get_option function call, we create a string with an option ID. This ID is sent through as a URL parameter on the configuration page and as a hidden text field when processing the form data. On initialization, the plugin only creates a single set of options, which is probably enough for most casual users of the plugin. Doing so will avoid cluttering the site database with useless options. When the user requests to view one of the empty option sets, the plugin creates a new set of options right before rendering the options page. The rest of the code is very similar to the other examples that we saw in this article, since the way to access the array elements remains the same. Summary In this article, the author has explained about the entire process of how to create a new shortcode with parameters and how to manage multiple sets of user settings from a single admin page.

0
0
36354

article-image-how-to-perform-numeric-metric-aggregations-with-elasticsearch

Pravin Dhandre

22 Feb 2018

7 min read

How to perform Numeric Metric Aggregations with Elasticsearch

Pravin Dhandre

22 Feb 2018

7 min read

[box type="note" align="" class="" width=""]This article is an excerpt from the book Learning Elastic Stack 6.0 written by Pranav Shukla and Sharath Kumar M N . This book provides detailed coverage on fundamentals of each components of Elastic Stack, making it easy to search, analyze and visualize data across different sources in real-time.[/box] Today, we are going to demonstrate how to run numeric and statistical queries such as summation, average, count and various similar metric aggregations on Elastic Stack to serve a better analytics engine on your dataset. Metric aggregations Metric aggregations work with numeric data, computing one or more aggregate metrics within the given context. The context could be a query, filter, or no query to include the whole index/type. Metric aggregations can also be nested inside other bucket aggregations. In this case, these metrics will be computed for each bucket in the bucket aggregations. We will start with simple metric aggregations without nesting them inside bucket aggregations. When we learn about bucket aggregations later in the chapter, we will also learn how to use metric aggregations inside bucket aggregations. We will learn about the following metric aggregations: Sum, average, min, and max aggregations Stats and extended stats aggregations Cardinality aggregation Let us learn about them one by one. Sum, average, min, and max aggregations Finding the sum of a field, the minimum value for a field, the maximum value for a field, or an average, are very common operations. For the people who are familiar with SQL, the query to find the sum would look like the following: SELECT sum(downloadTotal) FROM usageReport; The preceding query will calculate the sum of the downloadTotal field across all records in the table. This requires going through all records of the table or all records in the given context and adding the values of the given fields. In Elasticsearch, a similar query can be written using the sum aggregation. Let us understand the sum aggregation first. Sum aggregation Here is how to write a simple sum aggregation: GET bigginsight/_search { "aggregations": { 1 "download_sum": { 2 "sum": { 3 "field": "downloadTotal" 4 } } }, "size": 0 5 } The aggs or aggregations element at the top level should wrap any aggregation. Give a name to the aggregation; here we are doing the sum aggregation on the downloadTotal field and hence the name we chose is download_sum. You can name it anything. This field will be useful while looking up this particular aggregation's result in the response. We are doing a sum aggregation, hence the sum element. We want to do term aggregation on the downloadTotal field. Specify size = 0 to prevent raw search results from being returned. We just want aggregation results and not the search results in this case. Since we haven't specified any top level query elements, it matches all documents. We do not want any raw documents (or search hits) in the result. The response should look like the following: { "took": 92, ... "hits": { "total": 242836, 1 "max_score": 0, "hits": [] }, "aggregations": { 2 "download_sum": { 3 "value": 2197438700 4 } } } Let us understand the key aspects of the response. The key parts are numbered 1, 2, 3, and so on, and are explained in the following points: The hits.total element shows the number of documents that were considered or were in the context of the query. If there was no additional query or filter specified, it will include all documents in the type or index. Just like the request, this response is wrapped inside aggregations to indicate as Such. The response of the aggregation requested by us was named download_sum, hence we get our response from the sum aggregation inside an element with the same name. The actual value after applying the sum aggregation. The average, min, and max aggregations are very similar. Let's look at them briefly. Average aggregation The average aggregation finds an average across all documents in the querying context: GET bigginsight/_search { "aggregations": { "download_average": { 1 "avg": { 2 "field": "downloadTotal" } } }, "size": 0 } The only notable differences from the sum aggregation are as follows: We chose a different name, download_average, to make it apparent that the aggregation is trying to compute the average. The type of aggregation that we are doing is avg instead of the sum aggregation that we were doing earlier. The response structure is identical but the value field will now represent the average of the requested field. The min and max aggregations are the exactly same. Min aggregation Here is how we will find the minimum value of the downloadTotal field in the entire index/type: GET bigginsight/_search { "aggregations": { "download_min": { "min": { "field": "downloadTotal" } } }, "size": 0 } Let's finally look at max aggregation also. Max aggregation Here is how we will find the maximum value of the downloadTotal field in the entire index/type: GET bigginsight/_search { "aggregations": { "download_max": { "max": { "field": "downloadTotal" } } }, "size": 0 } These aggregations were really simple. Now let's look at some more advanced yet simple stats and extended stats aggregations. Stats and extended stats aggregations These aggregations compute some common statistics in a single request without having to issue multiple requests. This saves resources on the Elasticsearch side as well because the statistics are computed in a single pass rather than being requested multiple times. The client code also becomes simpler if you are interested in more than one of these statistics. Let's look at the stats aggregation first. Stats aggregation The stats aggregation computes the sum, average, min, max, and count of documents in a single pass: GET bigginsight/_search { "aggregations": { "download_stats": { "stats": { "field": "downloadTotal" } } }, "size": 0 } The structure of the stats request is the same as the other metric aggregations we have seen so far, so nothing special is going on here. The response should look like the following: { "took": 4, ..., "hits": { "total": 242836, "max_score": 0, "hits": [] }, "aggregations": { "download_stats": { "count": 242835, "min": 0, "max": 241213, "avg": 9049.102065188297, "sum": 2197438700 } } } As you can see, the response with the download_stats element contains count, min, max, average, and sum; everything is included in the same response. This is very handy as it reduces the overhead of multiple requests and also simplifies the client code. Let us look at the extended stats aggregation. Extended stats Aggregation The extended stats aggregation returns a few more statistics in addition to the ones returned by the stats aggregation: GET bigginsight/_search { "aggregations": { "download_estats": { "extended_stats": { "field": "downloadTotal" } } }, "size": 0 } The response looks like the following: { "took": 15, "timed_out": false, ..., "hits": { "total": 242836, "max_score": 0, "hits": [] }, "aggregations": { "download_estats": { "count": 242835, "min": 0, "max": 241213, "avg": 9049.102065188297, "sum": 2197438700, "sum_of_squares": 133545882701698, "variance": 468058704.9782911, "std_deviation": 21634.664429528162, "std_deviation_bounds": { "upper": 52318.43092424462, "lower": -34220.22679386803 } } } } It also returns the sum of squares, variance, standard deviation, and standard deviation Bounds. Cardinality aggregation Finding the count of unique elements can be done with the cardinality aggregation. It is similar to finding the result of a query such as the following: select count(*) from (select distinct username from usageReport) u; Finding the cardinality or the number of unique values for a specific field is a very common requirement. If you have click-stream from the different visitors on your website, you may want to find out how many unique visitors you got in a given day, week, or month. Let us understand how we find out the count of unique users for which we have network traffic data: GET bigginsight/_search { "aggregations": { "unique_visitors": { "cardinality": { "field": "username" } } }, "size": 0 } The cardinality aggregation response is just like the other metric aggregations: { "took": 110, ..., "hits": { "total": 242836, "max_score": 0, "hits": [] }, "aggregations": { "unique_visitors": { "value": 79 } } } To summarize, we learned how to perform numerous metric aggregations on numeric datasets and easily deploy elasticsearch in building powerful analytics application. If you found this tutorial useful, do check out the book Learning Elastic Stack 6.0 to examine the fundamentals of Elastic Stack in detail and start developing solutions for problems like logging, site search, app search, metrics and more.

0
0
23309

article-image-combine-data-files-within-ibm-spss-modeler

Amey Varangaonkar

22 Feb 2018

6 min read

How to combine data files within IBM SPSS Modeler

Amey Varangaonkar

22 Feb 2018

6 min read

[box type="note" align="" class="" width=""]The following extract is taken from the book IBM SPSS Modeler Essentials, written by Keith McCormick and Jesus Salcedo. SPSS Modeler is one of the popularly used enterprise tools for data mining and predictive analytics. [/box] In this article, we will explore how SPSS Modeler can be effectively used to combine different file types for efficient data modeling. In many organizations, different pieces of information for the same individuals are held in separate locations. To be able to analyze such information within Modeler, the data files must be combined into one single file. The Merge node joins two or more data sources so that information held for an individual in different locations can be analyzed collectively. The following diagram shows how the Merge node can be used to combine two separate data files that contain different types of information: Like the Append node, the Merge node is found in the Record Ops palette. This node takes multiple data sources and creates a single source containing all or some of the input fields. Let's go through an example of how to use the Merge node to combine data files: Open the Merge stream. The Merge stream contains the files we previously appended, as well as the main data file we were working with in earlier chapters. 2. Place a Merge node from the Record Ops palette on the canvas. 3. Connect the last Reclassify node to the Merge node. 4. Connect the Filter node to the Merge node. [box type="info" align="" class="" width=""]Like the Append node, the order in which data sources are connected to the Merge node impacts the order in which the sources are displayed. The fields of the first source connected to the Merge node will appear first, followed by the fields of the second source connected to the Merge node, and so on.[/box] 5. Connect the Merge node to a Table node: 6. Edit the Merge node. Since the Merge node can cope with a variety of different situations, the Merge tab allows you to specify the merging method. There are four methods for merging: Order: It joins the first record in the first dataset with the first record in the second dataset, and so on. If any of the datasets run out of records, no further output records are produced. This method can be dangerous if there happens to be any cases that are missing from a file, or if files have been sorted differently. Keys: It is the most commonly used method, used when records that have the same value in the field(s) defined as the key are merged. If multiple records contain the same value on the key field, all possible merges are returned. Condition: It joins records from files that meet a specified condition. Ranked condition: It specifies whether each row pairing in the primary dataset and all secondary datasets are to be merged; use the ranking expression to sort any multiple matches from low to high order. Let's combine these files. To do this: Set Merge Method to Keys. Fields contained in all input sources appear in the Possible keys list. To identify one of more fields as the key field(s), move the selected field into the Keys for merge list. In our case, there are two fields that appear in both files, ID and Year. 2. Select ID in the Possible keys list and place it into the Keys for merge list: There are five major methods of merging using a key field: Include only matching records (inner join) merges only complete records, that is, records; that are available in all datasets. Include matching and non-matching records (full outer join) merges records that appear in any of the datasets; that is, the incomplete records are still retained. The undefined value ($null$) is added to the missing fields and included in the output. Include matching and selected non-matching records (partial outerjoin) performs left and right outer-joins. All records from the specified file are retained, along with only those records from the other file(s) that match records in the specified file on the key field(s). The Select... button allows you to designate which file is to contribute incomplete records. Include records in first dataset not matching any others (anti-join) provides an easy way of identifying records in a dataset that do not have records with the same key values in any of the other datasets involved in the merge. This option only retains records from the dataset that match with no other records. Combine duplicate key fields is the final option in this dialog, and it deals with the problem of duplicate field names (one from each dataset) when key fields are used. This option ensures that there is only one output field with a given name, and this is enabled by default. The Filter tab The Filter tab lists the data sources involved in the merge, and the ordering of the sources determines the field ordering of the merged data. Here, you can rename and remove fields. Earlier, we saw that the field Year appeared in both datasets; here we can remove one version of this field (we could also rename one version of the field to keep both): Click on the arrow next to the second Year field: The second Year field will no longer appear in the combined data file. The Optimization tab The Optimization tab provides two options that allow you to merge data more efficiently when one input dataset is significantly larger than the other datasets, or when the data is already presorted by all or some of the key fields that you are using to merge: Click OK. Run the Table: All of these files have now been combined. The resulting table should have 44 fields and 143,531 records. We saw how the Merge node is used to join data files that contain different information for the same records. If you found this post useful, make sure to check out IBM SPSS Modeler Essentials for more information on leveraging SPSS Modeler to get faster and efficient insights from your data.

0
0
11713

article-image-working-with-kafka-streams

Amarabha Banerjee

22 Feb 2018

6 min read

Working with Kafka Streams

Amarabha Banerjee

22 Feb 2018

6 min read

This article is a book excerpt from Apache Kafka 1.0 Cookbook written by Raúl Estrada. This book will simplify real-time data processing by leveraging Apache Kafka 1.0. In today’s tutorial we are going to discuss how to work with Apache Kafka Streams efficiently. In the data world, a stream is linked to the most important abstractions. A stream depicts a continuously updating and unbounded process. Here, unbounded means unlimited size. By definition, a stream is a fault-tolerant, replayable, and ordered sequence of immutable data records. A data record is defined as a key-value pair. Before we proceed, some concepts need to be defined: Stream processing application: Any program that utilizes the Kafka streams library is known as a stream processing application. Processor topology: This is a topology that defines the computational logic of the data processing that a stream processing application requires to be performed. A topology is a graph of stream processors (nodes) connected by streams (edges). There are two ways to define a topology: Via the low-level processor API Via the Kafka streams DSL Stream processor: This is a node present in the processor topology. It represents a processing step in a topology and is used to transform data in streams. The standard operations—filter, join, map, and aggregations—are examples of stream processors available in Kafka streams. Windowing: Sometimes, data records are divided into time buckets by a stream processor to window the stream by time. This is usually required for aggregation and join operations. Join: When two or more streams are merged based on the keys of their data records, a new stream is generated. The operation that generates this new stream is called a join. A join over record streams is usually required to be performed on a windowing basis. Aggregation: A new stream is generated by combining multiple input records into a single output record, by taking one input stream. The operation that creates this new stream is known as aggregation. Examples of aggregations are sums and counts. Setting up the project This recipe sets the project to use Kafka streams in the Treu application project. Getting ready The project generated in the first four chapters is needed. How to do it Open the build.gradle file on the Treu project generated in Chapter 4, Message Enrichment, and add these lines: apply plugin: 'java' apply plugin: 'application' sourceCompatibility = '1.8' mainClassName = 'treu.StreamingApp' repositories { mavenCentral() } version = '0.1.0' dependencies { compile 'org.apache.kafka:kafka-clients:1.0.0' compile 'org.apache.kafka:kafka-streams:1.0.0' compile 'org.apache.avro:avro:1.7.7' } jar { manifest { attributes 'Main-Class': mainClassName } from { configurations.compile.collect { it.isDirectory() ? it : zipTree(it) } } { exclude "META-INF/*.SF" exclude "META-INF/*.DSA" exclude "META-INF/*.RSA" } } To rebuild the app, from the project root directory, run this command: $ gradle jar The output is something like: ... BUILD SUCCESSFUL Total time: 24.234 secs As the next step, create a file called StreamingApp.java in the src/main/java/treu directory with the following contents: package treu; import org.apache.kafka.streams.StreamsBuilder; import org.apache.kafka.streams.Topology; import org.apache.kafka.streams.KafkaStreams; import org.apache.kafka.streams.StreamsConfig; import org.apache.kafka.streams.kstream.KStream; import java.util.Properties; public class StreamingApp { public static void main(String[] args) throws Exception { Properties props = new Properties(); props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streaming_app_id");// 1 props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); //2 StreamsConfig config = new StreamsConfig(props); // 3 StreamsBuilder builder = new StreamsBuilder(); //4 Topology topology = builder.build(); KafkaStreams streams = new KafkaStreams(topology, config); KStream<String, String> simpleFirstStream = builder.stream("src-topic"); //5 KStream<String, String> upperCasedStream = simpleFirstStream.mapValues(String::toUpperCase); //6 upperCasedStream.to("out-topic"); //7 System.out.println("Streaming App Started"); streams.start(); Thread.sleep(30000); //8 System.out.println("Shutting down the Streaming App"); streams.close(); } } How it works Follow the comments in the code: In line //1, the APPLICATION_ID_CONFIG is an identifier for the app inside the broker In line //2, the BOOTSTRAP_SERVERS_CONFIG specifies the broker to use In line //3, the StreamsConfig object is created, it is built with the properties specified In line //4, the StreamsBuilder object is created, it is used to build a topology In line //5, when KStream is created, the input topic is specified In line //6, another KStream is created with the contents of the src-topic but in uppercase In line //7, the uppercase stream should write the output to out-topic In line //8, the application will run for 30 seconds Running the streaming application In the previous recipe, the first version of the streaming app was coded. Now, in this recipe, everything is compiled and executed. Getting ready The execution of the previous recipe of this chapter is needed. How to do it The streaming app doesn't receive arguments from the command line: To build the project, from the treu directory, run the following command: $ gradle jar If everything is OK, the output should be: ... BUILD SUCCESSFUL Total time: … To run the project, we have four different command-line windows. The following diagram shows what the arrangement of command-line windows should look like: In the first command-line Terminal, run the control center: $ <confluent-path>/bin/confluent start In the second command-line Terminal, create the two topics needed: $ bin/kafka-topics --create --topic src-topic --zookeeper localhost:2181 --partitions 1 --replication-factor 1 $ bin/kafka-topics --create --topic out-topic --zookeeper localhost:2181 --partitions 1 --replication-factor 1 In that command-line Terminal, start the producer: $ bin/kafka-console-producer --broker-list localhost:9092 --topic src-topic This window is where the input messages are typed. In the third command-line Terminal, start a consumer script listening to outtopic: $ bin/kafka-console-consumer --bootstrap-server localhost:9092 -- from-beginning --topic out-topic In the fourth command-line Terminal, start up the processing application. Go the project root directory (where the Gradle jar command was executed) and run: $ java -jar ./build/libs/treu-0.1.0.jar localhost:9092 Go to the second command-line Terminal (console-producer) and send the following three messages (remember to press Enter between messages and execute each one in just one line): $> Hello [Enter] $> Kafka [Enter] $> Streams [Enter] The messages typed in console-producer should appear uppercase in the outtopic console consumer window: > HELLO > KAFKA > STREAMS We discussed about the Apache Kafka streams and how to get up and running with it. If you liked this post, be sure to check out Apache Kafka 1.0 Cookbook which consists of more useful recipes to work with Apache Kafka installation.

0
0
17880

article-image-build-data-streams-ibm-spss-modeler

Amey Varangaonkar

22 Feb 2018

7 min read

How to build data streams in IBM SPSS Modeler

Amey Varangaonkar

22 Feb 2018

7 min read

[box type="note" align="" class="" width=""]The following excerpt is taken from the book IBM SPSS Modeler Essentials, co-authored by Keith McCormick and Jesus Salcedo. This book gives you a quick overview of the fundamental concepts of data mining and how to put them to practical use with the help of SPSS Modeler.[/box] SPSS Modeler allows users to mine data visually on the stream canvas. This means that you will not be writing code for your data mining projects; instead you will be placing nodes on the stream canvas. Remember that nodes represent operations to be carried out on the data. So once nodes have been placed on the stream canvas, they need to be linked together to form a stream. A stream represents the flow of data going through a number of operations (nodes). The following diagram is an example of nodes on the canvas, as well as a stream: Given that you will spend a lot of time building streams, in this section you will learn the most efficient ways of manipulating nodes to create a stream. Mouse buttons When building streams, mouse buttons are used extensively so that nodes can be brought onto the canvas, connected, edited, and so on. When building streams within Modeler, mouse buttons are used in the following ways: The left button is used for selecting, placing, and positioning nodes on the stream Canvas The right button is used for invoking context (pop-up) menus that allow for editing, connecting, renaming, deleting, and running nodes The middle button (optional) is used for connecting and disconnecting nodes Adding nodes To begin a new stream, a node from the Sources palette needs to be placed on the stream canvas. There are three ways to add nodes to a stream from a palette: Method one: Click on palette and then on stream: Click on the Sources palette. Click on the Var. File node. This will cause the icon to be highlighted. Move the cursor over the stream canvas. Click anywhere in the stream canvas. A copy of the selected node now appears on the stream canvas. This node represents the action of reading data into Modeler from a delimited text data file. If you wish to move the node within the stream canvas, select it by clicking on the node, and while holding the left mouse button down, drag the node to a new position. Method two: Drag and drop: Now go back to the Sources palette. Click on the Statistics File node and drag and drop this node onto the canvas. The Statistics File node represents the action of reading data into Modeler from an IBM SPSS Statistics data file. Method three: Double-click: Go back to the Sources palette one more time. Double click on the Database node. The Database node represents the action of reading data into Modeler from an ODBC compliant database. Editing nodes Once a node has been brought onto the stream canvas, typically at this point you will want to edit the node so that you can specify which fields, cases, or files you want the node to apply to. There are two ways to edit a node. Method one: Right-click on a node: Right-click on the Var. File node: Notice that there are many things you can do within this context menu. You can edit, add comments, copy, delete a node, connect nodes, and so on. Most often you will probably either edit the node or connect nodes. Method two: Double-click on a node: Double-click on the Var. File node. This bypasses the context menu we saw previously, and goes directly into the node itself so we can edit it. Deleting nodes There will be times when you will want to delete a node that you have on the stream canvas. There are two ways to delete a node. Method one: Right-click on a node: Right-click on the Database File node. Select Delete. The node is now deleted. Method two: Use the Delete button from the keyboard: Click on the Statistics File node. Click on the Delete button on the keyboard. Building a stream When two or more nodes have been placed on the stream canvas, they need to be connected to produce a stream. This can be thought of as representing the flow of data through the nodes. To demonstrate this, we will place a Table node on the stream canvas next to the Var. File node. The Table node presents data in a table format. Click the Output palette to activate it. Click on the Table node. Place this node to the right of the Var. File node by clicking in the stream canvas: At this point, we now have two nodes on the stream canvas, however, we technically do not have a stream because the nodes are not speaking to each other (that is, they are not connected). Connecting nodes In order for nodes to work together, they must be connected. Connecting nodes allows you to bring data into Modeler, explore the data, manipulate the data (to either clean it up or create additional fields), build a model, evaluate the model, and ultimately score the data. There are three main ways to connect nodes to form a stream that is, double-clicking, using the middle mouse button, or manually: Method one: Double-click. The simplest way to form a stream is to double-click on nodes on a palette. This method automatically connects the new node to the currently selected node on the stream canvas: Select the Var. File node that is on the stream canvas Double-click the Table node from the Output palette This action automatically connects the Table node to the existing Var. File node, and a connecting arrow appears between the nodes. The head of the arrow indicates the direction of the data flow. Method two: Manually. To manually connect two nodes: Bring a Table node onto the canvas. Right-click on the Var. File node. Select Connect from the context menu. Click the Table node. Method three: Middle mouse button. To use the middle mouse button: Bring a Table node onto the canvas. Use the middle mouse button to click on the Var. File node. While holding the middle mouse button down, drag the cursor over to the Table node. Release the middle mouse button. Deleting connections When you know that you are no longer going to use a node, you can delete it. Often, though, you may not want to delete a node; instead you might want to delete a connection. Deleting a node completely gets rid of the node. Deleting a connection allows you to keep a node with all the edits you have done, but for now the unconnected node will not be part of the stream. Nodes can be disconnected in several ways: Method one: Delete the connecting arrow: Right-click on the connecting arrow. Click Delete Connection. Method two: Right-click on a node: Right-click on one of the nodes that has a connection. Select Disconnect from the Context menu. Method three: Double-clicking: Double-click with the middle mouse button on a node that has a connection. All connections to this node will be severed, but the connections to neighboring nodes will be intact. Thus, we saw it’s fairly easy to build and manage data streams in SPSS Modeler. If you found the above excerpt useful, make sure to check out our book IBM SPSS Modeler Essentials for more tips and tricks on effective data mining.

0
0
5473

article-image-implementing-gradient-descent-algorithm-to-solve-optimization-problems

Sunith Shetty

22 Feb 2018

7 min read

Implementing gradient descent algorithm to solve optimization problems

Sunith Shetty

22 Feb 2018

7 min read

[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Rajdeep Dua and Manpreet Singh Ghotra titled Neural Network Programming with Tensorflow. In this book, you will learn to leverage the power of TensorFlow to train neural networks of varying complexities, without any hassle.[/box] Today we will focus on the gradient descent algorithm and its different variants. We will take a simple example of linear regression to solve the optimization problem. Gradient descent is the most successful optimization algorithm. As mentioned earlier, it is used to do weights updates in a neural network so that we minimize the loss function. Let's now talk about an important neural network method called backpropagation, in which we firstly propagate forward and calculate the dot product of inputs with their corresponding weights, and then apply an activation function to the sum of products which transforms the input to an output and adds non linearities to the model, which enables the model to learn almost any arbitrary functional mappings. Later, we back propagate in the neural network, carrying error terms and updating weights values using gradient descent, as shown in the following graph: Different variants of gradient descent Standard gradient descent, also known as batch gradient descent, will calculate the gradient of the whole dataset but will perform only one update. Therefore, it can be quite slow and tough to control for datasets which are extremely large and don't fit in the memory. Let's now look at algorithms that can solve this problem. Stochastic gradient descent (SGD) performs parameter updates on each training example, whereas mini batch performs an update with n number of training examples in each batch. The issue with SGD is that, due to the frequent updates and fluctuations, it eventually complicates the convergence to the accurate minimum and will keep exceeding due to regular fluctuations. Mini-batch gradient descent comes to the rescue here, which reduces the variance in the parameter update, leading to a much better and stable convergence. SGD and mini-batch are used interchangeably. Overall problems with gradient descent include choosing a proper learning rate so that we avoid slow convergence at small values, or divergence at larger values and applying the same learning rate to all parameter updates wherein if the data is sparse we might not want to update all of them to the same extent. Lastly, is dealing with saddle points. Algorithms to optimize gradient descent We will now be looking at various methods for optimizing gradient descent in order to calculate different learning rates for each parameter, calculate momentum, and prevent decaying learning rates. To solve the problem of high variance oscillation of the SGD, a method called momentum was discovered; this accelerates the SGD by navigating along the appropriate direction and softening the oscillations in irrelevant directions. Basically, it adds a fraction of the update vector of the past step to the current update vector. Momentum value is usually set to .9. Momentum leads to a faster and stable convergence with reduced oscillations. Nesterov accelerated gradient explains that as we reach the minima, that is, the lowest point on the curve, momentum is quite high and it doesn't know to slow down at that point due to the large momentum which could cause it to miss the minima entirely and continue moving up. Nesterov proposed that we first make a long jump based on the previous momentum, then calculate the gradient and then make a correction which results in a parameter update. Now, this update prevents us to go too fast and not miss the minima, and makes it more responsive to changes. Adagrad allows the learning rate to adapt based on the parameters. Therefore, it performs large updates for infrequent parameters and small updates for frequent parameters. Therefore, it is very well-suited for dealing with sparse data. The main flaw is that its learning rate is always decreasing and decaying. Problems with decaying learning rates are solved using AdaDelta. AdaDelta solves the problem of decreasing learning rate in AdaGrad. In AdaGrad, the learning rate is computed as one divided by the sum of square roots. At each stage, we add another square root to the sum, which causes the denominator to decrease constantly. Now, instead of summing all prior square roots, it uses a sliding window which allows the sum to decrease. Adaptive Moment Estimation (Adam) computes adaptive learning rates for each parameter. Like AdaDelta, Adam not only stores the decaying average of past squared gradients but additionally stores the momentum change for each parameter. Adam works well in practice and is one of the most used optimization methods today. The following two images (image credit: Alec Radford) show the optimization behavior of optimization algorithms described earlier. We see their behavior on the contours of a loss surface over time. Adagrad, RMsprop, and Adadelta almost quickly head off in the right direction and converge fast, whereas momentum and NAG are headed off-track. NAG is soon able to correct its course due to its improved responsiveness by looking ahead and going to the minimum. The second image displays the behavior of the algorithms at a saddle point. SGD, Momentum, and NAG find it challenging to break symmetry, but slowly they manage to escape the saddle point, whereas Adagrad, Adadelta, and RMsprop head down the negative slope, as can seen from the following image: Which optimizer to choose In the case that the input data is sparse or if we want fast convergence while training complex neural networks, we get the best results using adaptive learning rate methods. We also don't need to tune the learning rate. For most cases, Adam is usually a good choice. Optimization with an example Let's take an example of linear regression, where we try to find the best fit for a straight line through a number of data points by minimizing the squares of the distance from the line to each data point. This is why we call it least squares regression. Essentially, we are formulating the problem as an optimization problem, where we are trying to minimize a loss function. Let's set up input data and look at the scatter plot: # input data xData = np.arange(100, step=.1) yData = xData + 20 * np.sin(xData/10) Define the data size and batch size: # define the data size and batch size nSamples = 1000 batchSize = 100 We will need to resize the data to meet the TensorFlow input format, as follows: # resize input for tensorflow xData = np.reshape(xData, (nSamples, 1)) yData = np.reshape(yData, (nSamples, 1)) The following scope initializes the weights and bias, and describes the linear model and loss function: with tf.variable_scope("linear-regression-pipeline"): W = tf.get_variable("weights", (1,1), initializer=tf.random_normal_initializer()) b = tf.get_variable("bias", (1, ), initializer=tf.constant_initializer(0.0)) # model yPred = tf.matmul(X, W) + b # loss function loss = tf.reduce_sum((y - yPred)**2/nSamples) We then set optimizers for minimizing the loss: # set the optimizer #optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(loss) #optimizer = tf.train.AdamOptimizer(learning_rate=.001).minimize(loss) #optimizer = tf.train.AdadeltaOptimizer(learning_rate=.001).minimize(loss) #optimizer = tf.train.AdagradOptimizer(learning_rate=.001).minimize(loss) #optimizer = tf.train.MomentumOptimizer(learning_rate=.001, momentum=0.9).minimize(loss) #optimizer = tf.train.FtrlOptimizer(learning_rate=.001).minimize(loss) optimizer = tf.train.RMSPropOptimizer(learning_rate=.001).minimize(loss) We then select the mini batch and run the optimizers errors = [] with tf.Session() as sess: # init variables sess.run(tf.global_variables_initializer()) for _ in range(1000): # select mini batch indices = np.random.choice(nSamples, batchSize) xBatch, yBatch = xData[indices], yData[indices] # run optimizer _, lossVal = sess.run([optimizer, loss], feed_dict={X: xBatch, y: yBatch}) errors.append(lossVal) plt.plot([np.mean(errors[i-50:i]) for i in range(len(errors))]) plt.show() plt.savefig("errors.png") The output of the preceding code is as follows: We also get a sliding curve, as follows: We learned optimization is a complicated subject and a lot depends on the nature and size of our data. Also, optimization depends on weight matrices. A lot of these optimizers are trained and tuned for tasks like image classification or predictions. However, for custom or new use cases, we need to perform trial and error to determine the best solution. To know more about how to build and optimize neural networks using TensorFlow, do checkout this book Neural Network Programming with Tensorflow.

0
0
78900

article-image-fat-2018-conference-session-2-summary-interpretability-explainability

Savia Lobo

22 Feb 2018

5 min read

FAT* 2018 Conference Session 2 Summary: Interpretability and Explainability

Savia Lobo

22 Feb 2018

5 min read

This session of the FAT* 2018 is about interpretability and explainability in machine learning models. With the advances in Deep learning, machine learning models have become more accurate. However, with accuracy and advancements, it is a tough task to keep the models highly explainable. This means, these models may appear as black boxes to business users, who utilize them without knowing what lies within. Thus, it is equally important to make ML models interpretable and explainable, which can be beneficial and essential for understanding ML models and to have a ‘behind the scenes’ knowledge of what’s happening within them. This understanding can be highly essential for heavily regulated industries like Finance, Medicine, Defence and so on. The Conference on Fairness, Accountability, and Transparency (FAT), which would be held on the 23rd and 24th of February, 2018 is a multi-disciplinary conference that brings together researchers and practitioners interested in fairness, accountability, and transparency in socio-technical systems. The FAT 2018 conference will witness 17 research papers, 6 tutorials, and 2 keynote presentations from leading experts in the field. This article covers research papers pertaining to the 2nd session that is dedicated to Interpretability and Explainability of machine-learned decisions. If you’ve missed our summary of the 1st session on Online Discrimination and Privacy, visit the article link for a catch up. Paper 1: Meaningful Information and the Right to Explanation This paper addresses an active debate in policy, industry, academia, and the media about whether and to what extent Europe’s new General Data Protection Regulation (GDPR) grants individuals a “right to explanation” of automated decisions. The paper explores two major papers, European Union Regulations on Algorithmic Decision Making and a “Right to Explanation” by Goodman and Flaxman (2017) Why a Right to Explanation of Automated Decision-Making Does Not Exist in the General Data Protection Regulation by Wachter et al. (2017) This paper demonstrates that the specified framework is built on incorrect legal and technical assumptions. In addition to responding to the existing scholarly contributions, the article articulates a positive conception of the right to explanation, located in the text and purpose of the GDPR. The authors take a position that the right should be interpreted functionally, flexibly, and should, at a minimum, enable a data subject to exercise his or her rights under the GDPR and human rights law. Key takeaways: The first paper by Goodman and Flaxman states that GDPR creates a "right to explanation" but without any argument. The second paper is in response to the first paper, where Watcher et al. have published an extensive critique, arguing against the existence of such a right. The current paper, on the other hand, is partially concerned with responding to the arguments of Watcher et al. Paper 2: Interpretable Active Learning The paper tries to highlight how due to complex and opaque ML models, the process of active learning has also become opaque. Not much has been known about what specific trends and patterns, the active learning strategy may be exploring. The paper expands on explaining about LIME (Local Interpretable Model-agnostic Explanations framework) to provide explanations for active learning recommendations. The authors, Richard Phillips, Kyu Hyun Chang, and Sorelle A. Friedler, demonstrate uses of LIME in generating locally faithful explanations for an active learning strategy. Further, the paper shows how these explanations can be used to understand how different models and datasets explore a problem space over time. Key takeaways: The paper demonstrates how active learning choices can be made more interpretable to non-experts. It also discusses techniques that make active learning interpretable to expert labelers, so that queries and query batches can be explained and the uncertainty bias can be tracked via interpretable clusters. It showcases per-query explanations of uncertainty to develop a system that allows experts to choose whether to label a query. This will allow them to incorporate domain knowledge and their own interests into the labeling process. It introduces a quantified notion of uncertainty bias, the idea that an algorithm may be less certain about its decisions on some data clusters than others. Paper 3: Interventions over Predictions: Reframing the Ethical Debate for Actuarial Risk Assessment Actuarial risk assessments might be unduly perceived as a neutral way to counteract implicit bias and increase the fairness of decisions made within the criminal justice system, from pretrial release to sentencing, parole, and probation. However, recently, these assessments have come under increased scrutiny, as critics claim that the statistical techniques underlying them might reproduce existing patterns of discrimination and historical biases that are reflected in the data. The paper proposes that machine learning should not be used for prediction, but rather to surface covariates that are fed into a causal model for understanding the social, structural and psychological drivers of crime. The authors, Chelsea Barabas, Madars Virza, Karthik Dinakar, Joichi Ito (MIT), Jonathan Zittrain (Harvard), propose an alternative application of machine learning and causal inference away from predicting risk scores to risk mitigation. Key takeaways: The paper gives a brief overview of how risk assessments have evolved from a tool used solely for prediction to one that is diagnostic at its core. The paper places a debate around risk assessment in a broader context. One can get a fuller understanding of the way these actuarial tools have evolved to achieve a varied set of social and institutional agendas. It argues for a shift away from predictive technologies, towards diagnostic methods that will help in understanding the criminogenic effects of the criminal justice system itself, as well as evaluate the effectiveness of interventions designed to interrupt cycles of crime. It proposes that risk assessments, when viewed as a diagnostic tool, can be used to understand the underlying social, economic and psychological drivers of crime. The authors also posit that causal inference offers the best framework for pursuing the goals to achieve a fair and ethical risk assessment tool.

0
0
29623

article-image-exchange-management-shell-common-tasks

Packt

21 Feb 2018

11 min read

Exchange Management Shell Common Tasks

Packt

21 Feb 2018

11 min read

0
0
9201

Packt

21 Feb 2018

11 min read

Debugging Your.Net

Packt

21 Feb 2018

11 min read

0
0
13152

How-To Tutorials

article-image-getting-started-raspberry-pi

Packt

21 Feb 2018

7 min read

Getting Started on the Raspberry Pi

Packt

21 Feb 2018

7 min read

In this article, by Soham Chetan Kamani, author of the book Full Stack Web Development with Raspberry Pi 3, we will cover the marvel of the Raspberry Pi, however, doesn’t end here. It’s extreme portability means we can now do things which were not previously possible with traditional desktop computers. The GPIO pins give us easy access to interface with external devices. This allows the Pi to act as a bridge between embedded electronics and sensors, and the power that linux gives us. In essence, we can run any code in our favorite programming language (which can run on linux), and interface it directly to outside hardware quickly and easily. Once we couple this with the wireless networking capabilities introduced in the Raspberry Pi 3, we gain the ability to make applications that would not have been feasible to make before this device existed.and Scar de Courcier, authors of Windows Forensics Cookbook The Raspberry Pi has become hugely popular as a portable computer, and for good reason. When it comes to what you can do with this tiny piece of technology, the sky’s the limit. Back in the day, computers used to be the size of entire neighborhood blocks, and only large corporations doing expensive research could afford them. After that we went on to embrace personal computers, which were still a bit expensive, but, for the most part, could be bought by the common man. This brings us to where we are today, where we can buy a fully functioning Linux computer, which is as big as a credit card, for under 30$. It is truly a huge leap in making computers available to anyone and everyone. (For more resources related to this topic, see here.) Web development and portable computing have come a long way. A few years ago we couldn’t dream of making a rich, interactive, and performant application which runs on the browser. Today, not only can we do that, but also do it all in the palm of our hands (quite literally). When we think of developing an application that uses databases, application servers, sockets, and cloud APIs, the picture that normally comes to mind is that of many server racks sitting in a huge room. In this book however, we are going to implement all of that using only the Raspberry Pi. In this article, we will go through the concept of the internet of things, and discuss how web development on the Raspberry Pi can help us get there. Following this, we will also learn how to set up our Raspberry Pi and access it from our computer. We will cover the following topics: The internet of things Our application Setting up Raspberry Pi Remote access The Internet of things (IOT) The web has until today been a network of computers exchanging data. The limitation of this was that it was a closed loop. People could send and receive data from other people via their computers, but rarely much else. The internet of things, in contrast, is a network of devices or sensors that connect the outside world to the internet. Superficially, nothing is different: the internet is still a network of computers. What has changed, is that now, these computers are collecting and uploading data from things instead of people. This now allows anyone who is connected to obtain information that is not collected by a human. The internet of things as a concept has been around for a long time, but it is only now that almost anyone can connect a sensor or device to the cloud, and the IOT revolution was hugely enabled by the advent of portable computing, which was led by the Raspberry Pi. A brief look at our application Throughout this book, we are going to go through different components and aspects of web development and embedded systems. These are all going to be held together by our central goal of making an entire web application capable of sensing and displaying the surrounding temperature and humidity. In order to make a properly functioning system, we have to first build out the individual parts. More difficult still, is making sure all the parts work well together. Keeping this in mind, let's take a look at the different components of our technology stack, and the problems that each of them solve : The sensor interface - Perception The sensor is what connects our otherwise isolated application to the outside world. The sensor will be connected to the GPIO pins of the Raspberry pi. We can interface with the sensor through various different native libraries. This is the starting point of our data. It is where all the data that is used by our application is created. If you think about it, every other component of our technology stack only exists to manage, manipulate, and display the data collected from the sensor. The database - Persistence "Data" is the term we give to raw information, which is information that we cannot easily aggregate or understand. Without a way to store and meaningfully process and retrieve this data, it will always remain "data" and never "information", which is what we actually want. If we just hook up a sensor and display whatever data it reads, we are missing out on a lot of additional information. Let's take the example of temperature: What if we wanted to find out how the temperature was changing over time? What if we wanted to find the maximum and minimum temperatures for a particular day, or a particular week, or even within a custom duration of time? What if we wanted to see temperature variation across locations? There is no way we could do any of this with only the sensor. We also need some sort of persistence and structure to our data, and this is exactly what the database provides for us. If we structure our data correctly, getting the answers to the above questions is just a matter of a simple database query. The user interface - Presentation The user interface is the layer which connects our application to the end user. One of the most challenging aspects of software development is to make information meaningful and understandable to regular users of our application. The UI layer serves exactly this purpose: it takes relevant information and shows it in such a way that it is easily understandable to humans. How do we achieve such a level of understandability with such a large amount of data? We use visual aids: like colors, charts and diagrams (just like how the diagrams in this book make its information easier to understand). An important thing for any developer to understand is that your end user actually doesn't care about any of the the back-end stuff. The only thing that matters to them is a good experience. Of course, all the 0ther components serve to make the users experience better, but it's really the user facing interface that leaves the first impression, and that's why it's so important to do it well. The application server - Middleware This layer consists of the actual server side code we are going to write to get the application running. It is also called "middleware". In addition to being in the exact center of the architecture diagram, this layer also acts as the controller and middle-man for the other layers. The HTML pages that form the UI are served through this layer. All the database queries that we were talking about earlier are made here. The code that runs in this layer is responsible for retrieving the sensor readings from our external pins and storing the data in our database. Summary We are just warming up! In this article we got a brief introduction to the concept of the internet of things. We then went on to look at an overview of what we were going to build throughout the rest of this book, and saw how the Raspberry Pi can help us get there. Resources for Article: Further resources on this subject: Clusters, Parallel Computing, and Raspberry Pi – A Brief Background [article] Setting up your Raspberry Pi [article] Welcome to JavaScript in the full stack [article]

0
0
10346

How-To Tutorials

Packt

21 Feb 2018

9 min read

API Gateway and its Need

Packt

21 Feb 2018

9 min read

In this article by Umesh R Sharma, author of the book Practical Microservices, we will cover API Gateway and its need with simple and short examples. (For more resources related to this topic, see here.) Dynamic websites show a lot on a single page, and there is a lot of information that needs to be shown on the page. The common success order summary page shows the cart detail and customer address. For this, frontend has to fire a different query to the customer detail service and order detail service. This is a very simple example of having multiple services on a single page. As a single microservice has to deal with only one concern, in result of that to show much information on page, there are many API calls on the same page. So, a website or mobile page can be very chatty in terms of displaying data on the same page. Another problem is that, sometimes, microservice talks on another protocol, then HTTP only, such as thrift call and so on. Outer consumers can't directly deal with microservice in that protocol. As a mobile screen is smaller than a web page, the result of the data required by the mobile or desktop API call is different. A developer would want to give less data to the mobile API or have different versions of the API calls for mobile and desktop. So, you could face a problem such as this: each client is calling different web services and keeping track of their web service and developers have to give backward compatibility because API URLs are embedded in clients like in mobile app. Why do we need the API Gateway? All these preceding problems can be addressed with the API Gateway in place. The API Gateway acts as a proxy between the API consumer and the API servers. To address the first problem in that scenario, there will only be one call, such as /successOrderSummary, to the API Gateway. The API Gateway, on behalf of the consumer, calls the order and user detail, then combines the result and serves to the client. So basically, it acts as a facade or API call, which may internally call many APIs. The API Gateway solves many purposes, some of which are as follows. Authentication API Gateways can take the overhead of authenticating an API call from outside. After that, all the internal calls remove security check. If the request comes from inside the VPC, it can remove the check of security, decrease the network latency a bit, and make the developer focus more on business logic than concerning about security. Different protocol Sometimes, microservice can internally use different protocols to talk to each other; it can be thrift call, TCP, UDP, RMI, SOAP, and so on. For clients, there can be only one rest-based HTTP call. Clients hit the API Gateway with the HTTP protocol and the API Gateway can make the internal call in required protocol and combine the results in the end from all web service. It can respond to the client in required protocol; in most of the cases, that protocol will be HTTP. Load-balancing The API Gateway can work as a load balancer to handle requests in the most efficient manner. It can keep a track of the request load it has sent to different nodes of a particular service. Gateway should be intelligent enough to load balances between different nodes of a particular service. With NGINX Plus coming into the picture, NGINX can be a good candidate for the API Gateway. It has many of the features to address the problem that is usually handled by the API Gateway. Request dispatching (including service discovery) One main feature of the gateway is to make less communication between client and microservcies. So, it initiates the parallel microservices if that is required by the client. From the client side, there will only be one hit. Gateway hits all the required services and waits for the results from all services. After obtaining the response from all the services, it combines the result and sends it back to the client. Reactive microservice designs can help you achieve this. Working with service discovery can give many extra features. It can mention which is the master node of service and which is the slave. Same goes for DB in case any write request can go to the master or read request can go to the slave. This is the basic rule, but users can apply so many rules on the basis of meta information provided by the API Gateway. Gateway can record the basic response time from each node of service instance. For higher priority API calls, it can be routed to the fastest responding node. Again, rules can be defined on the basis of the API Gateway you are using and how it will be implemented. Response transformation Being a first and single point of entry for all API calls, the API Gateway knows which type of client is calling a mobile, web client, or other external consumer; it can make the internal call to the client and give the data to different clients as per needs and configuration. Circuit breaker To handle the partial failure, the API Gateway uses a technique called circuit breaker pattern. A service failure in one service can cause the cascading failure in the flow to all the service calls in stack. The API Gateway can keep an eye on some threshold for any microservice. If any service passes that threshold, it marks that API as open circuit and decides not to make the call for a configured time. Hystrix (by Netflix) served this purpose efficiently. Default value in this is failing of 20 requests in 5 seconds. Developers can also mention the fall back for this open circuit. This fall back can be of dummy service. Once API starts giving results as expected, then gateway marks it as a closed service again. Pros and cons of API Gateway Using the API Gateway itself has its own pros and cons. In the previous section, we have described the advantages of using the API Gateway already. I will still try to make them in points as the pros of the API Gateway. Pros Microservice can focus on business logic Clients can get all the data in a single hit Authentication, logging, and monitoring can be handled by the API Gateway Gives flexibility to use completely independent protocols in which clients and microservice can talk It can give tailor-made results, as per the clients needs It can handle partial failure Addition to the preceding mentioned pros, some of the trade-offs are also to use this pattern. Cons It can cause performance degrade due to lots of happenings on the API Gateway With this, discovery service should be implemented Sometimes, it becomes the single point of failure Managing routing is an overhead of the pattern Adding additional network hope in the call Overall. it increases the complexity of the system Too much logic implementation in this gateway will lead to another dependency problem So, before using the API Gateway, both of the aspects should be considered. Decision of including the API Gateway in the system increases the cost as well. Before putting effort, cost, and management in this pattern, it is recommended to analysis how much you can gain from it. Example of API Gateway In this example, we will try to show only sample product pages that will fetch the data from service product detail to give information about the product. This example can be increased in many aspects. Our focus of this example is to only show how the API Gateway pattern works; so we will try to keep this example simple and small. This example will be using Zuul from Netflix as an API Gateway. Spring also had an implementation of Zuul in it, so we are creating this example with Spring Boot. For a sample API Gateway implementation, we will be using http://start.spring.io/ to generate an initial template of our code. Spring initializer is the project from Spring to help beginners generate basic Spring Boot code. A user has to set a minimum configuration and can hit the Generate Project button. If any user wants to set more specific details regarding the project, then they can see all the configuration settings by clicking on the Switch to the full version button, as shown in the following screenshot: Let's create a controller in the same package of main application class and put the following code in the file: @SpringBootApplication @RestController public class ProductDetailConrtoller { @Resource ProductDetailService pdService; @RequestMapping(value = "/product/{id}") public ProductDetail getAllProduct( @PathParam("id") String id) { return pdService.getProductDetailById(id); } } In the preceding code, there is an assumption of the pdService bean that will interact with Spring data repository for product detail and get the result for the required product ID. Another assumption is that this service is running on port 10000. Just to make sure everything is running, a hit on a URL such as http://localhost:10000/product/1 should give some JSON as response. For the API Gateway, we will create another Spring Boot application with Zuul support. Zuul can be activated by just adding a simple @EnableZuulProxy annotation. The following is a simple code to start the simple Zuul proxy: @SpringBootApplication @EnableZuulProxy public class ApiGatewayExampleInSpring { public static void main(String[] args) { SpringApplication.run(ApiGatewayExampleInSpring.class, args); } } Rest all the things are managed in configuration. In the application.properties file of the API Gateway, the content will be something as follows: zuul.routes.product.path=/product/** zuul.routes.produc.url=http://localhost:10000 ribbon.eureka.enabled=false server.port=8080 With this configuration, we are defining rules such as this: for any request for a URL such as /product/xxx, pass this request to http://localhost:10000. For outer world, it will be like http://localhost:8080/product/1, which will internally be transferred to the 10000 port. If we defined a spring.application.name variable as product in product detail microservice, then we don't need to define the URL path property here (zuul.routes.product.path=/product/** ), as Zuul, by default, will make it a URL/product. The example taken here for an API Gateway is not very intelligent, but this is a very capable API Gateway. Depending on the routes, filter, and caching defined in the Zuul's property, one can make a very powerful API Gateway. Summary In this article, you learned about the API Gateway, its need, and its pros and cons with the code example. Resources for Article: Further resources on this subject: What are Microservices? [article] Microservices and Service Oriented Architecture [article] Breaking into Microservices Architecture [article]

0
0
31663

FAT Conference 2018 Session 3: Fairness in Computer Vision and NLP

How to implement Dynamic SQL in PostgreSQL 10

Working with pandas DataFrames

Getting Started with Apache Kafka Clusters

Introduction to WordPress Plugin

How to perform Numeric Metric Aggregations with Elasticsearch

How to combine data files within IBM SPSS Modeler

Working with Kafka Streams

How to build data streams in IBM SPSS Modeler

Implementing gradient descent algorithm to solve optimization problems

Trending Topics

FAT* 2018 Conference Session 2 Summary: Interpretability and Explainability

Exchange Management Shell Common Tasks

Debugging Your.Net

Getting Started on the Raspberry Pi

API Gateway and its Need

Create a Free Account To Continue Reading

Sign in to activate your 7-day free access