Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Events
Videos
Audiobooks
Packt Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds

How-To Tutorials

7018 Articles
article-image-basics-programming-julia
Packt
03 Mar 2015
17 min read
Save for later

Basics of Programming in Julia

Packt
03 Mar 2015
17 min read
 In this article by Ivo Balbaert, author of the book Getting Started with Julia Programming, we will explore how Julia interacts with the outside world, reading from standard input and writing to standard output, files, networks, and databases. Julia provides asynchronous networking I/O using the libuv library. We will see how to handle data in Julia. We will also discover the parallel processing model of Julia. In this article, the following topics are covered: Working with files (including the CSV files) Using DataFrames (For more resources related to this topic, see here.) Working with files To work with files, we need the IOStream type. IOStream is a type with the supertype IO and has the following characteristics: The fields are given by names(IOStream) 4-element Array{Symbol,1}:  :handle   :ios    :name   :mark The types are given by IOStream.types (Ptr{None}, Array{Uint8,1}, String, Int64) The file handle is a pointer of the type Ptr, which is a reference to the file object. Opening and reading a line-oriented file with the name example.dat is very easy: // code in Chapter 8io.jl fname = "example.dat"                                 f1 = open(fname) fname is a string that contains the path to the file, using escaping of special characters with when necessary; for example, in Windows, when the file is in the test folder on the D: drive, this would become d:\test\example.dat. The f1 variable is now an IOStream(<file example.dat>) object. To read all lines one after the other in an array, use data = readlines(f1), which returns 3-element Array{Union(ASCIIString,UTF8String),1}: "this is line 1.rn" "this is line 2.rn" "this is line 3." For processing line by line, now only a simple loop is needed: for line in data   println(line) # or process line end close(f1) Always close the IOStream object to clean and save resources. If you want to read the file into one string, use readall. Use this only for relatively small files because of the memory consumption; this can also be a potential problem when using readlines. There is a convenient shorthand with the do syntax for opening a file, applying a function process, and closing it automatically. This goes as follows (file is the IOStream object in this code): open(fname) do file     process(file) end The do command creates an anonymous function, and passes it to open. Thus, the previous code example would have been equivalent to open(process, fname). Use the same syntax for processing a file fname line by line without the memory overhead of the previous methods, for example: open(fname) do file     for line in eachline(file)         print(line) # or process line     end end Writing a file requires first opening it with a "w" flag, then writing strings to it with write, print, or println, and then closing the file handle that flushes the IOStream object to the disk: fname =   "example2.dat" f2 = open(fname, "w") write(f2, "I write myself to a filen") # returns 24 (bytes written) println(f2, "even with println!") close(f2) Opening a file with the "w" option will clear the file if it exists. To append to an existing file, use "a". To process all the files in the current folder (or a given folder as an argument to readdir()), use this for loop: for file in readdir()   # process file end Reading and writing CSV files A CSV file is a comma-separated file. The data fields in each line are separated by commas "," or another delimiter such as semicolons ";". These files are the de-facto standard for exchanging small and medium amounts of tabular data. Such files are structured so that one line contains data about one data object, so we need a way to read and process the file line by line. As an example, we will use the data file Chapter 8winequality.csv that contains 1,599 sample measurements, 12 data columns, such as pH and alcohol per sample, separated by a semicolon. In the following screenshot, you can see the top 20 rows:   In general, the readdlm function is used to read in the data from the CSV files: # code in Chapter 8csv_files.jl: fname = "winequality.csv" data = readdlm(fname, ';') The second argument is the delimiter character (here, it is ;). The resulting data is a 1600x12 Array{Any,2} array of the type Any because no common type could be found:     "fixed acidity"   "volatile acidity"      "alcohol"   "quality"      7.4                        0.7                                9.4              5.0      7.8                        0.88                              9.8              5.0      7.8                        0.76                              9.8              5.0   … If the data file is comma separated, reading it is even simpler with the following command: data2 = readcsv(fname) The problem with what we have done until now is that the headers (the column titles) were read as part of the data. Fortunately, we can pass the argument header=true to let Julia put the first line in a separate array. It then naturally gets the correct datatype, Float64, for the data array. We can also specify the type explicitly, such as this: data3 = readdlm(fname, ';', Float64, 'n', header=true) The third argument here is the type of data, which is a numeric type, String or Any. The next argument is the line separator character, and the fifth indicates whether or not there is a header line with the field (column) names. If so, then data3 is a tuple with the data as the first element and the header as the second, in our case, (1599x12 Array{Float64,2}, 1x12 Array{String,2}) (There are other optional arguments to define readdlm, see the help option). In this case, the actual data is given by data3[1] and the header by data3[2]. Let's continue working with the variable data. The data forms a matrix, and we can get the rows and columns of data using the normal array-matrix syntax). For example, the third row is given by row3 = data[3, :] with data:  7.8  0.88  0.0  2.6  0.098  25.0  67.0  0.9968  3.2  0.68  9.8  5.0, representing the measurements for all the characteristics of a certain wine. The measurements of a certain characteristic for all wines are given by a data column, for example, col3 = data[ :, 3] represents the measurements of citric acid and returns a column vector 1600-element Array{Any,1}:   "citric acid" 0.0  0.0  0.04  0.56  0.0  0.0 …  0.08  0.08  0.1  0.13  0.12  0.47. If we need columns 2-4 (volatile acidity to residual sugar) for all wines, extract the data with x = data[:, 2:4]. If we need these measurements only for the wines on rows 70-75, get these with y = data[70:75, 2:4], returning a 6 x 3 Array{Any,2} outputas follows: 0.32   0.57  2.0 0.705  0.05  1.9 … 0.675  0.26  2.1 To get a matrix with the data from columns 3, 6, and 11, execute the following command: z = [data[:,3] data[:,6] data[:,11]] It would be useful to create a type Wine in the code. For example, if the data is to be passed around functions, it will improve the code quality to encapsulate all the data in a single data type, like this: type Wine     fixed_acidity::Array{Float64}     volatile_acidity::Array{Float64}     citric_acid::Array{Float64}     # other fields     quality::Array{Float64} end Then, we can create objects of this type to work with them, like in any other object-oriented language, for example, wine1 = Wine(data[1, :]...), where the elements of the row are splatted with the ... operator into the Wine constructor. To write to a CSV file, the simplest way is to use the writecsv function for a comma separator, or the writedlm function if you want to specify another separator. For example, to write an array data to a file partial.dat, you need to execute the following command: writedlm("partial.dat", data, ';') If more control is necessary, you can easily combine the more basic functions from the previous section. For example, the following code snippet writes 10 tuples of three numbers each to a file: // code in Chapter 8tuple_csv.jl fname = "savetuple.csv" csvfile = open(fname,"w") # writing headers: write(csvfile, "ColName A, ColName B, ColName Cn") for i = 1:10   tup(i) = tuple(rand(Float64,3)...)   write(csvfile, join(tup(i),","), "n") end close(csvfile) Using DataFrames If you measure n variables (each of a different type) of a single object of observation, then you get a table with n columns for each object row. If there are m observations, then we have m rows of data. For example, given the student grades as data, you might want to know "compute the average grade for each socioeconomic group", where grade and socioeconomic group are both columns in the table, and there is one row per student. The DataFrame is the most natural representation to work with such a (m x n) table of data. They are similar to pandas DataFrames in Python or data.frame in R. A DataFrame is a more specialized tool than a normal array for working with tabular and statistical data, and it is defined in the DataFrames package, a popular Julia library for statistical work. Install it in your environment by typing in Pkg.add("DataFrames") in the REPL. Then, import it into your current workspace with using DataFrames. Do the same for the packages DataArrays and RDatasets (which contains a collection of example datasets mostly used in the R literature). A common case in statistical data is that data values can be missing (the information is not known). The DataArrays package provides us with the unique value NA, which represents a missing value, and has the type NAtype. The result of the computations that contain the NA values mostly cannot be determined, for example, 42 + NA returns NA. (Julia v0.4 also has a new Nullable{T} type, which allows you to specify the type of a missing value). A DataArray{T} array is a data structure that can be n-dimensional, behaves like a standard Julia array, and can contain values of the type T, but it can also contain the missing (Not Available) values NA and can work efficiently with them. To construct them, use the @data macro: // code in Chapter 8dataarrays.jl using DataArrays using DataFrames dv = @data([7, 3, NA, 5, 42]) This returns 5-element DataArray{Int64,1}: 7  3   NA  5 42. The sum of these numbers is given by sum(dv) and returns NA. One can also assign the NA values to the array with dv[5] = NA; then, dv becomes [7, 3, NA, 5, NA]). Converting this data structure to a normal array fails: convert(Array, dv) returns ERROR: NAException. How to get rid of these NA values, supposing we can do so safely? We can use the dropna function, for example, sum(dropna(dv)) returns 15. If you know that you can replace them with a value v, use the array function: repl = -1 sum(array(dv, repl)) # returns 13 A DataFrame is a kind of an in-memory database, versatile in the ways you can work with the data. It consists of columns with names such as Col1, Col2, Col3, and so on. Each of these columns are DataArrays that have their own type, and the data they contain can be referred to by the column names as well, so we have substantially more forms of indexing. Unlike two-dimensional arrays, columns in a DataFrame can be of different types. One column might, for instance, contain the names of students and should therefore be a string. Another column could contain their age and should be an integer. We construct a DataFrame from the program data as follows: // code in Chapter 8dataframes.jl using DataFrames # constructing a DataFrame: df = DataFrame() df[:Col1] = 1:4 df[:Col2] = [e, pi, sqrt(2), 42] df[:Col3] = [true, false, true, false] show(df) Notice that the column headers are used as symbols. This returns the following 4 x 3 DataFrame object: We could also have used the full constructor as follows: df = DataFrame(Col1 = 1:4, Col2 = [e, pi, sqrt(2), 42],    Col3 = [true, false, true, false]) You can refer to the columns either by an index (the column number) or by a name, both of the following expressions return the same output: show(df[2]) show(df[:Col2]) This gives the following output: [2.718281828459045, 3.141592653589793, 1.4142135623730951,42.0] To show the rows or subsets of rows and columns, use the familiar splice (:) syntax, for example: To get the first row, execute df[1, :]. This returns 1x3 DataFrame.  | Row | Col1 | Col2    | Col3 |  |-----|------|---------|------|  | 1   | 1    | 2.71828 | true | To get the second and third row, execute df [2:3, :] To get only the second column from the previous result, execute df[2:3, :Col2]. This returns [3.141592653589793, 1.4142135623730951]. To get the second and third column from the second and third row, execute df[2:3, [:Col2, :Col3]], which returns the following output: 2x2 DataFrame  | Row | Col2    | Col3  |  |---- |-----   -|-------|  | 1   | 3.14159 | false |  | 2   | 1.41421 | true  | The following functions are very useful when working with DataFrames: The head(df) and tail(df) functions show you the first six and the last six lines of data respectively. The names function gives the names of the columns names(df). It returns 3-element Array{Symbol,1}:  :Col1  :Col2  :Col3. The eltypes function gives the data types of the columns eltypes(df). It gives the output as 3-element Array{Type{T<:Top},1}:  Int64  Float64  Bool. The describe function tries to give some useful summary information about the data in the columns, depending on the type, for example, describe(df) gives for column 2 (which is numeric) the min, max, median, mean, number, and percentage of NAs: Col2 Min      1.4142135623730951 1st Qu.  2.392264761937558  Median   2.929937241024419 Mean     12.318522011105483  3rd Qu.  12.856194490192344  Max      42.0  NAs      0  NA%      0.0% To load in data from a local CSV file, use the method readtable. The returned object is of type DataFrame: // code in Chapter 8dataframes.jl using DataFrames fname = "winequality.csv" data = readtable(fname, separator = ';') typeof(data) # DataFrame size(data) # (1599,12) Here is a fraction of the output: The readtable method also supports reading in gzipped CSV files. Writing a DataFrame to a file can be done with the writetable function, which takes the filename and the DataFrame as arguments, for example, writetable("dataframe1.csv", df). By default, writetable will use the delimiter specified by the filename extension and write the column names as headers. Both readtable and writetable support numerous options for special cases. Refer to the docs for more information (refer to http://dataframesjl.readthedocs.org/en/latest/). To demonstrate some of the power of DataFrames, here are some queries you can do: Make a vector with only the quality information data[:quality] Give the wines with alcohol percentage equal to 9.5, for example, data[ data[:alcohol] .== 9.5, :] Here, we use the .== operator, which does element-wise comparison. data[:alcohol] .== 9.5 returns an array of Boolean values (true for datapoints, where :alcohol is 9.5, and false otherwise). data[boolean_array, : ] selects those rows where boolean_array is true. Count the number of wines grouped by quality with by(data, :quality, data -> size(data, 1)), which returns the following: 6x2 DataFrame | Row | quality | x1  | |-----|---------|-----| | 1    | 3      | 10  | | 2    | 4      | 53  | | 3    | 5      | 681 | | 4    | 6      | 638 | | 5    | 7      | 199 | | 6    | 8      | 18  | The DataFrames package contains the by function, which takes in three arguments: A DataFrame, here it takes data A column to split the DataFrame on, here it takes quality A function or an expression to apply to each subset of the DataFrame, here data -> size(data, 1), which gives us the number of wines for each quality value Another easy way to get the distribution among quality is to execute the histogram hist function hist(data[:quality]) that gives the counts over the range of quality (2.0:1.0:8.0,[10,53,681,638,199,18]). More precisely, this is a tuple with the first element corresponding to the edges of the histogram bins, and the second denoting the number of items in each bin. So there are, for example, 10 wines with quality between 2 and 3, and so on. To extract the counts as a variable count of type Vector, we can execute _, count = hist(data[:quality]); the _ means that we neglect the first element of the tuple. To obtain the quality classes as a DataArray class, we will execute the following: class = sort(unique(data[:quality])) We can now construct a df_quality DataFrame with the class and count columns as df_quality = DataFrame(qual=class, no=count). This gives the following output: 6x2 DataFrame | Row | qual | no  | |-----|------|-----| | 1   | 3    | 10  | | 2   | 4    | 53  | | 3   | 5    | 681 | | 4   | 6    | 638 | | 5   | 7    | 199 | | 6   | 8    | 18  | To deepen your understanding and learn about the other features of Julia DataFrames (such as joining, reshaping, and sorting), refer to the documentation available at http://dataframesjl.readthedocs.org/en/latest/. Other file formats Julia can work with other human-readable file formats through specialized packages: For JSON, use the JSON package. The parse method converts the JSON strings into Dictionaries, and the json method turns any Julia object into a JSON string. For XML, use the LightXML package For YAML, use the YAML package For HDF5 (a common format for scientific data), use the HDF5 package For working with Windows INI files, use the IniFile package Summary In this article we discussed the basics of network programming in Julia. Resources for Article: Further resources on this subject: Getting Started with Electronic Projects? [article] Getting Started with Selenium Webdriver and Python [article] Handling The Dom In Dart [article]
Read more
  • 0
  • 0
  • 18945

article-image-getting-started-postgresql
Packt
03 Mar 2015
11 min read
Save for later

Getting Started with PostgreSQL

Packt
03 Mar 2015
11 min read
In this article by Ibrar Ahmed, Asif Fayyaz, and Amjad Shahzad, authors of the book PostgreSQL Developer's Guide, we will come across the basic features and functions of PostgreSQL, such as writing queries using psql, data definition in tables, and data manipulation from tables. (For more resources related to this topic, see here.) PostgreSQL is widely considered to be one of the most stable database servers available today, with multiple features that include: A wide range of built-in types MVCC New SQL enhancements, including foreign keys, primary keys, and constraints Open source code, maintained by a team of developers Trigger and procedure support with multiple procedural languages Extensibility in the sense of adding new data types and the client language From the early releases of PostgreSQL (from version 6.0 that is), many changes have been made, with each new major version adding new and more advanced features. The current version is PostgreSQL 9.4 and is available from several sources and in various binary formats. Writing queries using psql Before proceeding, allow me to explain to you that throughout this article, we will use a warehouse database called warehouse_db. In this section, I will show you how you can create such a database, providing you with sample code for assistance. You will need to do the following: We are assuming here that you have successfully installed PostgreSQL and faced no issues. Now, you will need to connect with the default database that is created by the PostgreSQL installer. To do this, navigate to the default path of installation, which is /opt/PostgreSQL/9.4/bin from your command line, and execute the following command that will prompt for a postgres user password that you provided during the installation: /opt/PostgreSQL/9.4/bin$./psql -U postgres Password for user postgres: Using the following command, you can log in to the default database with the user postgres and you will be able to see the following on your command line: psql (9.4beta1) Type "help" for help postgres=# You can then create a new database called warehouse_db using the following statement in the terminal: postgres=# CREATE DATABASE warehouse_db; You can then connect with the warehouse_db database using the following command: postgres=# c warehouse_db You are now connected to the warehouse_db database as the user postgres, and you will have the following warehouse_db shell: warehouse_db=# Let's summarize what we have achieved so far. We are now able to connect with the default database postgres and created a warehouse_db database successfully. It's now time to actually write queries using psql and perform some Data Definition Language (DDL) and Data Manipulation Language (DML) operations, which we will cover in the following sections. In PostgreSQL, we can have multiple databases. Inside the databases, we can have multiple extensions and schemas. Inside each schema, we can have database objects such as tables, views, sequences, procedures, and functions. We are first going to create a schema named record and then we will create some tables in this schema. To create a schema named record in the warehouse_db database, use the following statement: warehouse_db=# CREATE SCHEMA record; Creating, altering, and truncating a table In this section, we will learn about creating a table, altering the table definition, and truncating the table. Creating tables Now, let's perform some DDL operations starting with creating tables. To create a table named warehouse_tbl, execute the following statements: warehouse_db=# CREATE TABLE warehouse_tbl ( warehouse_id INTEGER NOT NULL, warehouse_name TEXT NOT NULL, year_created INTEGER, street_address TEXT, city CHARACTER VARYING(100), state CHARACTER VARYING(2), zip CHARACTER VARYING(10), CONSTRAINT "PRIM_KEY" PRIMARY KEY (warehouse_id) ); The preceding statements created the table warehouse_tbl that has the primary key warehouse_id. Now, as you are familiar with the table creation syntax, let's create a sequence and use that in a table. You can create the hist_id_seq sequence using the following statement: warehouse_db=# CREATE SEQUENCE hist_id_seq; The preceding CREATE SEQUENCE command creates a new sequence number generator. This involves creating and initializing a new special single-row table with the name hist_id_seq. The user issuing the command will own the generator. You can now create the table that implements the hist_id_seq sequence using the following statement: warehouse_db=# CREATE TABLE history ( history_id INTEGER NOT NULL DEFAULT nextval('hist_id_seq'), date TIMESTAMP WITHOUT TIME ZONE, amount INTEGER, data TEXT, customer_id INTEGER, warehouse_id INTEGER, CONSTRAINT "PRM_KEY" PRIMARY KEY (history_id), CONSTRAINT "FORN_KEY" FOREIGN KEY (warehouse_id) REFERENCES warehouse_tbl(warehouse_id) ); The preceding query will create a history table in the warehouse_db database, and the history_id column uses the sequence as the default input value. In this section, we successfully learned how to create a table and also learned how to use a sequence inside the table creation syntax. Altering tables Now that we have learned how to create multiple tables, we can practice some ALTER TABLE commands by following this section. With the ALTER TABLE command, we can add, remove, or rename table columns. Firstly, with the help of the following example, we will be able to add the phone_no column in the previously created table warehouse_tbl: warehouse_db=# ALTER TABLE warehouse_tbl ADD COLUMN phone_no INTEGER; We can then verify that a column is added in the table by describing the table as follows: warehouse_db=# d warehouse_tbl            Table "public.warehouse_tbl"                  Column     |         Type         | Modifiers ----------------+------------------------+----------- warehouse_id  | integer               | not null warehouse_name | text                   | not null year_created   | integer               | street_address | text                   | city           | character varying(100) | state           | character varying(2)   | zip             | character varying(10) | phone_no       | integer               | Indexes: "PRIM_KEY" PRIMARY KEY, btree (warehouse_id) Referenced by: TABLE "history" CONSTRAINT "FORN_KEY"FOREIGN KEY  (warehouse_id) REFERENCES warehouse_tbl(warehouse_id) TABLE  "history" CONSTRAINT "FORN_KEY" FOREIGN KEY (warehouse_id)  REFERENCES warehouse_tbl(warehouse_id) To drop a column from a table, we can use the following statement: warehouse_db=# ALTER TABLE warehouse_tbl DROP COLUMN phone_no; We can then finally verify that the column has been removed from the table by describing the table again as follows: warehouse_db=# d warehouse_tbl            Table "public.warehouse_tbl"                  Column     |         Type         | Modifiers ----------------+------------------------+----------- warehouse_id   | integer               | not null warehouse_name | text                   | not null year_created   | integer               | street_address | text                   | city           | character varying(100) | state           | character varying(2)   | zip             | character varying(10) | Indexes: "PRIM_KEY" PRIMARY KEY, btree (warehouse_id) Referenced by: TABLE "history" CONSTRAINT "FORN_KEY" FOREIGN KEY  (warehouse_id) REFERENCES warehouse_tbl(warehouse_id) TABLE  "history" CONSTRAINT "FORN_KEY" FOREIGN KEY (warehouse_id)  REFERENCES warehouse_tbl(warehouse_id) Truncating tables The TRUNCATE command is used to remove all rows from a table without providing any criteria. In the case of the DELETE command, the user has to provide the delete criteria using the WHERE clause. To truncate data from the table, we can use the following statement: warehouse_db=# TRUNCATE TABLE warehouse_tbl; We can then verify that the warehouse_tbl table has been truncated by performing a SELECT COUNT(*) query on it using the following statement: warehouse_db=# SELECT COUNT(*) FROM warehouse_tbl; count -------      0 (1 row) Inserting, updating, and deleting data from tables In this section, we will play around with data and learn how to insert, update, and delete data from a table. Inserting data So far, we have learned how to create and alter a table. Now it's time to play around with some data. Let's start by inserting records in the warehouse_tbl table using the following command snippet: warehouse_db=# INSERT INTO warehouse_tbl ( warehouse_id, warehouse_name, year_created, street_address, city, state, zip ) VALUES ( 1, 'Mark Corp', 2009, '207-F Main Service Road East', 'New London', 'CT', 4321 ); We can then verify that the record has been inserted by performing a SELECT query on the warehouse_tbl table as follows: warehouse_db=# SELECT warehouse_id, warehouse_name, street_address               FROM warehouse_tbl; warehouse_id | warehouse_name |       street_address         ---------------+----------------+------------------------------- >             1 | Mark Corp     | 207-F Main Service Road East (1 row) Updating data Once we have inserted data in our table, we should know how to update it. This can be done using the following statement: warehouse_db=# UPDATE warehouse_tbl SET year_created=2010 WHERE year_created=2009; To verify that a record is updated, let's perform a SELECT query on the warehouse_tbl table as follows: warehouse_db=# SELECT warehouse_id, year_created FROM               warehouse_tbl; warehouse_id | year_created --------------+--------------            1 |         2010 (1 row) Deleting data To delete data from a table, we can use the DELETE command. Let's add a few records to the table and then later on delete data on the basis of certain conditions: warehouse_db=# INSERT INTO warehouse_tbl ( warehouse_id, warehouse_name, year_created, street_address, city, state, zip ) VALUES ( 2, 'Bill & Co', 2014, 'Lilly Road', 'New London', 'CT', 4321 ); warehouse_db=# INSERT INTO warehouse_tbl ( warehouse_id, warehouse_name, year_created, street_address, city, state, zip ) VALUES ( 3, 'West point', 2013, 'Down Town', 'New London', 'CT', 4321 ); We can then delete data from the warehouse.tbl table, where warehouse_name is Bill & Co, by executing the following statement: warehouse_db=# DELETE FROM warehouse_tbl WHERE warehouse_name='Bill & Co'; To verify that a record has been deleted, we will execute the following SELECT query: warehouse_db=# SELECT warehouse_id, warehouse_name FROM warehouse_tbl WHERE warehouse_name='Bill & Co'; warehouse_id | warehouse_name --------------+---------------- (0 rows) The DELETE command is used to drop a row from a table, whereas the DROP command is used to drop a complete table. The TRUNCATE command is used to empty the whole table. Summary In this article, we learned how to utilize the SQL language for a collection of everyday DBMS exercises in an easy-to-use practical way. We also figured out how to make a complete database that incorporates DDL (create, alter, and truncate) and DML (insert, update, and delete) operators. Resources for Article: Further resources on this subject: Indexes [Article] Improving proximity filtering with KNN [Article] Using Unrestricted Languages [Article]
Read more
  • 0
  • 0
  • 2587

article-image-performance-considerations
Packt
03 Mar 2015
13 min read
Save for later

Performance Considerations

Packt
03 Mar 2015
13 min read
In this article by Dayong Du, the author of Apache Hive Essentials, we will look at the different performance considerations when using Hive. Although Hive is built to deal with big data, we still cannot ignore the importance of performance. Most of the time, a better Hive query can rely on the smart query optimizer to find the best execution strategy as well as the default setting best practice from vendor packages. However, as experienced users, we should learn more about the theory and practice of performance tuning in Hive, especially when working in a performance-based project or environment. We will start from utilities available in Hive to find potential issues causing poor performance. Then, we introduce the best practices of performance considerations in the areas of queries and job. (For more resources related to this topic, see here.) Performance utilities Hive provides the EXPLAIN and ANALYZE statements that can be used as utilities to check and identify the performance of queries. The EXPLAIN statement Hive provides an EXPLAIN command to return a query execution plan without running the query. We can use an EXPLAIN command for queries if we have a doubt or a concern about performance. The EXPLAIN command will help to see the difference between two or more queries for the same purpose. The syntax for EXPLAIN is as follows: EXPLAIN [EXTENDED|DEPENDENCY|AUTHORIZATION] hive_query The following keywords can be used: EXTENDED: This provides additional information for the operators in the plan, such as file pathname and abstract syntax tree. DEPENDENCY: This provides a JSON format output that contains a list of tables and partitions that the query depends on. It is available since HIVE 0.10.0. AUTHORIZATION: This lists all entities needed to be authorized including input and output to run the Hive query and authorization failures, if any. It is available since HIVE 0.14.0. A typical query plan contains the following three sections. We will also have a look at an example later: Abstract syntax tree (AST): Hive uses a pacer generator called ANTLR (see http://www.antlr.org/) to automatically generate a tree of syntax for HQL. We can usually ignore this most of the time. Stage dependencies: This lists all dependencies and number of stages used to run the query. Stage plans: It contains important information, such as operators and sort orders, for running the job. The following is what a typical query plan looks like. From the following example, we can see that the AST section is not shown since the EXTENDED keyword is not used with EXPLAIN. In the STAGE DEPENDENCIES section, both Stage-0 and Stage-1 are independent root stages. In the STAGE PLANS section, Stage-1 has one map and reduce referred to by Map Operator Tree and Reduce Operator Tree. Inside each Map/Reduce Operator Tree section, all operators corresponding to Hive query keywords as well as expressions and aggregations are listed. The Stage-0 stage does not have map and reduce. It is just a Fetch operation. jdbc:hive2://> EXPLAIN SELECT sex_age.sex, count(*). . . . . . .> FROM employee_partitioned. . . . . . .> WHERE year=2014 GROUP BY sex_age.sex LIMIT 2;+-----------------------------------------------------------------------------+| Explain |+-----------------------------------------------------------------------------+| STAGE DEPENDENCIES: || Stage-1 is a root stage || Stage-0 is a root stage || || STAGE PLANS: || Stage: Stage-1 || Map Reduce || Map Operator Tree: || TableScan || alias: employee_partitioned || Statistics: Num rows: 0 Data size: 227 Basic stats:PARTIAL || Column stats: NONE || Select Operator || expressions: sex_age (type: struct<sex:string,age:int>) || outputColumnNames: sex_age || Statistics: Num rows: 0 Data size: 227 Basic stats:PARTIAL || Column stats: NONE || Group By Operator || aggregations: count() || keys: sex_age.sex (type: string) || mode: hash || outputColumnNames: _col0, _col1 || Statistics: Num rows: 0 Data size: 227 Basic stats:PARTIAL || Column stats: NONE || Reduce Output Operator || key expressions: _col0 (type: string) || sort order: + || Map-reduce partition columns: _col0 (type: string) || Statistics: Num rows: 0 Data size: 227 Basic stats:PARTIAL|| Column stats: NONE || value expressions: _col1 (type: bigint) || Reduce Operator Tree: || Group By Operator || aggregations: count(VALUE._col0) || keys: KEY._col0 (type: string) || mode: mergepartial || outputColumnNames: _col0, _col1 || Statistics: Num rows: 0 Data size: 0 Basic stats: NONE || Column stats: NONE || Select Operator || expressions: _col0 (type: string), _col1 (type: bigint) || outputColumnNames: _col0, _col1 || Statistics: Num rows: 0 Data size: 0 Basic stats: NONE || Column stats: NONE || Limit || Number of rows: 2 || Statistics: Num rows: 0 Data size: 0 Basic stats: NONE || Column stats: NONE || File Output Operator || compressed: false || Statistics: Num rows: 0 Data size: 0 Basic stats: NONE || Column stats: NONE || table: || input format: org.apache.hadoop.mapred.TextInputFormat || output format:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat|| serde:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe|| || Stage: Stage-0 || Fetch Operator || limit: 2 |+-----------------------------------------------------------------------------+53 rows selected (0.26 seconds) The ANALYZE statement Hive statistics are a collection of data that describe more details, such as the number of rows, number of files, and raw data size, on the objects in the Hive database. Statistics is a metadata of Hive data. Hive supports statistics at the table, partition, and column level. These statistics serve as an input to the Hive Cost-Based Optimizer (CBO), which is an optimizer to pick the query plan with the lowest cost in terms of system resources required to complete the query. The statistics are gathered through the ANALYZE statement since Hive 0.10.0 on tables, partitions, and columns as given in the following examples: jdbc:hive2://> ANALYZE TABLE employee COMPUTE STATISTICS;No rows affected (27.979 seconds)jdbc:hive2://> ANALYZE TABLE employee_partitioned. . . . . . .> PARTITION(year=2014, month=12) COMPUTE STATISTICS;No rows affected (45.054 seconds)jdbc:hive2://> ANALYZE TABLE employee_id COMPUTE STATISTICS. . . . . . .> FOR COLUMNS employee_id;No rows affected (41.074 seconds) Once the statistics are built, we can check the statistics by the DESCRIBE EXTENDED/FORMATTED statement. From the table/partition output, we can find the statistics information inside the parameters, such as parameters:{numFiles=1, COLUMN_STATS_ACCURATE=true, transient_lastDdlTime=1417726247, numRows=4, totalSize=227, rawDataSize=223}). The following is an example: jdbc:hive2://> DESCRIBE EXTENDED employee_partitioned. . . . . . .> PARTITION(year=2014, month=12);jdbc:hive2://> DESCRIBE EXTENDED employee;…parameters:{numFiles=1, COLUMN_STATS_ACCURATE=true, transient_lastDdlTime=1417726247, numRows=4, totalSize=227, rawDataSize=223}).jdbc:hive2://> DESCRIBE FORMATTED employee.name;+--------+---------+---+---+---------+--------------+-----------+-----------+|col_name|data_type|min|max|num_nulls|distinct_count|avg_col_len|max_col_len|+--------+---------+---+---+---------+--------------+-----------+-----------+| name | string | | | 0 | 5 | 5.6 | 7 |+--------+---------+---+---+---------+--------------+-----------+-----------++---------+----------+-----------------+|num_trues|num_falses| comment |+---------+----------+-----------------+| | |from deserializer|+---------+----------+-----------------+3 rows selected (0.116 seconds) Hive statistics are persisted in the metastore to avoid computing them every time. For newly created tables and/or partitions, statistics are automatically computed by default if we enable the following setting: jdbc:hive2://> SET hive.stats.autogather=ture; Hive logs Logs provide useful information to find out how a Hive query/job runs. By checking the Hive logs, we can identify runtime problems and issues that may cause bad performance. There are two types of logs available in Hive: system log and job log. The system log contains the Hive running status and issues. It is configured in {HIVE_HOME}/conf/hive-log4j.properties. The following three lines for Hive log can be found: hive.root.logger=WARN,DRFAhive.log.dir=/tmp/${user.name}hive.log.file=hive.log To modify the status, we can either modify the preceding lines in hive-log4j.properties (applies to all users) or set from the Hive CLI (only applies to the current user and current session) as follows: hive --hiveconf hive.root.logger=DEBUG,console The job log contains Hive query information and is saved at the same place, /tmp/${user.name}, by default as one file for each Hive user session. We can override it in hive-site.xml with the hive.querylog.location property. If a Hive query generates MapReduce jobs, those logs can also be viewed through the Hadoop JobTracker Web UI. Job and query optimization Job and query optimization covers experience and skills to improve performance in the area of job-running mode, JVM reuse, job parallel running, and query optimizations in JOIN. Local mode Hadoop can run in standalone, pseudo-distributed, and fully distributed mode. Most of the time, we need to configure Hadoop to run in fully distributed mode. When the data to process is small, it is an overhead to start distributed data processing since the launching time of the fully distributed mode takes more time than the job processing time. Since Hive 0.7.0, Hive supports automatic conversion of a job to run in local mode with the following settings: jdbc:hive2://> SET hive.exec.mode.local.auto=true; --default falsejdbc:hive2://> SET hive.exec.mode.local.auto.inputbytes.max=50000000;jdbc:hive2://> SET hive.exec.mode.local.auto.input.files.max=5;--default 4 A job must satisfy the following conditions to run in the local mode: The total input size of the job is lower than hive.exec.mode.local.auto.inputbytes.max The total number of map tasks is less than hive.exec.mode.local.auto.input.files.max The total number of reduce tasks required is 1 or 0 JVM reuse By default, Hadoop launches a new JVM for each map or reduce job and runs the map or reduce task in parallel. When the map or reduce job is a lightweight job running only for a few seconds, the JVM startup process could be a significant overhead. The MapReduce framework (version 1 only, not Yarn) has an option to reuse JVM by sharing the JVM to run mapper/reducer serially instead of parallel. JVM reuse applies to map or reduce tasks in the same job. Tasks from different jobs will always run in a separate JVM. To enable the reuse, we can set the maximum number of tasks for a single job for JVM reuse using the mapred.job.reuse.jvm.num.tasks property. Its default value is 1: jdbc:hive2://> SET mapred.job.reuse.jvm.num.tasks=5; We can also set the value to –1 to indicate that all the tasks for a job will run in the same JVM. Parallel execution Hive queries commonly are translated into a number of stages that are executed by the default sequence. These stages are not always dependent on each other. Instead, they can run in parallel to save the overall job running time. We can enable this feature with the following settings: jdbc:hive2://> SET hive.exec.parallel=true; -- default falsejdbc:hive2://> SET hive.exec.parallel.thread.number=16;-- default 8, it defines the max number for running in parallel Parallel execution will increase the cluster utilization. If the utilization of a cluster is already very high, parallel execution will not help much in terms of overall performance. Join optimization Here, we'll briefly review the key settings for join improvement. Common join The common join is also called reduce side join. It is a basic join in Hive and works for most of the time. For common joins, we need to make sure the big table is on the right-most side or specified by hit, as follows: /*+ STREAMTABLE(stream_table_name) */. Map join Map join is used when one of the join tables is small enough to fit in the memory, so it is very fast but limited. Since Hive 0.7.0, Hive can convert map join automatically with the following settings: jdbc:hive2://> SET hive.auto.convert.join=true; --default falsejdbc:hive2://> SET hive.mapjoin.smalltable.filesize=600000000;--default 25Mjdbc:hive2://> SET hive.auto.convert.join.noconditionaltask=true;--default false. Set to true so that map join hint is not needed jdbc:hive2://> SET hive.auto.convert.join.noconditionaltask.size=10000000;--The default value controls the size of table to fit in memory Once autoconvert is enabled, Hive will automatically check if the smaller table file size is bigger than the value specified by hive.mapjoin.smalltable.filesize, and then Hive will convert the join to a common join. If the file size is smaller than this threshold, it will try to convert the common join into a map join. Once autoconvert join is enabled, there is no need to provide the map join hints in the query. Bucket map join Bucket map join is a special type of map join applied on the bucket tables. To enable bucket map join, we need to enable the following settings: jdbc:hive2://> SET hive.auto.convert.join=true; --default falsejdbc:hive2://> SET hive.optimize.bucketmapjoin=true; --default false In bucket map join, all the join tables must be bucket tables and join on buckets columns. In addition, the buckets number in bigger tables must be a multiple of the bucket number in the small tables. Sort merge bucket (SMB) join SMB is the join performed on the bucket tables that have the same sorted, bucket, and join condition columns. It reads data from both bucket tables and performs common joins (map and reduce triggered) on the bucket tables. We need to enable the following properties to use SMB: jdbc:hive2://> SET hive.input.format=. . . . . . .> org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;jdbc:hive2://> SET hive.auto.convert.sortmerge.join=true;jdbc:hive2://> SET hive.optimize.bucketmapjoin=true;jdbc:hive2://> SET hive.optimize.bucketmapjoin.sortedmerge=true;jdbc:hive2://> SET hive.auto.convert.sortmerge.join.noconditionaltask=true; Sort merge bucket map (SMBM) join SMBM join is a special bucket join but triggers map-side join only. It can avoid caching all rows in the memory like map join does. To perform SMBM joins, the join tables must have the same bucket, sort, and join condition columns. To enable such joins, we need to enable the following settings: jdbc:hive2://> SET hive.auto.convert.join=true;jdbc:hive2://> SET hive.auto.convert.sortmerge.join=truejdbc:hive2://> SET hive.optimize.bucketmapjoin=true;jdbc:hive2://> SET hive.optimize.bucketmapjoin.sortedmerge=true;jdbc:hive2://> SET hive.auto.convert.sortmerge.join.noconditionaltask=true;jdbc:hive2://> SET hive.auto.convert.sortmerge.join.bigtable.selection.policy=org.apache.hadoop.hive.ql.optimizer.TableSizeBasedBigTableSelectorForAutoSMJ; Skew join When working with data that has a highly uneven distribution, the data skew could happen in such a way that a small number of compute nodes must handle the bulk of the computation. The following setting informs Hive to optimize properly if data skew happens: jdbc:hive2://> SET hive.optimize.skewjoin=true;--If there is data skew in join, set it to true. Default is false. jdbc:hive2://> SET hive.skewjoin.key=100000;--This is the default value. If the number of key is bigger than--this, the new keys will send to the other unused reducers. Skew data could happen on the GROUP BY data too. To optimize it, we need to do the following settings to enable skew data optimization in the GROUP BY result: SET hive.groupby.skewindata=true; Once configured, Hive will first trigger an additional MapReduce job whose map output will randomly distribute to the reducer to avoid data skew. For more information about Hive join optimization, please refer to the Apache Hive wiki available at https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization and https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization. Summary In this article, we first covered how to identify performance bottlenecks using the EXPLAIN and ANALYZE statements. Then, we discussed job and query optimization in Hive. Resources for Article: Further resources on this subject: Apache Maven and m2eclipse [Article] Apache Karaf – Provisioning and Clusters [Article] Introduction to Apache ZooKeeper [Article]
Read more
  • 0
  • 0
  • 2339

article-image-elasticsearch-administration
Packt
03 Mar 2015
28 min read
Save for later

Elasticsearch Administration

Packt
03 Mar 2015
28 min read
In this article by Rafał Kuć and Marek Rogoziński, author of the book Mastering Elasticsearch, Second Edition we will talk more about the Elasticsearch configuration and new features introduced in Elasticsearch 1.0 and higher. By the end of this article, you will have learned: (For more resources related to this topic, see here.) Configuring the discovery and recovery modules Using the Cat API that allows a human-readable insight into the cluster status The backup and restore functionality Federated search Discovery and recovery modules When starting your Elasticsearch node, one of the first things that Elasticsearch does is look for a master node that has the same cluster name and is visible in the network. If a master node is found, the starting node gets joined into an already formed cluster. If no master is found, then the node itself is selected as a master (of course, if the configuration allows such behavior). The process of forming a cluster and finding nodes is called discovery. The module responsible for discovery has two main purposes—electing a master and discovering new nodes within a cluster. After the cluster is formed, a process called recovery is started. During the recovery process, Elasticsearch reads the metadata and the indices from the gateway, and prepares the shards that are stored there to be used. After the recovery of the primary shards is done, Elasticsearch should be ready for work and should continue with the recovery of all the replicas (if they are present). In this section, we will take a deeper look at these two modules and discuss the possibilities of configuration Elasticsearch gives us and what the consequences of changing them are. Note that the information provided in the Discovery and recovery modules section is an extension of what we already wrote in Elasticsearch Server Second Edition, published by Packt Publishing. Discovery configuration As we have already mentioned multiple times, Elasticsearch was designed to work in a distributed environment. This is the main difference when comparing Elasticsearch to other open source search and analytics solutions available. With such assumptions, Elasticsearch is very easy to set up in a distributed environment, and we are not forced to set up additional software to make it work like this. By default, Elasticsearch assumes that the cluster is automatically formed by the nodes that declare the same cluster.name setting and can communicate with each other using multicast requests. This allows us to have several independent clusters in the same network. There are a few implementations of the discovery module that we can use, so let's see what the options are. Zen discovery Zen discovery is the default mechanism that's responsible for discovery in Elasticsearch and is available by default. The default Zen discovery configuration uses multicast to find other nodes. This is a very convenient solution: just start a new Elasticsearch node and everything works—this node will be joined to the cluster if it has the same cluster name and is visible by other nodes in that cluster. This discovery method is perfectly suited for development time, because you don't need to care about the configuration; however, it is not advised that you use it in production environments. Relying only on the cluster name is handy but can also lead to potential problems and mistakes, such as the accidental joining of nodes. Sometimes, multicast is not available for various reasons or you don't want to use it for these mentioned reasons. In the case of bigger clusters, the multicast discovery may generate too much unnecessary traffic, and this is another valid reason why it shouldn't be used for production. For these cases, Zen discovery allows us to use the unicast mode. When using the unicast Zen discovery, a node that is not a part of the cluster will send a ping request to all the addresses specified in the configuration. By doing this, it informs all the specified nodes that it is ready to be a part of the cluster and can be either joined to an existing cluster or can form a new one. Of course, after the node joins the cluster, it gets the cluster topology information, but the initial connection is only done to the specified list of hosts. Remember that even when using unicast Zen discovery, the Elasticsearch node still needs to have the same cluster name as the other nodes. If you want to know more about the differences between multicast and unicast ping methods, refer to these URLs: http://en.wikipedia.org/wiki/Multicast and http://en.wikipedia.org/wiki/Unicast. If you still want to learn about the configuration properties of multicast Zen discovery, let's look at them. Multicast Zen discovery configuration The multicast part of the Zen discovery module exposes the following settings: discovery.zen.ping.multicast.address (the default: all available interfaces): This is the interface used for the communication given as the address or interface name. discovery.zen.ping.multicast.port (the default: 54328): This port is used for communication. discovery.zen.ping.multicast.group (the default: 224.2.2.4): This is the multicast address to send messages to. discovery.zen.ping.multicast.buffer_size (the default: 2048): This is the size of the buffer used for multicast messages. discovery.zen.ping.multicast.ttl (the default: 3): This is the time for which a multicast message lives. Every time a packet crosses the router, the TTL is decreased. This allows for the limiting area where the transmission can be received. Note that routers can have the threshold values assigned compared to TTL, which causes that TTL value to not match exactly the number of routers that a packet can jump over. discovery.zen.ping.multicast.enabled (the default: true): Setting this property to false turns off the multicast. You should disable multicast if you are planning to use the unicast discovery method. The unicast Zen discovery configuration The unicast part of Zen discovery provides the following configuration options: discovery.zen.ping.unicats.hosts: This is the initial list of nodes in the cluster. The list can be defined as a list or as an array of hosts. Every host can be given a name (or an IP address) or have a port or port range added. For example, the value of this property can look like this: ["master1", "master2:8181", "master3[80000-81000]"]. So, basically, the hosts' list for the unicast discovery doesn't need to be a complete list of Elasticsearch nodes in your cluster, because once the node is connected to one of the mentioned nodes, it will be informed about all the others that form the cluster. discovery.zen.ping.unicats.concurrent_connects (the default: 10): This is the maximum number of concurrent connections unicast discoveries will use. If you have a lot of nodes that the initial connection should be made to, it is advised that you increase the default value. Master node One of the main purposes of discovery apart from connecting to other nodes is to choose a master node—a node that will take care of and manage all the other nodes. This process is called master election and is a part of the discovery module. No matter how many master eligible nodes there are, each cluster will only have a single master node active at a given time. If there is more than one master eligible node present in the cluster, they can be elected as the master when the original master fails and is removed from the cluster. Configuring master and data nodes By default, Elasticsearch allows every node to be a master node and a data node. However, in certain situations, you may want to have worker nodes, which will only hold the data or process the queries and the master nodes that will only be used as cluster-managed nodes. One of these situations is to handle a massive amount of data, where data nodes should be as performant as possible, and there shouldn't be any delay in master nodes' responses. Configuring data-only nodes To set the node to only hold data, we need to instruct Elasticsearch that we don't want such a node to be a master node. In order to do this, we add the following properties to the elasticsearch.yml configuration file: node.master: falsenode.data: true Configuring master-only nodes To set the node not to hold data and only to be a master node, we need to instruct Elasticsearch that we don't want such a node to hold data. In order to do that, we add the following properties to the elasticsearch.yml configuration file: node.master: truenode.data: false Configuring the query processing-only nodes For large enough deployments, it is also wise to have nodes that are only responsible for aggregating query results from other nodes. Such nodes should be configured as nonmaster and nondata, so they should have the following properties in the elasticsearch.yml configuration file: node.master: falsenode.data: false Please note that the node.master and the node.data properties are set to true by default, but we tend to include them for configuration clarity. The master election configuration We already wrote about the master election configuration in Elasticsearch Server Second Edition, but this topic is very important, so we decided to refresh our knowledge about it. Imagine that you have a cluster that is built of 10 nodes. Everything is working fine until, one day, your network fails and three of your nodes are disconnected from the cluster, but they still see each other. Because of the Zen discovery and the master election process, the nodes that got disconnected elect a new master and you end up with two clusters with the same name with two master nodes. Such a situation is called a split-brain and you must avoid it as much as possible. When a split-brain happens, you end up with two (or more) clusters that won't join each other until the network (or any other) problems are fixed. If you index your data during this time, you may end up with data loss and unrecoverable situations when the nodes get joined together after the network split. In order to prevent split-brain situations or at least minimize the possibility of their occurrences, Elasticsearch provides a discovery.zen.minimum_master_nodes property. This property defines a minimum amount of master eligible nodes that should be connected to each other in order to form a cluster. So now, let's get back to our cluster; if we set the discovery.zen.minimum_master_nodes property to 50 percent of the total nodes available plus one (which is six, in our case), we would end up with a single cluster. Why is that? Before the network failure, we would have 10 nodes, which is more than six nodes, and these nodes would form a cluster. After the disconnections of the three nodes, we would still have the first cluster up and running. However, because only three nodes disconnected and three is less than six, these three nodes wouldn't be allowed to elect a new master and they would wait for reconnection with the original cluster. Zen discovery fault detection and configuration Elasticsearch runs two detection processes while it is working. The first process is to send ping requests from the current master node to all the other nodes in the cluster to check whether they are operational. The second process is a reverse of that—each of the nodes sends ping requests to the master in order to verify that it is still up and running and performing its duties. However, if we have a slow network or our nodes are in different hosting locations, the default configuration may not be sufficient. Because of this, the Elasticsearch discovery module exposes three properties that we can change: discovery.zen.fd.ping_interval: This defaults to 1s and specifies the interval of how often the node will send ping requests to the target node. discovery.zen.fd.ping_timeout: This defaults to 30s and specifies how long the node will wait for the sent ping request to be responded to. If your nodes are 100 percent utilized or your network is slow, you may consider increasing that property value. discovery.zen.fd.ping_retries: This defaults to 3 and specifies the number of ping request retries before the target node will be considered not operational. You can increase this value if your network has a high number of lost packets (or you can fix your network). There is one more thing that we would like to mention. The master node is the only node that can change the state of the cluster. To achieve a proper cluster state updates sequence, Elasticsearch master nodes process single cluster state update requests one at a time, make the changes locally, and send the request to all the other nodes so that they can synchronize their state. The master nodes wait for the given time for the nodes to respond, and if the time passes or all the nodes are returned, with the current acknowledgment information, it proceeds with the next cluster state update request processing. To change the time, the master node waits for all the other nodes to respond, and you should modify the default 30 seconds time by setting the discovery.zen.publish_timeout property. Increasing the value may be needed for huge clusters working in an overloaded network. The Amazon EC2 discovery Amazon, in addition to selling goods, has a few popular services such as selling storage or computing power in a pay-as-you-go model. So-called Amazon Elastic Compute Cloud (EC2) provides server instances and, of course, they can be used to install and run Elasticsearch clusters (among many other things, as these are normal Linux machines). This is convenient—you pay for instances that are needed in order to handle the current traffic or to speed up calculations, and you shut down unnecessary instances when the traffic is lower. Elasticsearch works well on EC2, but due to the nature of the environment, some features may work slightly differently. One of these features that works differently is discovery, because Amazon EC2 doesn't support multicast discovery. Of course, we can switch to unicast discovery, but sometimes, we want to be able to automatically discover nodes and, with unicast, we need to at least provide the initial list of hosts. However, there is an alternative—we can use the Amazon EC2 plugin, a plugin that combines the multicast and unicast discovery methods using the Amazon EC2 API. Make sure that during the set up of EC2 instances, you set up communication between them (on port 9200 and 9300 by default). This is crucial in order to have Elasticsearch nodes communicate with each other and, thus, cluster functioning is required. Of course, this communication depends on network.bind_host and network.publish_host (or network.host) settings. The EC2 plugin installation The installation of a plugin is as simple as with most of the plugins. In order to install it, we should run the following command: bin/plugin install elasticsearch/elasticsearch-cloud-aws/2.4.0 The EC2 plugin's generic configuration This plugin provides several configuration settings that we need to provide in order for the EC2 discovery to work: cluster.aws.access_key: Amazon access key—one of the credential values you can find in the Amazon configuration panel cluster.aws.secret_key: Amazon secret key—similar to the previously mentioned access_key setting, it can be found in the EC2 configuration panel The last thing is to inform Elasticsearch that we want to use a new discovery type by setting the discovery.type property to ec2 value and turn off multicast. Optional EC2 discovery configuration options The previously mentioned settings are sufficient to run the EC2 discovery, but in order to control the EC2 discovery plugin behavior, Elasticsearch exposes additional settings: cloud.aws.region: This region will be used to connect with Amazon EC2 web services. You can choose a region that's adequate for the region where your instance resides, for example, eu-west-1 for Ireland. The possible values can be eu-west, sa-east, us-east, us-west-1, us-west-2, ap-southeast-1, and ap-southeast-1. cloud.aws.ec2.endpoint: If you are using EC2 API services, instead of defining a region, you can provide an address of the AWS endpoint, for example, ec2.eu-west-1.amazonaws.com. cloud.aws.protocol: This is the protocol that should be used by the plugin to connect to Amazon Web Services endpoints. By default, Elasticsearch will use the HTTPS protocol (which means setting the value of the property to https). We can also change this behavior and set the property to http for the plugin to use HTTP without encryption. We are also allowed to overwrite the cloud.aws.protocol settings for each service by using the cloud.aws.ec2.protocol and cloud.aws.s3.protocol properties (the possible values are the same—https and http). cloud.aws.proxy_host: Elasticsearch allows us to define a proxy that will be used to connect to AWS endpoints. The cloud.aws.proxy_host property should be set to the address to the proxy that should be used. cloud.aws.proxy_port: The second property related to the AWS endpoints proxy allows us to specify the port on which the proxy is listening. The cloud.aws.proxy_port property should be set to the port on which the proxy listens. discovery.ec2.ping_timeout (the default: 3s): This is the time to wait for the response for the ping message sent to the other node. After this time, the nonresponsive node will be considered dead and removed from the cluster. Increasing this value makes sense when dealing with network issues or we have a lot of EC2 nodes. The EC2 nodes scanning configuration The last group of settings we want to mention allows us to configure a very important thing when building cluster working inside the EC2 environment—the ability to filter available Elasticsearch nodes in our Amazon Elastic Cloud Computing network. The Elasticsearch EC2 plugin exposes the following properties that can help us configure its behavior: discovery.ec2.host_type: This allows us to choose the host type that will be used to communicate with other nodes in the cluster. The values we can use are private_ip (the default one; the private IP address will be used for communication), public_ip (the public IP address will be used for communication), private_dns (the private hostname will be used for communication), and public_dns (the public hostname will be used for communication). discovery.ec2.groups: This is a comma-separated list of security groups. Only nodes that fall within these groups can be discovered and included in the cluster. discovery.ec2.availability_zones: This is array or command-separated list of availability zones. Only nodes with the specified availability zones will be discovered and included in the cluster. discovery.ec2.any_group (this defaults to true): Setting this property to false will force the EC2 discovery plugin to discover only those nodes that reside in an Amazon instance that falls into all of the defined security groups. The default value requires only a single group to be matched. discovery.ec2.tag: This is a prefix for a group of EC2-related settings. When you launch your Amazon EC2 instances, you can define tags, which can describe the purpose of the instance, such as the customer name or environment type. Then, you use these defined settings to limit discovery nodes. Let's say you define a tag named environment with a value of qa. In the configuration, you can now specify the following: discovery.ec2.tag.environment: qa and only nodes running on instances with this tag will be considered for discovery. cloud.node.auto_attributes: When this is set to true, Elasticsearch will add EC2-related node attributes (such as the availability zone or group) to the node properties and will allow us to use them, adjusting the Elasticsearch shard allocation and configuring the shard placement. Other discovery implementations The Zen discovery and EC2 discovery are not the only discovery types that are available. There are two more discovery types that are developed and maintained by the Elasticsearch team, and these are: Azure discovery: https://github.com/elasticsearch/elasticsearch-cloud-azure Google Compute Engine discovery: https://github.com/elasticsearch/elasticsearch-cloud-gce In addition to these, there are a few discovery implementations provided by the community, such as the ZooKeeper discovery for older versions of Elasticsearch (https://github.com/sonian/elasticsearch-zookeeper). The gateway and recovery configuration The gateway module allows us to store all the data that is needed for Elasticsearch to work properly. This means that not only is the data in Apache Lucene indices stored, but also all the metadata (for example, index allocation settings), along with the mappings configuration for each index. Whenever the cluster state is changed, for example, when the allocation properties are changed, the cluster state will be persisted by using the gateway module. When the cluster is started up, its state will be loaded using the gateway module and applied. One should remember that when configuring different nodes and different gateway types, indices will use the gateway type configuration present on the given node. If an index state should not be stored using the gateway module, one should explicitly set the index gateway type to none. The gateway recovery process Let's say explicitly that the recovery process is used by Elasticsearch to load the data stored with the use of the gateway module in order for Elasticsearch to work. Whenever a full cluster restart occurs, the gateway process kicks in to load all the relevant information we've mentioned—the metadata, the mappings, and of course, all the indices. When the recovery process starts, the primary shards are initialized first, and then, depending on the replica state, they are initialized using the gateway data, or the data is copied from the primary shards if the replicas are out of sync. Elasticsearch allows us to configure when the cluster data should be recovered using the gateway module. We can tell Elasticsearch to wait for a certain number of master eligible or data nodes to be present in the cluster before starting the recovery process. However, one should remember that when the cluster is not recovered, all the operations performed on it will not be allowed. This is done in order to avoid modification conflicts. Configuration properties Before we continue with the configuration, we would like to say one more thing. As you know, Elasticsearch nodes can play different roles—they can have a role of data nodes—the ones that hold data—they can have a master role, or they can be only used for request handing, which means not holding data and not being master eligible. Remembering all this, let's now look at the gateway configuration properties that we are allowed to modify: gateway.recover_after_nodes: This is an integer number that specifies how many nodes should be present in the cluster for the recovery to happen. For example, when set to 5, at least 5 nodes (doesn't matter whether they are data or master eligible nodes) must be present for the recovery process to start. gateway.recover_after_data_nodes: This is an integer number that allows us to set how many data nodes should be present in the cluster for the recovery process to start. gateway.recover_after_master_nodes: This is another gateway configuration option that allows us to set how many master eligible nodes should be present in the cluster for the recovery to start. gateway.recover_after_time: This allows us to set how much time to wait before the recovery process starts after the conditions defined by the preceding properties are met. If we set this property to 5m, we tell Elasticsearch to start the recovery process 5 minutes after all the defined conditions are met. The default value for this property is 5m, starting from Elasticsearch 1.3.0. Let's imagine that we have six nodes in our cluster, out of which four are data eligible. We also have an index that is built of three shards, which are spread across the cluster. The last two nodes are master eligible and they don't hold the data. What we would like to configure is the recovery process to be delayed for 3 minutes after the four data nodes are present. Our gateway configuration could look like this: gateway.recover_after_data_nodes: 4gateway.recover_after_time: 3m Expectations on nodes In addition to the already mentioned properties, we can also specify properties that will force the recovery process of Elasticsearch. These properties are: gateway.expected_nodes: This is the number of nodes expected to be present in the cluster for the recovery to start immediately. If you don't need the recovery to be delayed, it is advised that you set this property to the number of nodes (or at least most of them) with which the cluster will be formed from, because that will guarantee that the latest cluster state will be recovered. gateway.expected_data_nodes: This is the number of expected data eligible nodes to be present in the cluster for the recovery process to start immediately. gateway.expected_master_nodes: This is the number of expected master eligible nodes to be present in the cluster for the recovery process to start immediately. Now, let's get back to our previous example. We know that when all six nodes are connected and are in the cluster, we want the recovery to start. So, in addition to the preceeding configuration, we would add the following property: gateway.expected_nodes: 6 So the whole configuration would look like this: gateway.recover_after_data_nodes: 4gateway.recover_after_time: 3mgateway.expected_nodes: 6 The preceding configuration says that the recovery process will be delayed for 3 minutes once four data nodes join the cluster and will begin immediately after six nodes are in the cluster (doesn't matter whether they are data nodes or master eligible nodes). The local gateway With the release of Elasticsearch 0.20 (and some of the releases from 0.19 versions), all the gateway types, apart from the default local gateway type, were deprecated. It is advised that you do not use them, because they will be removed in future versions of Elasticsearch. This is still not the case, but if you want to avoid full data reindexation, you should only use the local gateway type, and this is why we won't discuss all the other types. The local gateway type uses a local storage available on a node to store the metadata, mappings, and indices. In order to use this gateway type and the local storage available on the node, there needs to be enough disk space to hold the data with no memory caching. The persistence to the local gateway is different from the other gateways that are currently present (but deprecated). The writes to this gateway are done in a synchronous manner in order to ensure that no data will be lost during the write process. In order to set the type of gateway that should be used, one should use the gateway.type property, which is set to local by default. There is one additional thing regarding the local gateway of Elasticsearch that we didn't talk about—dangling indices. When a node joins a cluster, all the shards and indices that are present on the node, but are not present in the cluster, will be included in the cluster state. Such indices are called dangling indices, and we are allowed to choose how Elasticsearch should treat them. Elasticsearch exposes the gateway.local.auto_import_dangling property, which can take the value of yes (the default value that results in importing all dangling indices into the cluster), close (results in importing the dangling indices into the cluster state but keeps them closed by default), and no (results in removing the dangling indices). When setting the gateway.local.auto_import_dangling property to no, we can also set the gateway.local.dangling_timeout property (defaults to 2h) to specify how long Elasticsearch will wait while deleting the dangling indices. The dangling indices feature can be nice when we restart old Elasticsearch nodes, and we don't want old indices to be included in the cluster. Low-level recovery configuration We discussed that we can use the gateway to configure the behavior of the Elasticsearch recovery process, but in addition to that, Elasticsearch allows us to configure the recovery process itself. However, we decided that it would be good to mention the properties we can use in the section dedicated to gateway and recovery. Cluster- level recovery configuration The recovery configuration is specified mostly on the cluster level and allows us to set general rules for the recovery module to work with. These settings are: indices.recovery.concurrent_streams: This defaults to 3 and specifies the number of concurrent streams that are allowed to be opened in order to recover a shard from its source. The higher the value of this property, the more pressure will be put on the networking layer; however, the recovery may be faster, depending on your network usage and throughput. indices.recovery.max_bytes_per_sec: By default, this is set to 20MB and specifies the maximum number of data that can be transferred during shard recovery per second. In order to disable data transfer limiting, one should set this property to 0. Similar to the number of concurrent streams, this property allows us to control the network usage of the recovery process. Setting this property to higher values may result in higher network utilization and a faster recovery process. indices.recovery.compress: This is set to true by default and allows us to define whether ElasticSearch should compress the data that is transferred during the recovery process. Setting this to false may lower the pressure on the CPU, but it will also result in more data being transferred over the network. indices.recovery.file_chunk_size: This is the chunk size used to copy the shard data from the source shard. By default, it is set to 512KB and is compressed if the indices.recovery.compress property is set to true. indices.recovery.translog_ops: This defaults to 1000 and specifies how many transaction log lines should be transferred between shards in a single request during the recovery process. indices.recovery.translog_size: This is the chunk size used to copy the shard transaction log data from the source shard. By default, it is set to 512KB and is compressed if the indices.recovery.compress property is set to true. In the versions prior to Elasticsearch 0.90.0, there was the indices.recovery.max_size_per_sec property that could be used, but it was deprecated, and it is suggested that you use the indices.recovery.max_bytes_per_sec property instead. However, if you are using an Elasticsearch version older than 0.90.0, it may be worth remembering this. All the previously mentioned settings can be updated using the Cluster Update API, or they can be set in the elasticsearch.yml file. Index-level recovery settings In addition to the values mentioned previously, there is a single property that can be set on a per-index basis. The property can be set both in the elasticsearch.yml file and using the indices Update Settings API, and it is called index.recovery.initial_shards. In general, Elasticsearch will only recover a particular shard when there is a quorum of shards present and if that quorum can be allocated. A quorum is 50 percent of the shards for the given index plus one. By using the index.recovery.initial_shards property, we can change what Elasticsearch will take as a quorum. This property can be set to the one of the following values: quorum: 50 percent, plus one shard needs to be present and be allocable. This is the default value. quorum-1: 50 percent of the shards for a given index need to be present and be allocable. full: All of the shards for the given index need to be present and be allocable. full-1: 100 percent minus one shards for the given index need to be present and be allocable. integer value: Any integer such as 1, 2, or 5 specifies the number of shards that are needed to be present and that can be allocated. For example, setting this value to 2 will mean that at least two shards need to be present and Elasticsearch needs at least 2 shards to be allocable. It is good to know about this property, but in most cases, the default value will be sufficient for most deployments. Summary In this article, we focused more on the Elasticsearch configuration and new features that were introduced in Elasticsearch 1.0. We configured discovery and recovery, and we used the human-friendly Cat API. In addition to that, we used the backup and restore functionality, which allowed easy backup and recovery of our indices. Finally, we looked at what federated search is and how to search and index data to multiple clusters, while still using all the functionalities of Elasticsearch and being connected to a single node. If you want to dig deeper, buy the book Mastering Elasticsearch, Second Edition and read in a simple step-by-step fashion using Elasticsearch to enhance your knowlege further. Resources for Article: Further resources on this subject: Downloading and Setting Up ElasticSearch [Article] Indexing the Data [Article] Driving Visual Analyses with Automobile Data (Python) [Article]
Read more
  • 0
  • 0
  • 5417

article-image-mapreduce-functions
Packt
03 Mar 2015
11 min read
Save for later

MapReduce functions

Packt
03 Mar 2015
11 min read
 In this article, by John Zablocki, author of the book, Couchbase Essentials, you will be acquainted to MapReduce and how you'll use it to create secondary indexes for our documents. At its simplest, MapReduce is a programming pattern used to process large amounts of data that is typically distributed across several nodes in parallel. In the NoSQL world, MapReduce implementations may be found on many platforms from MongoDB to Hadoop, and of course, Couchbase. Even if you're new to the NoSQL landscape, it's quite possible that you've already worked with a form of MapReduce. The inspiration for MapReduce in distributed NoSQL systems was drawn from the functional programming concepts of map and reduce. While purely functional programming languages haven't quite reached mainstream status, languages such as Python, C#, and JavaScript all support map and reduce operations. (For more resources related to this topic, see here.) Map functions Consider the following Python snippet: numbers = [1, 2, 3, 4, 5] doubled = map(lambda n: n * 2, numbers) #doubled == [2, 4, 6, 8, 10] These two lines of code demonstrate a very simple use of a map() function. In the first line, the numbers variable is created as a list of integers. The second line applies a function to the list to create a new mapped list. In this case, the map() function is supplied as a Python lambda, which is just an inline, unnamed function. The body of lambda multiplies each number by two. This map() function can be made slightly more complex by doubling only odd numbers, as shown in this code: numbers = [1, 2, 3, 4, 5] defdouble_odd(num):   if num % 2 == 0:     return num   else:     return num * 2   doubled = map(double_odd, numbers) #doubled == [2, 2, 6, 4, 10] Map functions are implemented differently in each language or platform that supports them, but all follow the same pattern. An iterable collection of objects is passed to a map function. Each item of the collection is then iterated over with the map function being applied to that iteration. The final result is a new collection where each of the original items is transformed by the map. Reduce functions Like maps, the reduce functions also work by applying a provided function to an iterable data structure. The key difference between the two is that the reduce function works to produce a single value from the input iterable. Using Python's built-in reduce() function, we can see how to produce a sum of integers, as follows: numbers = [1, 2, 3, 4, 5] sum = reduce(lambda x, y: x + y, numbers) #sum == 15 You probably noticed that unlike our map operation, the reduce lambda has two parameters (x and y in this case). The argument passed to x will be the accumulated value of all applications of the function so far, and y will receive the next value to be added to the accumulation. Parenthetically, the order of operations can be seen as ((((1 + 2) + 3) + 4) + 5). Alternatively, the steps are shown in the following list: x = 1, y = 2 x = 3, y = 3 x = 6, y = 4 x = 10, y = 5 x = 15 As this list demonstrates, the value of x is the cumulative sum of previous x and y values. As such, reduce functions are sometimes termed accumulate or fold functions. Regardless of their name, reduce functions serve the common purpose of combining pieces of a recursive data structure to produce a single value. Couchbase MapReduce Creating an index (or view) in Couchbase requires creating a map function written in JavaScript. When the view is created for the first time, the map function is applied to each document in the bucket containing the view. When you update a view, only new or modified documents are indexed. This behavior is known as incremental MapReduce. You can think of a basic map function in Couchbase as being similar to a SQL CREATE INDEX statement. Effectively, you are defining a column or a set of columns, to be indexed by the server. Of course, these are not columns, but rather properties of the documents to be indexed. Basic mapping To illustrate the process of creating a view, first imagine that we have a set of JSON documents as shown here: var books=[     { "id": 1, "title": "The Bourne Identity", "author": "Robert Ludlow"     },     { "id": 2, "title": "The Godfather", "author": "Mario Puzzo"     },     { "id": 3, "title": "Wiseguy", "author": "Nicholas Pileggi"     } ]; Each document contains title and author properties. In Couchbase, to query these documents by either title or author, we'd first need to write a map function. Without considering how map functions are written in Couchbase, we're able to understand the process with vanilla JavaScript: books.map(function(book) {   return book.author; }); In the preceding snippet, we're making use of the built-in JavaScript array's map() function. Similar to the Python snippets we saw earlier, JavaScript's map() function takes a function as a parameter and returns a new array with mapped objects. In this case, we'll have an array with each book's author, as follows: ["Robert Ludlow", "Mario Puzzo", "Nicholas Pileggi"] At this point, we have a mapped collection that will be the basis for our author index. However, we haven't provided a means for the index to be able to refer back to its original document. If we were using a relational database, we'd have effectively created an index on the Title column with no way to get back to the row that contained it. With a slight modification to our map function, we are able to provide the key (the id property) of the document as well in our index: books.map(function(book) {   return [book.author, book.id]; }); In this slightly modified version, we're including the ID with the output of each author. In this way, the index has its document's key stored with its title. [["The Bourne Identity", 1], ["The Godfather", 2], ["Wiseguy", 3]] We'll soon see how this structure more closely resembles the values stored in a Couchbase index. Basic reducing Not every Couchbase index requires a reduce component. In fact, we'll see that Couchbase already comes with built-in reduce functions that will provide you with most of the reduce behavior you need. However, before relying on only those functions, it's important to understand why you'd use a reduce function in the first place. Returning to the preceding example of the map, let's imagine we have a few more documents in our set, as follows: var books=[     { "id": 1, "title": "The Bourne Identity", "author": "Robert Ludlow"     },     { "id": 2, "title": "The Bourne Ultimatum", "author": "Robert Ludlow"     },     { "id": 3, "title": "The Godfather", "author": "Mario Puzzo"     },     { "id": 4, "title": "The Bourne Supremacy", "author": "Robert Ludlow"     },     { "id": 5, "title": "The Family", "author": "Mario Puzzo"     },  { "id": 6, "title": "Wiseguy", "author": "Nicholas Pileggi"     } ]; We'll still create our index using the same map function because it provides a way of accessing a book by its author. Now imagine that we want to know how many books an author has written, or (assuming we had more data) the average number of pages written by an author. These questions are not possible to answer with a map function alone. Each application of the map function knows nothing about the previous application. In other words, there is no way for you to compare or accumulate information about one author's book to another book by the same author. Fortunately, there is a solution to this problem. As you've probably guessed, it's the use of a reduce function. As a somewhat contrived example, consider this JavaScript: mapped = books.map(function (book) {     return ([book.id, book.author]); });   counts = {} reduced = mapped.reduce(function(prev, cur, idx, arr) { var key = cur[1];     if (! counts[key]) counts[key] = 0;     ++counts[key] }, null); This code doesn't quite accurately reflect the way you would count books with Couchbase but it illustrates the basic idea. You look for each occurrence of a key (author) and increment a counter when it is found. With Couchbase MapReduce, the mapped structure is supplied to the reduce() function in a better format. You won't need to keep track of items in a dictionary. Couchbase views At this point, you should have a general sense of what MapReduce is, where it came from, and how it will affect the creation of a Couchbase Server view. So without further ado, let's see how to write our first Couchbase view. In fact, there were two to choose from. The bucket we'll use is beer-sample. If you didn't install it, don't worry. You can add it by opening the Couchbase Console and navigating to the Settings tab. Here, you'll find the option to install the bucket, as shown next: First, you need to understand the document structures with which you're working. The following JSON object is a beer document (abbreviated for brevity): {  "name": "Sundog",  "type": "beer",  "brewery_id": "new_holland_brewing_company",  "description": "Sundog is an amber ale...",  "style": "American-Style Amber/Red Ale",  "category": "North American Ale" } As you can see, the beer documents have several properties. We're going to create an index to let us query these documents by name. In SQL, the query would look like this: SELECT Id FROM Beers WHERE Name = ? You might be wondering why the SQL example includes only the Id column in its projection. For now, just know that to query a document using a view with Couchbase, the property by which you're querying must be included in an index. To create that index, we'll write a map function. The simplest example of a map function to query beer documents by name is as follows: function(doc) {   emit(doc.name); } This body of the map function has only one line. It calls the built-in Couchbase emit() function. This function is used to signal that a value should be indexed. The output of this map function will be an array of names. The beer-sample bucket includes brewery data as well. These documents look like the following code (abbreviated for brevity): {   "name": "Thomas Hooker Brewing",   "city": "Bloomfield",   "state": "Connecticut",   "website": "http://www.hookerbeer.com/",   "type": "brewery" } If we reexamine our map function, we'll see an obvious problem; both the brewery and beer documents have a name property. When this map function is applied to the documents in the bucket, it will create an index with documents from either the brewery or beer documents. The problem is that Couchbase documents exist in a single container—the bucket. There is no namespace for a set of related documents. The solution has typically involved including a type or docType property on each document. The value of this property is used to distinguish one document from another. In the case of the beer-sample database, beer documents have type = "beer" and brewery documents have type = "brewery". Therefore, we are easily able to modify our map function to create an index only on beer documents: function(doc) {   if (doc.type == "beer") {     emit(doc.name);   } } The emit() function actually takes two arguments. The first, as we've seen, emits a value to be indexed. The second argument is an optional value and is used by the reduce function. Imagine that we want to count the number of beer types in a particular category. In SQL, we would write the following query: SELECT Category, COUNT(*) FROM Beers GROUP BY Category To achieve the same functionality with Couchbase Server, we'll need to use both map and reduce functions. First, let's write the map. It will create an index on the category property: function(doc) {   if (doc.type == "beer") {     emit(doc.category, 1);   } } The only real difference between our category index and our name index is that we're including an argument for the value parameter of the emit() function. What we'll do with that value is simply count them. This counting will be done in our reduce function: function(keys, values) {   return values.length; } In this example, the values parameter will be given to the reduce function as a list of all values associated with a particular key. In our case, for each beer category, there will be a list of ones (that is, [1, 1, 1, 1, 1, 1]). Couchbase also provides a built-in _count function. It can be used in place of the entire reduce function in the preceding example. Now that we've seen the basic requirements when creating an actual Couchbase view, it's time to add a view to our bucket. The easiest way to do so is to use the Couchbase Console. Summary In this article, you learned the purpose of secondary indexes in a key/value store. We dug deep into MapReduce, both in terms of its history in functional languages and as a tool for NoSQL and big data systems. Resources for Article: Further resources on this subject: Map Reduce? [article] Introduction to Mapreduce [article] Working with Apps Splunk [article]
Read more
  • 0
  • 0
  • 4795

article-image-introducing-splunk
Packt
03 Mar 2015
14 min read
Save for later

Introducing Splunk

Packt
03 Mar 2015
14 min read
In this article by Betsy Page Sigman, author of the book Splunk Essentials, Splunk, whose "name was inspired by the process of exploring caves, or splunking, helps analysts, operators, programmers, and many others explore data from their organizations by obtaining, analyzing, and reporting on it. This multinational company, cofounded by Michael Baum, Rob Das, and Erik Swan, has a core product called "Splunk Enterprise. This manages searches, inserts, deletes, and filters, and analyzes big data that is generated by machines, as well as other types of data. "They also have a free version that has most of the capabilities of Splunk Enterprise and is an excellent learning tool. (For more resources related to this topic, see here.) Understanding events, event types, and fields in Splunk An understanding of events and event types is important before going further. Events In Splunk, an event is not just one of" the many local user meetings that are set up between developers to help each other out (although those can be very useful), "but also refers to a record of one activity that is recorded in a log file. Each event usually has: A timestamp indicating the date and exact time the event was created Information about what happened on the system that is being tracked Event types An event type is a way to allow "users to categorize similar events. It is field-defined by the user. You can define an event type in several ways, and the easiest way is by using the SplunkWeb interface. One common reason for setting up an event type is to examine why a system has failed. Logins are often problematic for systems, and a search for failed logins can help pinpoint problems. For an interesting example of how to save "a search on failed logins as an event type, visit http://docs.splunk.com/Documentation/Splunk/6.1.3/Knowledge/ClassifyAndGroupSimilarEvents#Save_a_search_as_a_new_event_type. Why are events and event types so important in Splunk? Because without events, there would be nothing to search, of course. And event types allow us to make meaningful searches easily and quickly according to our needs, as we'll see later. Sourcetypes Sourcetypes are also "important to understand, as they help define the rules for an event. A sourcetype is one of the default fields that Splunk assigns to data as it comes into the system. It determines what type of data it is so that Splunk can format it appropriately as it indexes it. This also allows the user who wants to search the "data to easily categorize it. Some of the common sourcetypes are listed as follows: access_combined, for "NCSA combined format HTTP web server logs apache_error, for standard "Apache web server error logs cisco_syslog, for the "standard syslog produced by Cisco network devices (including PIX firewalls, routers, and ACS), usually via remote syslog to a central log host websphere_core, a core file" export from WebSphere (Source: http://docs.splunk.com/Documentation/Splunk/latest/Data/Whysourcetypesmatter) Fields Each event in Splunk is" associated with a number of fields. The core fields of host, course, sourcetype, and timestamp are key to Splunk. These fields are extracted from events at multiple points in the data processing pipeline that Splunk uses, and each of these fields includes a name and a value. The name describes the field (such as the userid) and the value says what that field's value is (susansmith, for example). Some of these fields are default fields that are given because of where the event came from or what it is. When data is processed by Splunk, and when it is indexed or searched, it uses these fields. For indexing, the default fields added include those of host, source, and sourcetype. When searching, Splunk is able to select from a bevy of fields that can either be defined by the user or are very basic, such as action results in a purchase (for a website event). Fields are essential for doing the basic work of Splunk – that is, indexing and searching. Getting data into Splunk It's time to spring into action" now and input some data into Splunk. Adding data is "simple, easy, and quick. In this section, we will use some data and tutorials created by Splunk to learn how to add data: Firstly, to obtain your data, visit the tutorial data at http://docs.splunk.com/Documentation/Splunk/6.1.5/SearchTutorial/GetthetutorialdataintoSplunk that is readily available on Splunk. Here, download the folder tutorialdata.zip. Note that this will be a fresh dataset that has been collected over the last 7 days. Download it but don't extract the data from it just yet. You then need to log in to Splunk, using admin as the username and then by using your password. Once logged in, you will notice that toward the upper-right corner of your screen is the button Add Data, as shown in the following screenshot. Click "on this button: Button to Add Data Once you have "clicked on this button, you'll see a screen" similar to the "following screenshot: Add Data to Splunk by Choosing a Data Type or Data Source Notice here the "different types of data that you can select, as "well as the different data sources. Since the data we're going to use is a file, under "Or Choose a Data Source, click on From files and directories. Once you have clicked on this, you can then click on the radio button next to Skip preview, as indicated in the following screenshot, since you don't need to preview the data" now. You then need to click on "Continue: Preview data You can download the tutorial files at: http://docs.splunk.com/Documentation/Splunk/6.1.5/SearchTutorial/GetthetutorialdataintoSplunk As shown in the next screenshot, click on Upload and index a file, find the tutorialdata.zip file you just downloaded (it is probably in your Downloads folder), and then click on More settings, filling it in as shown in the following screenshot. (Note that you will need to select Segment in path under Host and type 1 under Segment Number.) Click on Save when you are done: Can specify source, additional settings, and source type Following this, you "should see a screen similar to the following" screenshot. Click on Start Searching, we will look at the data now: You should see this if your data has been successfully indexed into Splunk. You will now" see a screen similar to the following" screenshot. Notice that the number of events you have will be different, as will the time of the earliest event. At this point, click on Data Summary: The Search screen You should see the Data Summary screen like in the following screenshot. However, note that the Hosts shown here will not be the same as the ones you get. Take a quick look at what is on the Sources tab and the Sourcetypes tab. Then find the most recent data (in this case 127.0.0.1) and click on it. Data Summary, where you can see Hosts, Sources, and Sourcetypes After" clicking on the most recent data, which in "this case is bps-T341s, look at the events contained there. Later, when we use streaming data, we can see how the events at the top of this list change rapidly. Here, you will see a listing of events, similar to those shown in the "following screenshot: Events lists for the host value You can click on the Splunk logo in the upper-left corner "of the web page to return to the home page. Under Administrator at the "top-right of the page, click on Logout. Searching Twitter data We will start here by doing a simple search of our Twitter index, which is automatically created by the app once you have enabled Twitter input (as explained previously). In our earlier searches, we used the default index (which the tutorial data was downloaded to), so we didn't have to specify the index we wanted to use. Here, we will use just the Twitter index, so we need to specify that in the search. A simple search Imagine that we wanted to search for tweets containing the word coffee. We could use the code presented here and place it in the search bar: index=twitter text=*coffee* The preceding code searches only your Twitter index and finds all the places where the word coffee is mentioned. You have to put asterisks there, otherwise you will only get the tweets with just "coffee". (Note that the text field is not case sensitive, so tweets with either "coffee" or "Coffee" will be included in the search results.) The asterisks are included before and after the text "coffee" because otherwise we would only get events where just "coffee" was tweeted – a rather rare occurrence, we expect. In fact, when we search our indexed Twitter data without the asterisks around coffee, we got no results. Examining the Twitter event Before going further, it is useful to stop and closely examine the events that are collected as part of the search. The sample tweet shown in the following screenshot shows the large number of fields that are part of each tweet. The > was clicked to expand the event: A Twitter event There are several items to look closely at here: _time: Splunk assigns a timestamp for every event. This is done in UTC (Coordinated Universal Time) time format. contributors: The value for this field is null, as are the values of many Twitter fields. Retweeted_status: Notice the {+} here; in the following event list, you will see there are a number of fields associated with this, which can be seen when the + is selected and the list is expanded. This is the case wherever you see a {+} in a list of fields: Various retweet fields In addition to those shown previously, there are many other fields associated with a tweet. The 140 character (maximum) text field that most people consider to be the tweet is actually a small part of the actual data collected. The implied AND If you want to search on more than one term, there is no need to add AND as it is already implied. If, for example, you want to search for all tweets that include both the text "coffee" and the text "morning", then use: index=twitter text=*coffee* text=*morning* If you don't specify text= for the second term and just put *morning*, Splunk assumes that you want to search for *morning* in any field. Therefore, you could get that word in another field in an event. This isn't very likely in this case, although coffee could conceivably be part of a user's name, such as "coffeelover". But if you were searching for other text strings, such as a computer term like log or error, such terms could be found in a number of fields. So specifying the field you are interested in would be very important. The need to specify OR Unlike AND, you must always specify the word OR. For example, to obtain all events that mention either coffee or morning, enter: index=twitter text=*coffee* OR text=*morning* Finding other words used Sometimes you might want to find out what other words are used in tweets about coffee. You can do that with the following search: index=twitter text=*coffee* | makemv text | mvexpand text | top 30 text This search first searches for the word "coffee" in a text field, then creates a multivalued field from the tweet, and then expands it so that each word is treated as a separate piece of text. Then it takes the top 30 words that it finds. You might be asking yourself how you would use this kind of information. This type of analysis would be of interest to a marketer, who might want to use words that appear to be associated with coffee in composing the script for an advertisement. The following screenshot shows the results that appear (1 of 2 pages). From this search, we can see that the words love, good, and cold might be words worth considering: Search of top 30 text fields found with *coffee* When you do a search like this, you will notice that there are a lot of filler words (a, to, for, and so on) that appear. You can do two things to remedy this. You can increase the limit for top words so that you can see more of the words that come up, or you can rerun the search using the following code. "Coffee" (with a capital C) is listed (on the unshown second page) separately here from "coffee". The reason for this is that while the search is not case sensitive (thus both "coffee" and "Coffee" are picked up when you search on "coffee"), the process of putting the text fields through the makemv and the mvexpand processes ends up distinguishing on the basis of case. We could rerun the search, excluding some of the filler words, using the code shown here: index=twitter text=*coffee* | makemv text | mvexpand text |search NOT text="RT" AND NOT text="a" AND NOT text="to" ANDNOT text="the" | top 30 text Using a lookup table Sometimes it is useful to use a lookup file to avoid having to use repetitive code. It would help us to have a list of all the small words that might be found often in a tweet just by the nature of each word's frequent use in language, so that we might eliminate them from our quest to find words that would be relevant for use in the creation of advertising. If we had a file of such small words, we could use a command indicating not to use any of these more common, irrelevant words when listing the top 30 words associated with our search topic of interest. Thus, for our search for words associated with the text "coffee", we would be interested in words like " dark", "flavorful", and "strong", but not words like "a", "the", and "then". We can do this using a lookup command. There are three types of lookup commands, which are presented in the following table: Command Description lookup Matches a value of one field with a value of another, based on a .csv file with the two fields. Consider a lookup table named lutable that contains fields for machine_name and owner. Consider what happens when the following code snippet is used after a preceding search (indicated by . . . |): . . . | lookup lutable owner Splunk will use the lookup table to match the owner's name with its machine_name and add the machine_name to each event. inputlookup All fields in the .csv file are returned as results. If the following code snippet is used, both machine_name and owner would be searched: . . . | inputlookup lutable outputlookup This code outputs search results to a lookup table. The following code outputs results from the preceding research directly into a table it creates: . . . | outputlookup newtable.csv saves The command we will use here is inputlookup, because we want to reference a .csv file we can create that will include words that we want to filter out as we seek to find possible advertising words associated with coffee. Let's call the .csv file filtered_words.csv, and give it just a single text field, containing words like "is", "the", and "then". Let's rewrite the search to look like the following code: index=twitter text=*coffee*| makemv text | mvexpand text| search NOT [inputlookup filtered_words | fields text ]| top 30 text Using the preceding code, Splunk will search our Twitter index for *coffee*, and then expand the text field so that individual words are separated out. Then it will look for words that do NOT match any of the words in our filtered_words.csv file, and finally output the top 30 most frequently found words among those. As you can see, the lookup table can be very useful. To learn more about Splunk lookup tables, go to http://docs.splunk.com/Documentation/Splunk/6.1.5/SearchReference/Lookup. Summary In this article, we have learned more about how to use Splunk to create reports, dashboards. Splunk Enterprise Software, or Splunk, is an extremely powerful tool for searching, exploring, and visualizing data of all types. Splunk is becoming increasingly popular, as more and more businesses, both large and small, discover its ease and usefulness. Analysts, managers, students, and others can quickly learn how to use the data from their systems, networks, web traffic, and social media to make attractive and informative reports. This is a straightforward, practical, and quick introduction to Splunk that should have you making reports and gaining insights from your data in no time. Resources for Article: Further resources on this subject: Lookups [article] Working with Apps in Splunk [article] Loading data, creating an app, and adding dashboards and reports in Splunk [article]
Read more
  • 0
  • 0
  • 11723
Unlock access to the largest independent learning library in Tech for FREE!
Get unlimited access to 7500+ expert-authored eBooks and video courses covering every tech area you can think of.
Renews at $19.99/month. Cancel anytime
article-image-scipy-signal-processing
Packt
03 Mar 2015
14 min read
Save for later

SciPy for Signal Processing

Packt
03 Mar 2015
14 min read
In this article by Sergio J. Rojas G. and Erik A Christensen, authors of the book Learning SciPy for Numerical and Scientific Computing - Second Edition, we will focus on the usage of some most commonly used routines that are included in SciPy modules—scipy.signal, scipy.ndimage, and scipy.fftpack, which are used for signal processing, multidimensional image processing, and computing Fourier transforms, respectively. We define a signal as data that measures either a time-varying or spatially varying phenomena. Sound or electrocardiograms are excellent examples of time-varying quantities, while images embody the quintessential spatially varying cases. Moving images are treated with the techniques of both types of signals, obviously. The field of signal processing treats four aspects of this kind of data: its acquisition, quality improvement, compression, and feature extraction. SciPy has many routines to treat effectively tasks in any of the four fields. All these are included in two low-level modules (scipy.signal being the main module, with an emphasis on time-varying data, and scipy.ndimage, for images). Many of the routines in these two modules are based on Discrete Fourier Transform of the data. SciPy has an extensive package of applications and definitions of these background algorithms, scipy.fftpack, which we will start covering first. (For more resources related to this topic, see here.) Discrete Fourier Transforms The Discrete Fourier Transform (DFT from now on) transforms any signal from its time/space domain into a related signal in the frequency domain. This allows us not only to be able to analyze the different frequencies of the data, but also for faster filtering operations, when used properly. It is possible to turn a signal in the frequency domain back to its time/spatial domain; thanks to the Inverse Fourier Transform. We will not go into detail of the mathematics behind these operators, since we assume familiarity at some level with this theory. We will focus on syntax and applications instead. The basic routines in the scipy.fftpack module compute the DFT and its inverse, for discrete signals in any dimension, which are fft and ifft (one dimension), fft2 and ifft2 (two dimensions), and fftn and ifftn (any number of dimensions). All of these routines assume that the data is complex valued. If we know beforehand that a particular dataset is actually real valued, and should offer real-valued frequencies, we use rfft and irfft instead, for a faster algorithm. All these routines are designed so that composition with their inverses always yields the identity. The syntax is the same in all cases, as follows: fft(x[, n, axis, overwrite_x]) The first parameter, x, is always the signal in any array-like form. Note that fft performs one-dimensional transforms. This means in particular, that if x happens to be two-dimensional, for example, fft will output another two-dimensional array where each row is the transform of each row of the original. We can change it to columns instead, with the optional parameter, axis. The rest of parameters are also optional; n indicates the length of the transform, and overwrite_x gets rid of the original data to save memory and resources. We usually play with the integer n when we need to pad the signal with zeros, or truncate it. For higher dimension, n is substituted by shape (a tuple), and axis by axes (another tuple). To better understand the output, it is often useful to shift the zero frequencies to the center of the output arrays with fftshift. The inverse of this operation, ifftshift, is also included in the module. The following code shows some of these routines in action, when applied to a checkerboard image: >>> import numpy >>> from scipy.fftpack import fft,fft2, fftshift >>> import matplotlib.pyplot as plt >>> B=numpy.ones((4,4)); W=numpy.zeros((4,4)) >>> signal = numpy.bmat("B,W;W,B") >>> onedimfft = fft(signal,n=16) >>> twodimfft = fft2(signal,shape=(16,16)) >>> plt.figure() >>> plt.gray() >>> plt.subplot(121,aspect='equal') >>> plt.pcolormesh(onedimfft.real) >>> plt.colorbar(orientation='horizontal') >>> plt.subplot(122,aspect='equal') >>> plt.pcolormesh(fftshift(twodimfft.real)) >>> plt.colorbar(orientation='horizontal') >>> plt.show() Note how the first four rows of the one-dimensional transform are equal (and so are the last four), while the two-dimensional transform (once shifted) presents a peak at the origin, and nice symmetries in the frequency domain. In the following screenshot (obtained from the preceding code), the left-hand side image is fft and the right-hand side image is fft2 of a 2 x 2 checkerboard signal: The scipy.fftpack module also offers the Discrete Cosine Transform with its inverse (dct, idct) as well as many differential and pseudo-differential operators defined in terms of all these transforms: diff (for derivative/integral), hilbert and ihilbert (for the Hilbert transform), tilbert and itilbert (for the h-Tilbert transform of periodic sequences), and so on. Signal construction To aid in the construction of signals with predetermined properties, the scipy.signal module has a nice collection of the most frequent one-dimensional waveforms in the literature: chirp and sweep_poly (for the frequency-swept cosine generator), gausspulse (a Gaussian modulated sinusoid) and sawtooth and square (for the waveforms with those names). They all take as their main parameter a one-dimensional ndarray representing the times at which the signal is to be evaluated. Other parameters control the design of the signal, according to frequency or time constraints. Let's take a look into the following code snippet, which illustrates the use of these one dimensional waveforms that we just discussed: >>> import numpy >>> from scipy.signal import chirp, sawtooth, square, gausspulse >>> import matplotlib.pyplot as plt >>> t=numpy.linspace(-1,1,1000) >>> plt.subplot(221); plt.ylim([-2,2]) >>> plt.plot(t,chirp(t,f0=100,t1=0.5,f1=200))   # plot a chirp >>> plt.subplot(222); plt.ylim([-2,2]) >>> plt.plot(t,gausspulse(t,fc=10,bw=0.5))     # Gauss pulse >>> plt.subplot(223); plt.ylim([-2,2]) >>> t*=3*numpy.pi >>> plt.plot(t,sawtooth(t))                     # sawtooth >>> plt.subplot(224); plt.ylim([-2,2]) >>> plt.plot(t,square(t))                       # Square wave >>> plt.show() Generated by this code, the following diagram shows waveforms for chirp (upper-left), gausspulse (upper-right), sawtooth (lower-left), and square (lower-right): The usual method of creating signals is to import them from the file. This is possible by using purely NumPy routines, for example fromfile: fromfile(file, dtype=float, count=-1, sep='') The file argument may point to either a file or a string, the count argument is used to determine the number of items to read, and sep indicates what constitutes a separator in the original file/string. For images, we have the versatile routine, imread in either the scipy.ndimage or scipy.misc module: imread(fname, flatten=False) The fname argument is a string containing the location of an image. The routine infers the type of file, and reads the data into an array, accordingly. In case the flatten argument is turned to True, the image is converted to gray scale. Note that, in order to work, the Python Imaging Library (PIL) needs to be installed. It is also possible to load .wav files for analysis, with the read and write routines from the wavfile submodule in the scipy.io module. For instance, given any audio file with this format, say audio.wav, the command, rate,data = scipy.io.wavfile.read("audio.wav"), assigns an integer value to the rate variable, indicating the sample rate of the file (in samples per second), and a NumPy ndarray to the data variable, containing the numerical values assigned to the different notes. If we wish to write some one-dimensional ndarray data into an audio file of this kind, with the sample rate given by the rate variable, we may do so by issuing the following command: >>> scipy.io.wavfile.write("filename.wav",rate,data) Filters A filter is an operation on signals that either removes features or extracts some component. SciPy has a very complete set of known filters, as well as the tools to allow construction of new ones. The complete list of filters in SciPy is long, and we encourage the reader to explore the help documents of the scipy.signal and scipy.ndimage modules for the complete picture. We will introduce in these pages, as an exposition, some of the most used filters in the treatment of audio or image processing. We start by creating a signal worth filtering: >>> from numpy import sin, cos, pi, linspace >>> f=lambda t: cos(pi*t) + 0.2*sin(5*pi*t+0.1) + 0.2*sin(30*pi*t)    + 0.1*sin(32*pi*t+0.1) + 0.1*sin(47* pi*t+0.8) >>> t=linspace(0,4,400); signal=f(t) We first test the classical smoothing filter of Wiener and Kolmogorov, wiener. We present in a plot, the original signal (in black) and the corresponding filtered data, with a choice of a Wiener window of the size 55 samples (in blue). Next, we compare the result of applying the median filter, medfilt, with a kernel of the same size as before (in red): >>> from scipy.signal import wiener, medfilt >>> import matplotlib.pylab as plt >>> plt.plot(t,signal,'k') >>> plt.plot(t,wiener(signal,mysize=55),'r',linewidth=3) >>> plt.plot(t,medfilt(signal,kernel_size=55),'b',linewidth=3) >>> plt.show() This gives us the following graph showing the comparison of smoothing filters (wiener is the one that has its starting point just below 0.5 and medfilt has its starting point just above 0.5): Most of the filters in the scipy.signal module can be adapted to work in arrays of any dimension. But in the particular case of images, we prefer to use the implementations in the scipy.ndimage module, since they are coded with these objects in mind. For instance, to perform a median filter on an image for smoothing, we use scipy.ndimage.median_filter. Let's see an example. We will start by loading Lena to the array and corrupting the image with Gaussian noise (zero mean and standard deviation of 16): >>> from scipy.stats import norm     # Gaussian distribution >>> import matplotlib.pyplot as plt >>> import scipy.misc >>> import scipy.ndimage >>> plt.gray() >>> lena=scipy.misc.lena().astype(float) >>> plt.subplot(221); >>> plt.imshow(lena) >>> lena+=norm(loc=0,scale=16).rvs(lena.shape) >>> plt.subplot(222); >>> plt.imshow(lena) >>> denoised_lena = scipy.ndimage.median_filter(lena,3) >>> plt.subplot(224); >>> plt.imshow(denoised_lena) The set of filters for images come in two flavors—statistical and morphological. For example, among the filters of statistical nature, we have the Sobel algorithm oriented to detection of edges (singularities along curves). Its syntax is as follows: sobel(image, axis=-1, output=None, mode='reflect', cval=0.0) The optional parameter, axis, indicates the dimension in which the computations are performed. By default, this is always the last axis (-1). The mode parameter, which is one of the strings 'reflect', 'constant', 'nearest', 'mirror', or 'wrap', indicates how to handle the border of the image, in case there is insufficient data to perform the computations there. In case the mode is 'constant', we may indicate the value to use in the border, with the cval parameter. Let's look into the following code snippet, which illustrates the use of the sobel filter: >>> from scipy.ndimage.filters import sobel >>> import numpy >>> lena=scipy.misc.lena() >>> sblX=sobel(lena,axis=0); sblY=sobel(lena,axis=1) >>> sbl=numpy.hypot(sblX,sblY) >>> plt.subplot(223); >>> plt.imshow(sbl) >>> plt.show() The following screenshot illustrates Lena (upper-left) and noisy Lena (upper-right) with the preceding two filters in action—edge map with sobel (lower-left) and median filter (lower-right): Morphology We also have the possibility of creating and applying filters to images based on mathematical morphology, both to binary and gray-scale images. The four basic morphological operations are opening (binary_opening), closing (binary_closing), dilation (binary_dilation), and erosion (binary_erosion). Note that the syntax for each of these filters is very simple, since we only need two ingredients—the signal to filter and the structuring element to perform the morphological operation. Let's take a look into the general syntax for these morphological operations: binary_operation(signal, structuring_element) We may use combinations of these four basic morphological operations to create more complex filters for removal of holes, hit-or-miss transforms (to find the location of specific patterns in binary images), denoising, edge detection, and many more. The SciPy module also allows for creating some common filters using the preceding syntax. For instance, for the location of the letter e in a text, we could use the following command instead: >>> binary_hit_or_miss(text, letterE) For comparative purposes, let's use this command in the following code snippet: >>> import numpy >>> import scipy.ndimage >>> import matplotlib.pylab as plt >>> from scipy.ndimage.morphology import binary_hit_or_miss >>> text = scipy.ndimage.imread('CHAP_05_input_textImage.png') >>> letterE = text[37:53,275:291] >>> HitorMiss = binary_hit_or_miss(text, structure1=letterE,    origin1=1) >>> eLocation = numpy.where(HitorMiss==True) >>> x=eLocation[1]; y=eLocation[0] >>> plt.imshow(text, cmap=plt.cm.gray, interpolation='nearest') >>> plt.autoscale(False) >>> plt.plot(x,y,'wo',markersize=10) >>> plt.axis('off') >>> plt.show() The output for the preceding lines of code is generated as follows: For gray-scale images, we may use a structuring element (structuring_element) or a footprint. The syntax is, therefore, a little different: grey_operation(signal, [structuring_element, footprint, size, ...]) If we desire to use a completely flat and rectangular structuring element (all ones), then it is enough to indicate the size as a tuple. For instance, to perform gray-scale dilation of a flat element of size (15,15) on our classical image of Lena, we issue the following command: >>> grey_dilation(lena, size=(15,15)) The last kind of morphological operations coded in the scipy.ndimage module perform distance and feature transforms. Distance transforms create a map that assigns to each pixel, the distance to the nearest object. Feature transforms provide with the index of the closest background element instead. These operations are used to decompose images into different labels. We may even choose different metrics such as Euclidean distance, chessboard distance, and taxicab distance. The syntax for the distance transform (distance_transform) using a brute force algorithm is as follows: distance_transform_bf(signal, metric='euclidean', sampling=None, return_distances=True, return_indices=False,                      distances=None, indices=None) We indicate the metric with the strings such as 'euclidean', 'taxicab', or 'chessboard'. If we desire to provide the feature transform instead, we switch return_distances to False and return_indices to True. Similar routines are available with more sophisticated algorithms—distance_transform_cdt (using chamfering for taxicab and chessboard distances). For Euclidean distance, we also have distance_transform_edt. All these use the same syntax. Summary In this article, we explored signal processing (any dimensional) including the treatment of signals in frequency space, by means of their Discrete Fourier Transforms. These correspond to the fftpack, signal, and ndimage modules. Resources for Article: Further resources on this subject: Signal Processing Techniques [article] SciPy for Computational Geometry [article] Move Further with NumPy Modules [article]
Read more
  • 0
  • 0
  • 13934

article-image-time-travelling-spring
Packt
03 Mar 2015
18 min read
Save for later

Time Travelling with Spring

Packt
03 Mar 2015
18 min read
This article by Sujoy Acharya, the author of the book Mockito for Spring, delves into the details Time Travelling with Spring. Spring 4.0 is the Java 8-enabled latest release of the Spring Framework. In this article, we'll discover the major changes in the Spring 4.x release and the four important features of the Spring 4 framework. We will cover the following topics in depth: @RestController AsyncRestTemplate Async tasks Caching (For more resources related to this topic, see here.) Discovering the new Spring release This section deals with the new features and enhancements in Spring Framework 4.0. The following are the features: Spring 4 supports Java 8 features such as Java lambda expressions and java.time. Spring 4 supports JDK 6 as the minimum. All deprecated packages/methods are removed. Java Enterprise Edition 6 or 7 are the base of Spring 4, which is based on JPA 2 and Servlet 3.0. Bean configuration using the Groovy DSL is supported in Spring Framework 4.0. Hibernate 4.3 is supported by Spring 4. Custom annotations are supported in Spring 4. Autowired lists and arrays can be ordered. The @Order annotation and the Ordered interface are supported. The @Lazy annotation can now be used on injection points as well as on the @Bean definitions. For the REST application, Spring 4 provides a new @RestController annotation. We will discuss this in detail in the following section. The AsyncRestTemplate feature (class) is added for asynchronous REST client development. Different time zones are supported in Spring 4.0. New spring-websocket and spring-messaging modules have been added. The SocketUtils class is added to examine the free TCP and UDP server ports on localhost. All the mocks under the org.springframework.mock.web package are now based on the Servlet 3.0 specification. Spring supports JCache annotations and new improvements have been made in caching. The @Conditional annotation has been added to conditionally enable or disable an @Configuration class or even individual @Bean methods. In the test module, SQL script execution can now be configured declaratively via the new @Sql and @SqlConfig annotations on a per-class or per-method basis. You can visit the Spring Framework reference at http://docs.spring.io/spring/docs/4.1.2.BUILD-SNAPSHOT/spring-framework-reference/htmlsingle/#spring-whats-new for more details. Also, you can watch a video at http://zeroturnaround.com/rebellabs/spring-4-on-java-8-geekout-2013-video/ for more details on the changes in Spring 4. Working with asynchronous tasks Java 7 has a feature called Future. Futures let you retrieve the result of an asynchronous operation at a later time. The FutureTask class runs in a separate thread, which allows you to perform non-blocking asynchronous operations. Spring provides an @Async annotation to make it more easier to use. We'll explore Java's Future feature and Spring's @Async declarative approach: Create a project, TimeTravellingWithSpring, and add a package, com.packt.async. We'll exercise a bank's use case, where an automated job will run and settle loan accounts. It will also find all the defaulters who haven't paid the loan EMI for a month and then send an SMS to their number. The job takes time to process thousands of accounts, so it will be good if we can send SMSes asynchronously to minimize the burden of the job. We'll create a service class to represent the job, as shown in the following code snippet: @Service public class AccountJob {    @Autowired    private SMSTask smsTask; public void process() throws InterruptedException, ExecutionException { System.out.println("Going to find defaulters... "); Future<Boolean> asyncResult =smsTask.send("1", "2", "3"); System.out.println("Defaulter Job Complete. SMS will be sent to all defaulter"); Boolean result = asyncResult.get(); System.out.println("Was SMS sent? " + result); } } The job class autowires an SMSTask class and invokes the send method with phone numbers. The send method is executed asynchronously and Future is returned. When the job calls the get() method on Future, a result is returned. If the result is not processed before the get() method invocation, the ExecutionException is thrown. We can use a timeout version of the get() method. Create the SMSTask class in the com.packt.async package with the following details: @Component public class SMSTask { @Async public Future<Boolean> send(String... numbers) { System.out.println("Selecting SMS format "); try { Thread.sleep(2000); } catch (InterruptedException e) { e.printStackTrace(); return new AsyncResult<>(false); } System.out.println("Async SMS send task is Complete!!!"); return new AsyncResult<>(true); } } Note that the method returns Future, and the method is annotated with @Async to signify asynchronous processing. Create a JUnit test to verify asynchronous processing: @RunWith(SpringJUnit4ClassRunner.class) @ContextConfiguration(locations="classpath:com/packt/async/          applicationContext.xml") public class AsyncTaskExecutionTest { @Autowired ApplicationContext context; @Test public void jobTest() throws Exception { AccountJob job = (AccountJob)context.getBean(AccountJob.class); job.process(); } } The job bean is retrieved from the applicationContext file and then the process method is called. When we execute the test, the following output is displayed: Going to find defaulters... Defaulter Job Complete. SMS will be sent to all defaulter Selecting SMS format Async SMS send task is Complete!!! Was SMS sent? true During execution, you might feel that the async task is executed after a delay of 2 seconds as the SMSTask class waits for 2 seconds. Exploring @RestController JAX-RS provides the functionality for Representational State Transfer (RESTful) web services. REST is well-suited for basic, ad hoc integration scenarios. Spring MVC offers controllers to create RESTful web services. In Spring MVC 3.0, we need to explicitly annotate a class with the @Controller annotation in order to specify a controller servlet and annotate each and every method with @ResponseBody to serve JSON, XML, or a custom media type. With the advent of the Spring 4.0 @RestController stereotype annotation, we can combine @ResponseBody and @Controller. The following example will demonstrate the usage of @RestController: Create a dynamic web project, RESTfulWeb. Modify the web.xml file and add a configuration to intercept requests with a Spring DispatcherServlet: <web-app xsi_schemaLocation="http:// java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/webapp_ 3_0.xsd" id="WebApp_ID" version="3.0"> <display-name>RESTfulWeb</display-name> <servlet> <servlet-name>dispatcher</servlet-name> <servlet-class> org.springframework.web.servlet.DispatcherServlet </servlet-class> <load-on-startup>1</load-on-startup> </servlet> <servlet-mapping> <servlet-name>dispatcher</servlet-name> <url-pattern>/</url-pattern> </servlet-mapping> <context-param> <param-name>contextConfigLocation</param-name> <param-value> /WEB-INF/dispatcher-servlet.xml </param-value> </context-param> </web-app> The DispatcherServlet expects a configuration file with the naming convention [servlet-name]-servlet.xml. Create an application context XML, dispatcher-servlet.xml. We'll use annotations to configure Spring beans, so we need to tell the Spring container to scan the Java package in order to craft the beans. Add the following lines to the application context in order to instruct the container to scan the com.packt.controller package: <context:component-scan base-package= "com.packt.controller" /> <mvc:annotation-driven /> We need a REST controller class to handle the requests and generate a JSON output. Go to the com.packt.controller package and add a SpringService controller class. To configure the class as a REST controller, we need to annotate it with the @RestController annotation. The following code snippet represents the class: @RestController @RequestMapping("/hello") public class SpringService { private Set<String> names = new HashSet<String>(); @RequestMapping(value = "/{name}", method =          RequestMethod.GET) public String displayMsg(@PathVariable String name) {    String result = "Welcome " + name;    names.add(name);    return result; } @RequestMapping(value = "/all/", method =          RequestMethod.GET) public String anotherMsg() {    StringBuilder result = new StringBuilder("We          greeted so far ");    for(String name:names){      result.append(name).append(", ");    }    return result.toString();  } } We annotated the class with @RequestMapping("/hello"). This means that the SpringService class will cater for the requests with the http://{site}/{context}/hello URL pattern, or since we are running the app in localhost, the URL can be http://localhost:8080/RESTfulWeb/hello. The displayMsg method is annotated with @RequestMapping(value = "/{name}", method = RequestMethod.GET). So, the method will handle all HTTP GET requests with the URL pattern /hello/{name}. The name can be any String, such as /hello/xyz or /hello/john. In turn, the method stores the name to Set for later use and returns a greeting message, welcome {name}. The anotherMsg method is annotated with @RequestMapping(value = "/all/", method = RequestMethod.GET), which means that the method accepts all the requests with the http://{SITE}/{Context}/hello/all/ URL pattern. Moreover, this method builds a list of all users who visited the /hello/{names} URL. Remember, the displayMsg method stores the names in Set; this method iterates Set and builds a list of names who visited the /hello/{name} URL. There is some confusion though: what will happen if you enter the /hello/all URL in the browser? When we pass only a String literal after /hello/, the displayMsg method handles it, so you will be greeted with welcome all. However, if you type /hello/all/ instead—note that we added a slash after all—it means that the URL does not match the /hello/{name} pattern and the second method will handle the request and show you the list of users who visited the first URL. When we run the application and access the /hello/{name} URL, the following output is displayed: When we access http://localhost:8080/RESTfulWeb/hello/all/, the following output is displayed: Therefore, our RESTful application is ready for use, but just remember that in the real world, you need to secure the URLs against unauthorized access. In a web service, development security plays a key role. You can read the Spring security reference manual for additional information. Learning AsyncRestTemplate We live in a small, wonderful world where everybody is interconnected and impatient! We are interconnected through technology and applications, such as social networks, Internet banking, telephones, chats, and so on. Likewise, our applications are interconnected; often, an application housed in India may need to query an external service hosted in Philadelphia to get some significant information. We are impatient as we expect everything to be done in seconds; we get frustrated when we make an HTTP call to a remote service, and this blocks the processing unless the remote response is back. We cannot finish everything in milliseconds or nanoseconds, but we can process long-running tasks asynchronously or in a separate thread, allowing the user to work on something else. To handle RESTful web service calls asynchronously, Spring offers two useful classes: AsyncRestTemplate and ListenableFuture. We can make an async call using the template and get Future back and then continue with other processing, and finally we can ask Future to get the result. This section builds an asynchronous RESTful client to query the RESTful web service we developed in the preceding section. The AsyncRestTemplate class defines an array of overloaded methods to access RESTful web services asynchronously. We'll explore the exchange and execute methods. The following are the steps to explore the template: Create a package, com.packt.rest.template. Add a AsyncRestTemplateTest JUnit test. Create an exchange() test method and add the following lines: @Test public void exchange(){ AsyncRestTemplate asyncRestTemplate = new AsyncRestTemplate(); String url ="http://localhost:8080/RESTfulWeb/ hello/all/"; HttpMethod method = HttpMethod.GET; Class<String> responseType = String.class; HttpHeaders headers = new HttpHeaders(); headers.setContentType(MediaType.TEXT_PLAIN); HttpEntity<String> requestEntity = new HttpEntity<String>("params", headers); ListenableFuture<ResponseEntity<String>> future = asyncRestTemplate.exchange(url, method, requestEntity, responseType); try { //waits for the result ResponseEntity<String> entity = future.get(); //prints body of the given URL System.out.println(entity.getBody()); } catch (InterruptedException e) { e.printStackTrace(); } catch (ExecutionException e) { e.printStackTrace(); } } The exchange() method has six overloaded versions. We used the method that takes a URL, an HttpMethod method such as GET or POST, an HttpEntity method to set the header, and finally a response type class. We called the exchange method, which in turn called the execute method and returned ListenableFuture. The ListenableFuture is the handle to our output; we invoked the GET method on ListenableFuture to get the RESTful service call response. The ResponseEntity has the getBody, getClass, getHeaders, and getStatusCode methods for extracting the web service call response. We invoked the http://localhost:8080/RESTfulWeb/hello/all/ URL and got back the following response: Now, create an execute test method and add the following lines: @Test public void execute(){ AsyncRestTemplate asyncTemp = new AsyncRestTemplate(); String url ="http://localhost:8080/RESTfulWeb /hello/reader"; HttpMethod method = HttpMethod.GET; HttpHeaders headers = new HttpHeaders(); headers.setContentType(MediaType.TEXT_PLAIN); AsyncRequestCallback requestCallback = new AsyncRequestCallback (){ @Override public void doWithRequest(AsyncClientHttpRequest request) throws IOException { System.out.println(request.getURI()); } }; ResponseExtractor<String> responseExtractor = new ResponseExtractor<String>(){ @Override public String extractData(ClientHttpResponse response) throws IOException { return response.getStatusText(); } }; Map<String,String> urlVariable = new HashMap<String, String>(); ListenableFuture<String> future = asyncTemp.execute(url, method, requestCallback, responseExtractor, urlVariable); try { //wait for the result String result = future.get(); System.out.println("Status =" +result); } catch (InterruptedException e) { e.printStackTrace(); } catch (ExecutionException e) { e.printStackTrace(); } } The execute method has several variants. We invoke the one that takes a URL, HttpMethod such as GET or POST, an AsyncRequestCallback method which is invoked from the execute method just before executing the request asynchronously, a ResponseExtractor to extract the response, such as a response body, status code or headers, and a URL variable such as a URL that takes parameters. We invoked the execute method and received a future, as our ResponseExtractor extracts the status code. So, when we ask the future to get the result, it returns the response status which is OK or 200. In the AsyncRequestCallback method, we invoked the request URI; hence, the output first displays the request URI and then prints the response status. The following is the output: Caching objects Scalability is a major concern in web application development. Generally, most web traffic is focused on some special set of information. So, only those records are queried very often. If we can cache these records, then the performance and scalability of the system will increase immensely. The Spring Framework provides support for adding caching into an existing Spring application. In this section, we'll work with EhCache, the most widely used caching solution. Download the latest EhCache JAR from the Maven repository; the URL to download version 2.7.2 is http://mvnrepository.com/artifact/net.sf.ehcache/ehcache/2.7.2. Spring provides two annotations for caching: @Cacheable and @CacheEvict. These annotations allow methods to trigger cache population or cache eviction, respectively. The @Cacheable annotation is used to identify a cacheable method, which means that for an annotate method the result is stored into the cache. Therefore, on subsequent invocations (with the same arguments), the value in the cache is returned without actually executing the method. The cache abstraction allows the eviction of cache for removing stale or unused data from the cache. The @CacheEvict annotation demarcates the methods that perform cache eviction, that is, methods that act as triggers to remove data from the cache. The following are the steps to build a cacheable application with EhCache: Create a serializable Employee POJO class in the com.packt.cache package to store the employee ID and name. The following is the class definition: public class Employee implements Serializable { private static final long serialVersionUID = 1L; private final String firstName, lastName, empId;   public Employee(String empId, String fName, String lName) {    this.firstName = fName;    this.lastName = lName;    this.empId = empId; //Getter methods Spring caching supports two storages: the ConcurrentMap and ehcache libraries. To configure caching, we need to configure a manager in the application context. The org.springframework.cache.ehcache.EhCacheCacheManager class manages ehcache. Then, we need to define a cache with a configurationLocation attribute. The configurationLocation attribute defines the configuration resource. The ehcache-specific configuration is read from the resource ehcache.xml. <beans   xsi:schemaLocation=" http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans- 4.1.xsd http://www.springframework.org/schema/cache http://www. springframework.org/schema/cache/spring-cache- 4.1.xsd http://www.springframework.org/schema/context http://www. springframework.org/schema/context/springcontext- 4.1.xsd "> <context:component-scan base-package= "com.packt.cache" /> <cache:annotation-driven/> <bean id="cacheManager" class="org.springframework.cache. ehcache.EhCacheCacheManager" p:cacheManager-ref="ehcache"/> <bean id="ehcache" class="org.springframework.cache. ehcache.EhCacheManagerFactoryBean" p:configLocation="classpath:com/packt/cache/ehcache.xml"/> </beans> The <cache:annotation-driven/> tag informs the Spring container that the caching and eviction is performed in annotated methods. We defined a cacheManager bean and then defined an ehcache bean. The ehcache bean's configLocation points to an ehcache.xml file. We'll create the file next. Create an XML file, ehcache.xml, under the com.packt.cache package and add the following cache configuration data: <ehcache>    <diskStore path="java.io.tmpdir"/>    <cache name="employee"            maxElementsInMemory="100"            eternal="false"            timeToIdleSeconds="120"            timeToLiveSeconds="120"            overflowToDisk="true"            maxElementsOnDisk="10000000"            diskPersistent="false"            diskExpiryThreadIntervalSeconds="120"            memoryStoreEvictionPolicy="LRU"/>   </ehcache> The XML configures many things. Cache is stored in memory, but memory has a limit, so we need to define maxElementsInMemory. EhCache needs to store data to disk when max elements in memory reaches the threshold limit. We provide diskStore for this purpose. The eviction policy is set as an LRU, but the most important thing is the cache name. The name employee will be used to access the cache configuration. Now, create a service to store the Employee objects in a HashMap. The following is the service: @Service public class EmployeeService { private final Map<String, Employee> employees = new ConcurrentHashMap<String, Employee>(); @PostConstruct public void init() { saveEmployee (new Employee("101", "John", "Doe")); saveEmployee (new Employee("102", "Jack", "Russell")); } @Cacheable("employee") public Employee getEmployee(final String employeeId) { System.out.println(String.format("Loading a employee with id of : %s", employeeId)); return employees.get(employeeId); } @CacheEvict(value = "employee", key = "#emp.empId") public void saveEmployee(final Employee emp) { System.out.println(String.format("Saving a emp with id of : %s", emp.getEmpId())); employees.put(emp.getEmpId(), emp); } } The getEmployee method is a cacheable method; it uses the cache employee. When the getEmployee method is invoked more than once with the same employee ID, the object is returned from the cache instead of the original method being invoked. The saveEmployee method is a CacheEvict method. Now, we'll examine caching. We'll call the getEmployee method twice; the first call will populate the cache and the subsequent call will be responded toby the cache. Create a JUnit test, CacheConfiguration, and add the following lines: @RunWith(SpringJUnit4ClassRunner.class) @ContextConfiguration(locations="classpath:com/packt/cache/ applicationContext.xml") public class CacheConfiguration { @Autowired ApplicationContext context; @Test public void jobTest() throws Exception { EmployeeService employeeService = (EmployeeService)context.getBean(EmployeeService.class); long time = System.currentTimeMillis(); employeeService.getEmployee("101"); System.out.println("time taken ="+(System.currentTimeMillis() - time)); time = System.currentTimeMillis(); employeeService.getEmployee("101"); System.out.println("time taken to read from cache ="+(System.currentTimeMillis() - time)); time = System.currentTimeMillis(); employeeService.getEmployee("102"); System.out.println("time taken ="+(System.currentTimeMillis() - time)); time = System.currentTimeMillis(); employeeService.getEmployee("102"); System.out.println("time taken to read from cache ="+(System.currentTimeMillis() - time)); employeeService.saveEmployee(new Employee("1000", "Sujoy", "Acharya")); time = System.currentTimeMillis(); employeeService.getEmployee("1000"); System.out.println("time taken ="+(System.currentTimeMillis() - time)); time = System.currentTimeMillis(); employeeService.getEmployee("1000"); System.out.println("time taken to read from cache ="+(System.currentTimeMillis() - time)); } } Note that the getEmployee method is invoked twice for each employee, and we recorded the method execution time in milliseconds. You will find from the output that every second call is answered by the cache, as the first call prints Loading a employee with id of : 101 and then the next call doesn't print the message but prints the time taken to execute. You will also find that the time taken for the cached objects is zero or less than the method invocation time. The following screenshot shows the output: Summary This article started with discovering the features of the new major Spring release 4.0, such as Java 8 support and so on. Then, we picked four Spring 4 topics and explored them one by one. The @Async section showcased the execution of long-running methods asynchronously and provided an example of how to handle asynchronous processing. The @RestController section eased the RESTful web service development with the advent of the @RestController annotation. The AsyncRestTemplate section explained the RESTful client code to invoke RESTful web service asynchronously. Caching is inevitable for a high-performance, scalable web application. The caching section explained the EhCache and Spring integrations to achieve a high-availability caching solution. Resources for Article: Further resources on this subject: Getting Started with Mockito [article] Progressive Mockito [article] Understanding outside-in [article]
Read more
  • 0
  • 0
  • 2002

article-image-packaged-elegance
Packt
03 Mar 2015
24 min read
Save for later

Packaged Elegance

Packt
03 Mar 2015
24 min read
In this article by John Farrar, author of the book KnockoutJS Web development, we will see how templates drove us to a more dynamic, creative platform. The next advancement in web development was custom HTML components. KnockoutJS allows us to jump right in with some game-changing elegance for designers and developers. In this article, we will focus on: An introduction to components Bring Your Own Tags (BYOT) Enhancing attribute handling Making your own libraries Asynchronous module definition (AMD)—on demand resource loading This entire article is about packaging your code for reuse. Using these techniques, you can make your code more approachable and elegant. (For more resources related to this topic, see here.) Introduction to components The best explanation of a component is a packaged template with an isolated ViewModel. Here is the syntax we would use to declare a like component on the page: <div data-bind="component: "like"''"></div> If you are passing no parameters through to the component, this is the correct syntax. If you wish to pass parameters through, you would use a JSON style structure as follows: <div data-bind="component:{name: 'like-widget',params:{ approve: like} }"></div> This would allow us to pass named parameters through to our custom component. In this case, we are passing a parameter named approve. This would mean we had a bound viewModel variable by the name of like. Look at how this would be coded. Create a page called components.html using the _base.html file to speed things up as we have done in all our other articles. In your script section, create the following ViewModel: <script>ViewModel = function(){self = this;self.like = ko.observable(true);};// insert custom component herevm = new ViewModel();ko.applyBindings(vm);</script> Now, we will create our custom component. Here is the basic component we will use for this first component. Place the code where the comment is, as we want to make sure it is added before our applyBindings method is executed: ko.components.register('like-widget', { viewModel: function(params) {    this.approve = params.approve;    // Behaviors:    this.toggle = function(){      this.approve(!this.approve());    }.bind(this); }, template:    '<div class="approve">      <button data-bind="click: toggle">        <span data-bind="visible: approve" class="glyphicon   glyphicon-thumbs-up"></span>        <span data-bind="visible:! approve()" class="glyphicon   glyphicon-thumbs-down"></span>      </button>    </div>' }); There are two sections to our components: the viewModel and template sections. In this article, we will be using Knockout template details inside the component. The standard Knockout component passes variables to the component using the params structure. We can either use this structure or you could optionally use the self = this approach if desired. In addition to setting the variable structure, it is also possible to create behaviors for our components. If we look in the template code, we can see we have data-bound the click event to toggle the approve setting in our component. Then, inside the button, by binding to the visible trait of the span element, either the thumbs up or thumbs down image will be shown to the user. Yes, we are using a Bootstrap icon element rather than a graphic here. Here is a screenshot of the initial state: When we click on the thumb image, it will toggle between the thumbs up and the thumbs down version. Since we also passed in the external variable that is bound to the page ViewModel, we see that the value in the matched span text will also toggle. Here is the markup we would add to the page to produce these results in the View section of our code: <div data-bind="component:   {name: 'like-widget', params:{ approve: like} }"></div> <span data-bind="text: like"></span> You could build this type of functionality with a jQuery plugin as well, but it is likely to take a bit more code to do two-way binding and match the tight functionality we have achieved here. This doesn't mean jQuery plugins are bad, as this is also a jQuery-related technology. What it does mean is we have ways to do things even better. It is this author's opinion that features like this would still make great additions to the core jQuery library. Yet, I am not holding my breath waiting for them to adopt a Knockout-type project to the wonderful collection of projects they have at this point, and do not feel we should hold that against them. Keeping focused on what they do best is one of the reasons libraries like Knockout can provide a wider array of options. It seems the decisions are working on our behalf even if they are taking a different approach than I expected. Dynamic component selection You should have noticed when we selected the component that we did so using a quoted declaration. While at first it may seem to be more constricting, remember that it is actually a power feature. By using a variable instead of a hardcoded value, you can dynamically select the component you would like to be inserted. Here is the markup code: <div data-bind="component:  { name: widgetName, params: widgetParams }"></div> <span data-bind="text:widgetParams.approve"></span> Notice that we are passing in both widgetName as well as widgetParams. Because we are binding the structure differently, we also need to show the bound value differently in our span. Here is the script part of our code that needs to be added to our viewModel code: self.widgetName = ko.observable("like-widget"); self.widgetParams = {    approve: ko.observable(true) }; We will get the same visible results but notice that each of the like buttons is acting independent of the other. What would happen if we put more than one of the same elements on the page? If we do that, Knockout components will act independent of other components. Well, most of the time they act independent. If we bound them to the same variable they would not be independent. In your viewModel declaration code, add another variable called like2 as follows: self.like2 = ko.observable(false); Now, we will add another like button to the page by copying our first like View code. This time, change the value from like to like2 as follows: <like-widget params="approve: like2"></like-widget> <span data-bind="text: like2"></span> This time when the page loads, the other likes display with a thumbs up, but this like will display with a thumbs down. The text will also show false stored in the bound value. Any of the like buttons will act independently because each of them is bound to unique values. Here is a screenshot of the third button: Bring Your Own Tags (BYOT) What is an element? Basically, an element is a component that you reach using the tag syntax. This is the way it is expressed in the official documentation at this point and it is likely to stay that way. It is still a component under the hood. Depending on the crowd you are in, this distinction will be more or less important. Mostly, just be aware of the distinction in case someone feels it is important, as that will let you be on the same page in discussions. Custom tags are a part of the forthcoming HTML feature called Web Components. Knockout allows you to start using them today. Here is the View code: <like-widget params="approve: like3"></like-widget> <span data-bind="text: like3"></span> You may want to code some tags with a single tag rather than a double tag, as in an opening and closing tag syntax. Well, at this time, there are challenges getting each browser to see the custom element tags when declared as a single tag. This means custom tags, or elements, will need to be declared as opening and closing tags for now. We will also need to create our like3 bound variable for viewModel with the following code: self.like3 = ko.observable(true); Running the code gives us the same wonderful functionality as our data-bind approach, but now we are creating our own HTML tags. Has there ever been a time you wanted a special HTML tag that just didn't exist? There is a chance you could create that now using Knockout component element-style coding. Enhancing attribute handling Now, while custom tags are awesome, there is just something different about passing everything in with a single param attribute. The reason for this is that this process matches how our tags work when we are using the data-bind approach to coding. In the following example, we will look at passing things in via individual attributes. This is not meant to work as a data-bind approach, but it is focused completely on the custom tag element component. The first thing you want to do is make sure this enhancement doesn't cause any issues with the normal elements. We did this by checking the custom elements for a standard prefix. You do not need to work through this code as it is a bit more advanced. The easiest thing to do is to include our Knockout components tag with the following script tag: <script src="/share/js/knockout.komponents.js"></script> In this tag, we have this code segment to convert the tags that start with kom- to tags that use individual attributes rather than a JSON translation of the attributes. Feel free to borrow the code to create libraries of your own. We are going to be creating a standard set of libraries on GitHub for these component tags. Since the HTML tags are Knockout components, we are calling these libraries "KOmponents". The" resource can be found at https://github.com/sosensible/komponents. Now, with that library included, we will use our View code to connect to the new tag. Here is the code to use in the View: <kom-like approve="tagLike"></kom-like> <span data-bind="text: tagLike"></span> Notice that in our HTML markup, the tag starts with the library prefix. This will also require viewModel to have a binding to pass into this tag as follows: self.tagLike = ko.observable(true); The following is the code for the actual "attribute-aware version" of Knockout components. Do not place this in the code as it is already included in the library in the shared directory: // <kom-like /> tag ko.components.register('kom-like', { viewModel: function(params) {    // Data: value must but true to approve    this.approve = params.approve;    // Behaviors:    this.toggle = function(){      this.approve(!this.approve());    }.bind(this); }, template:    '<div class="approve">      <button data-bind="click: toggle">        <span data-bind="visible: approve" class="glyphicon   glyphicon-thumbs-up"></span>        <span data-bind="visible:! approve()" class="glyphicon   glyphicon-thumbs-down"></span>      </button>    </div>' }); The tag in the View changed as we passed the information in via named attributes and not as a JSON structure inside a param attribute. We also made sure to manage these tags by using a prefix. The reason for this is that we did not want our fancy tags to break the standard method of passing params commonly practiced with regular Knockout components. As we see, again we have another functional component with the added advantage of being able to pass the values in a style more familiar to those used to coding with HTML tags. Building your own libraries Again, we are calling our custom components KOmponents. We will be creating a number of library solutions over time and welcome others to join in. Tags will not do everything for us, as there are some limitations yet to be conquered. That doesn't mean we wait for all the features before doing the ones we can for now. In this article, we will also be showing some tags from our Bootstrap KOmponents library. First we will need to include the Bootstrap KOmponents library: <script src="/share/js/knockout.komponents.bs.js"></script> Above viewModel in our script, we need to add a function to make this section of code simpler. At times, when passing items into observables, we can pass in richer bound data using a function like this. Again, create this function above the viewModel declaration of the script, shown as follows: var listItem = function(display, students){ this.display = ko.observable(display); this.students = ko.observable(students); this.type = ko.computed(function(){    switch(Math.ceil(this.students()/5)){      case 1:      case 2:        return 'danger';        break;      case 3:        return 'warning';        break;      case 4:        return 'info';        break;      default:        return 'success';    } },this); }; Now, inside viewModel, we will declare a set of data to pass to a Bootstrap style listGroup as follows: self.listData = ko.observableArray([ new listItem("HTML5",12), new listItem("CSS",8), new listItem("JavaScript",19), new listItem("jQuery",48), new listItem("Knockout",33) ]); Each item in our array will have display, students, and type variables. We are using a number of features in Bootstrap here but packaging them all up inside our Bootstrap smart tag. This tag starts to go beyond the bare basics. It is still very implementable, but we don't want to throw too much at you to absorb at one time, so we will not go into the detailed code for this tag. What we do want to show is how much power can be wrapped into custom Knockout tags. Here is the markup we will use to call this tag and bind the correct part of viewModel for display: <kom-listgroup data="listData" badgeField="'students'"   typeField="'type'"></kom-listgroup> That is it. You should take note of a couple of special details. The data is passed in as a bound Knockout ViewModel. The badge field is passed in as a string name to declare the field on the data collection where the badge count will be pulled. The same string approach has been used for the type field. The type will set the colors as per standard Bootstrap types. The theme here is that if there are not enough students to hold a class, then it shows the danger color in the list group custom tag. Here is what it looks like in the browser when we run the code: While this is neat, let's jump into our browser tools console and change the value of one of the items. Let's say there was a class on some cool web technology called jQuery. What if people had not heard of it and didn't know what it was and you really wanted to take the class? Well, it would be nice to encourage a few others to check it out. How would you know whether the class was at a danger level or not? Well, we could simply use the badge and the numbers, but how awesome is it to also use the color coding hint? Type the following code into the console and see what changes: vm.listData()[3].display() Because JavaScript starts counting with zero for the first item, we will get the following result: Now we know we have the right item, so let's set the student count to nine using the following code in the browser console: vm.listData()[3].students(9) Notice the change in the jQuery class. Both the badge and the type value have updated. This screenshot of the update shows how much power we can wield with very little manual coding: We should also take a moment to see how the type was managed. Using the functional assignment, we were able to use the Knockout computed binding for that value. Here is the code for that part again: this.type = ko.computed(function(){ switch(Math.ceil(this.students()/5)){    case 1:    case 2:      return 'danger';      break;    case 3:      return 'warning';      break;    case 4:      return 'info';      break;    default:      return 'success'; } },this); While the code is outside the viewModel declaration, it is still able to bind properly to make our code run even inside a custom tag created with Knockout's component binding. Bootstrap component example Here is another example of binding with Bootstrap. The general best practice for using modal display boxes is to place them higher in the code, perhaps under the body tag, to make sure there are no conflicts with the rest of the code. Place this tag right below the body tag as shown in the following code: <kom-modal id="'komModal'" title="komModal.title()"   body="komModal.body()"></kom-modal> Again, we will need to make some declarations inside viewModel for this to work right. Enter this code into the declarations of viewModel: self.komModal = { title: ko.observable('Modal KOMponent'), body: ko.observable('This is the body of the <strong>modal   KOMponent</strong>.') }; We will also create a button on the page to call our viewModel. The button will use the binding that is part of Bootstrap. The data-toggle and data-target attributes are not Knockout binding features. Knockout works side-by-side wonderfully though. Another point of interest is the standard ID attribute, which tells how Bootstrap items, like this button, interact with the modal box. This is another reason it may be beneficial to use KOmponents or a library like it. Here is the markup code: <button type="button" data-toggle="modal" data-   target="#komModal">Open Modal KOmponent</button> When we click on the button, this is the requestor we see: Now, to understand the power of Knockout working with our requestor, head back over to your browser tools console. Enter the following command into the prompt: vm.komModal.body("Wow, live data binding!") The following screenshot shows the change: Who knows what type of creative modular boxes we can build using this type of technology. This brings us closer towards creating what we can imagine. Perhaps it may bring us closer to building some of the wild things our customers imagine. While that may not be your main motivation for using Knockout, it would be nice to have a few less roadblocks when we want to be creative. It would also be nice to have this wonderful ability to package and reuse these solutions across a site without using copy and paste and searching back through the code when the client makes a change to make updates. Again, feel free to look at the file to see how we made these components work. They are not extremely complicated once you get the basics of using Knockout and its components. If you are looking to build components of your own, they will help you get some insight on how to do things inside as you move your skills to the next level. Understanding the AMD approach We are going to look into the concept of what makes an AMD-style website. The point of this approach to sites is to pull content on demand. The content, or modules as they are defined here, does not need to be loaded in a particular order. If there are pieces that depend on other pieces, that is, of course, managed. We will be using the RequireJS library to manage this part of our code. We will create four files in this example, as follows: amd.html amd.config.js pick.js pick.html In our AMD page, we are going to create a configuration file for our RequireJS functionality. That will be the amd.config.js file mentioned in the aforementioned list. We will start by creating this file with the following code: // require.js settings var require = {    baseUrl: ".",    paths: {        "bootstrap":       "/share/js/bootstrap.min",        "jquery":           "/share/js/jquery.min",        "knockout":         "/share/js/knockout",        "text":             "/share/js/text"    },    shim: {        "bootstrap": { deps: ["jquery"] },        "knockout": { deps: ["jquery"] },    } }; We see here that we are creating some alias names and setting the paths these names point to for this page. The file could, of course, be working for more than one page, but in this case, it has specifically been created for a single page. The configuration in RequireJS does not need the .js extension on the file names, as you would have noted. Now, we will look at our amd.html page where we pull things together. We are again using the standard page we have used for this article, which you will notice if you preview the done file example of the code. There are a couple of differences though, because the JavaScript files do not all need to be called at the start. RequireJS handles this well for us. We are not saying this is a standard practice of AMD, but it is an introduction of the concepts. We will need to include the following three script files in this example: <script src="/share/js/knockout.js"></script> <script src="amd.config.js"></script> <script src="/share/js/require.js"></script> Notice that the configuration settings need to be set before calling the require.js library. With that set, we can create the code to wire Knockout binding on the page. This goes in our amd.html script at the bottom of the page: <script> ko.components.register('pick', { viewModel: { require: 'pick' }, template: { require: 'text!pick.html' } }); viewModel = function(){ this.choice = ko.observable(); } vm = new viewModel(); ko.applyBindings(vm); </script> Most of this code should look very familiar. The difference is that the external files are being used to set the content for viewModel and template in the pick component. The require setting smartly knows to include the pick.js file for the pick setting. It does need to be passed as a string, of course. When we include the template, you will see that we use text! in front of the file we are including. We also declare the extension on the file name in this case. The text method actually needs to know where the text is coming from, and you will see in our amd.config.js file that we created an alias for the inclusion of the text function. Now, we will create the pick.js file and place it in the same directory as the amd.html file. It could have been in another directory, and you would have to just set that in the component declaration along with the filename. Here is the code for this part of our AMD component: define(['knockout'], function(ko) {    function LikeWidgetViewModel(params) {        this.chosenValue = params.value;        this.land = Math.round(Math.random()) ? 'heads' : 'tails';    }    LikeWidgetViewModel.prototype.heads = function() {        this.chosenValue('heads');    };    LikeWidgetViewModel.prototype.tails = function() {        this.chosenValue('tails');    };    return LikeWidgetViewModel; }); Notice that our code starts with the define method. This is our AMD functionality in place. It is saying that before we try to execute this section of code we need to make sure the Knockout library is loaded. This allows us to do on-demand loading of code as needed. The code inside the viewModel section is the same as the other examples we have looked at with one exception. We return viewModel as you see at the end of the preceding code. We used the shorthand code to set the value for heads and tails in this example. Now, we will look at our template file, pick.html. This is the code we will have in this file: <div class="like-or-dislike" data-bind="visible: !chosenValue()"> <button data-bind="click: heads">Heads</button> <button data-bind="click: tails">Tails</button> </div> <div class="result" data-bind="visible: chosenValue">    You picked <strong data-bind="text: chosenValue"></strong>    The correct value was <strong data-bind="text:   land"></strong> </div> There is nothing special other than the code needed to make this example work. The goal is to allow a custom tag to offer up heads or tails options on the page. We also pass in a bound variable from viewModel. We will be passing it into three identical tags. The tags are actually going to load the content instantly in this example. The goal is to get familiar with how the code works. We will take it to full practice at the end of the article. Right now, we will put this code in the View segment of our amd.html page: <h2>One Choice</h2> <pick params="value: choice"></pick><br> <pick params="value: choice"></pick><br> <pick params="value: choice"></pick> Notice that we have included the pick tag three times. While we are passing in the bound choice item from viewModel, each tag will randomly choose heads or tails. When we run the code, this is what we will see: Since we passed the same bound item into each of the three tags, when we click on any heads or tails set, it will immediately pass that value out to viewModel, which will in turn immediately pass the value back into the other two tag sets. They are all wired together through viewModel binding being the same variable. This is the result we get if we click on Tails: Well, it is the results we got that time. Actually, the results change pretty much every time we refresh the page. Now, we are ready to do something extra special by combining our AMD approach with Knockout modules. Summary This article has shown the awesome power of templates working together with ViewModels within Knockout components. You should now have an awesome foundation to do more with less than ever before. You should know how to mingle your jQuery code with the Knockout code side by side. To review, in this article, we learned what Knockout components are. We learned how to use the components to create custom HTML elements that are interactive and powerful. We learned how to enhance custom elements to allow variables to be managed using the more common attributes approach. We learned how to use an AMD-style approach to coding with Knockout. We also learned how to AJAX everything and integrate jQuery to enhance Knockout-based solutions. What's next? That is up to you. One thing is for sure, the possibilities are broader using Knockout than they were before. Happy coding and congratulations on completing your study of KnockoutJS! Resources for Article: Further resources on this subject: Top features of KnockoutJS [article] Components [article] Web Application Testing [article]
Read more
  • 0
  • 0
  • 1776

article-image-starting-small-and-growing-modular-way
Packt
02 Mar 2015
27 min read
Save for later

Starting Small and Growing in a Modular Way

Packt
02 Mar 2015
27 min read
This article written by Carlo Russo, author of the book KnockoutJS Blueprints, describes that RequireJS gives us a simplified format to require many parameters and to avoid parameter mismatch using the CommonJS require format; for example, another way (use this or the other one) to write the previous code is: define(function(require) {   var $ = require("jquery"),       ko = require("knockout"),       viewModel = {};   $(function() {       ko.applyBindings(viewModel);   });}); (For more resources related to this topic, see here.) In this way, we skip the dependencies definition, and RequireJS will add all the texts require('xxx') found in the function to the dependency list. The second way is better because it is cleaner and you cannot mismatch dependency names with named function arguments. For example, imagine you have a long list of dependencies; you add one or remove one, and you miss removing the relative function parameter. You now have a hard-to-find bug. And, in case you think that r.js optimizer behaves differently, I just want to assure you that it's not so; you can use both ways without any concern regarding optimization. Just to remind you, you cannot use this form if you want to load scripts dynamically or by depending on variable value; for example, this code will not work: var mod = require(someCondition ? "a" : "b");if (someCondition) {   var a = require('a');} else {   var a = require('a1');} You can learn more about this compatibility problem at this URL: http://www.requirejs.org/docs/whyamd.html#commonjscompat. You can see more about this sugar syntax at this URL: http://www.requirejs.org/docs/whyamd.html#sugar. Now that you know the basic way to use RequireJS, let's look at the next concept. Component binding handler The component binding handler is one of the new features introduced in Version 2.3 of KnockoutJS. Inside the documentation of KnockoutJS, we find the following explanation: Components are a powerful, clean way of organizing your UI code into self-contained, reusable chunks. They can represent individual controls/widgets, or entire sections of your application. A component is a combination of HTML and JavaScript. The main idea behind their inclusion was to create full-featured, reusable components, with one or more points of extensibility. A component is a combination of HTML and JavaScript. There are cases where you can use just one of them, but normally you'll use both. You can get a first simple example about this here: http://knockoutjs.com/documentation/component-binding.html. The best way to create self-contained components is with the use of an AMD module loader, such as RequireJS; put the View Model and the template of the component inside two different files, and then you can use it from your code really easily. Creating the bare bones of a custom module Writing a custom module of KnockoutJS with RequireJS is a 4-step process: Creating the JavaScript file for the View Model. Creating the HTML file for the template of the View. Registering the component with KnockoutJS. Using it inside another View. We are going to build bases for the Search Form component, just to move forward with our project; anyway, this is the starting code we should use for each component that we write from scratch. Let's cover all of these steps. Creating the JavaScript file for the View Model We start with the View Model of this component. Create a new empty file with the name BookingOnline/app/components/search.js and put this code inside it: define(function(require) {var ko = require("knockout"),     template = require("text!./search.html");function Search() {}return {   viewModel: Search,   template: template};}); Here, we are creating a constructor called Search that we will fill later. We are also using the text plugin for RequireJS to get the template search.html from the current folder, into the argument template. Then, we will return an object with the constructor and the template, using the format needed from KnockoutJS to use as a component. Creating the HTML file for the template of the View In the View Model we required a View called search.html in the same folder. At the moment, we don't have any code to put inside the template of the View, because there is no boilerplate code needed; but we must create the file, otherwise RequireJS will break with an error. Create a new file called BookingOnline/app/components/search.html with the following content: <div>Hello Search</div> Registering the component with KnockoutJS When you use components, there are two different ways to give KnockoutJS a way to find your component: Using the function ko.components.register Implementing a custom component loader The first way is the easiest one: using the default component loader of KnockoutJS. To use it with our component you should just put the following row inside the BookingOnline/app/index.js file, just before the row $(function () {: ko.components.register("search", {require: "components/search"}); Here, we are registering a module called search, and we are telling KnockoutJS that it will have to find all the information it needs using an AMD require for the path components/search (so it will load the file BookingOnline/app/components/search.js). You can find more information and a really good example about a custom component loader at: http://knockoutjs.com/documentation/component-loaders.html#example-1-a-component-loader-that-sets-up-naming-conventions. Using it inside another View Now, we can simply use the new component inside our View; put the following code inside our Index View (BookingOnline/index.html), before the script tag:    <div data-bind="component: 'search'"></div> Here, we are using the component binding handler to use the component; another commonly used way is with custom elements. We can replace the previous row with the following one:    <search></search> KnockoutJS will use our search component, but with a WebComponent-like code. If you want to support IE6-8 you should register the WebComponents you are going to use before the HTML parser can find them. Normally, this job is done inside the ko.components.register function call, but, if you are putting your script tag at the end of body as we have done until now, your WebComponent will be discarded. Follow the guidelines mentioned here when you want to support IE6-8: http://knockoutjs.com/documentation/component-custom-elements.html#note-custom-elements-and-internet-explorer-6-to-8 Now, you can open your web application and you should see the text, Hello Search. We put that markup only to check whether everything was working here, so you can remove it now. Writing the Search Form component Now that we know how to create a component, and we put the base of our Search Form component, we can try to look for the requirements for this component. A designer will review the View later, so we need to keep it simple to avoid the need for multiple changes later. From our analysis, we find that our competitors use these components: Autocomplete field for the city Calendar fields for check-in and check-out Selection field for the number of rooms, number of adults and number of children, and age of children This is a wireframe of what we should build (we got inspired by Trivago): We could do everything by ourselves, but the easiest way to realize this component is with the help of a few external plugins; we are already using jQuery, so the most obvious idea is to use jQuery UI to get the Autocomplete Widget, the Date Picker Widget, and maybe even the Button Widget. Adding the AMD version of jQuery UI to the project Let's start downloading the current version of jQuery UI (1.11.1); the best thing about this version is that it is one of the first versions that supports AMD natively. After reading the documentation of jQuery UI for the AMD (URL: http://learn.jquery.com/jquery-ui/environments/amd/) you may think that you can get the AMD version using the download link from the home page. However, if you try that you will get just a package with only the concatenated source; for this reason, if you want the AMD source file, you will have to go directly to GitHub or use Bower. Download the package from https://github.com/jquery/jquery-ui/archive/1.11.1.zip and extract it. Every time you use an external library, remember to check the compatibility support. In jQuery UI 1.11.1, as you can see in the release notes, they removed the support for IE7; so we must decide whether we want to support IE6 and 7 by adding specific workarounds inside our code, or we want to remove the support for those two browsers. For our project, we need to put the following folders into these destinations: jquery-ui-1.11.1/ui -> BookingOnline/app/ui jquery-ui-1.11.1/theme/base -> BookingOnline/css/ui We are going to apply the widget by JavaScript, so the only remaining step to integrate jQuery UI is the insertion of the style sheet inside our application. We do this by adding the following rows to the top of our custom style sheet file (BookingOnline/css/styles.css): @import url("ui/core.css");@import url("ui/menu.css");@import url("ui/autocomplete.css");@import url("ui/button.css");@import url("ui/datepicker.css");@import url("ui/theme.css") Now, we are ready to add the widgets to our web application. You can find more information about jQuery UI and AMD at: http://learn.jquery.com/jquery-ui/environments/amd/ Making the skeleton from the wireframe We want to give to the user a really nice user experience, but as the first step we can use the wireframe we put before to create a skeleton of the Search Form. Replace the entire content with a form inside the file BookingOnline/components/search.html: <form data-bind="submit: execute"></form> Then, we add the blocks inside the form, step by step, to realize the entire wireframe: <div>   <input type="text" placeholder="Enter a destination" />   <label> Check In: <input type="text" /> </label>   <label> Check Out: <input type="text" /> </label>   <input type="submit" data-bind="enable: isValid" /></div> Here, we built the first row of the wireframe; we will bind data to each field later. We bound the execute function to the submit event (submit: execute), and a validity check to the button (enable: isValid); for now we will create them empty. Update the View Model (search.js) by adding this code inside the constructor: this.isValid = ko.computed(function() {return true;}, this); And add this function to the Search prototype: Search.prototype.execute = function() { }; This is because the validity of the form will depend on the status of the destination field and of the check-in date and check-out date; we will update later, in the next paragraphs. Now, we can continue with the wireframe, with the second block. Here, we should have a field to select the number of rooms, and a block for each room. Add the following markup inside the form, after the previous one, for the second row to the View (search.html): <div>   <fieldset>     <legend>Rooms</legend>     <label>       Number of Room       <select data-bind="options: rangeOfRooms,                           value: numberOfRooms">       </select>     </label>     <!-- ko foreach: rooms -->       <fieldset>         <legend>           Room <span data-bind="text: roomNumber"></span>         </legend>       </fieldset>     <!-- /ko -->   </fieldset></div> In this markup we are asking the user to choose between the values found inside the array rangeOfRooms, to save the selection inside a property called numberOfRooms, and to show a frame for each room of the array rooms with the room number, roomNumber. When developing and we want to check the status of the system, the easiest way to do it is with a simple item inside a View bound to the JSON of a View Model. Put the following code inside the View (search.html): <pre data-bind="text: ko.toJSON($data, null, 2)"></pre> With this code, you can check the status of the system with any change directly in the printed JSON. You can find more information about ko.toJSON at http://knockoutjs.com/documentation/json-data.html Update the View Model (search.js) by adding this code inside the constructor: this.rooms = ko.observableArray([]);this.numberOfRooms = ko.computed({read: function() {   return this.rooms().length;},write: function(value) {   var previousValue = this.rooms().length;   if (value > previousValue) {     for (var i = previousValue; i < value; i++) {       this.rooms.push(new Room(i + 1));     }   } else {     this.rooms().splice(value);     this.rooms.valueHasMutated();   }},owner: this}); Here, we are creating the array of rooms, and a property to update the array properly. If the new value is bigger than the previous value it adds to the array the missing item using the constructor Room; otherwise, it removes the exceeding items from the array. To get this code working we have to create a module, Room, and we have to require it here; update the require block in this way:    var ko = require("knockout"),       template = require("text!./search.html"),       Room = require("room"); Also, add this property to the Search prototype: Search.prototype.rangeOfRooms = ko.utils.range(1, 10); Here, we are asking KnockoutJS for an array with the values from the given range. ko.utils.range is a useful method to get an array of integers. Internally, it simply makes an array from the first parameter to the second one; but if you use it inside a computed field and the parameters are observable, it re-evaluates and updates the returning array. Now, we have to create the View Model of the Room module. Create a new file BookingOnline/app/room.js with the following starting code: define(function(require) {var ko = require("knockout");function Room(roomNumber) {   this.roomNumber = roomNumber;}return Room;}); Now, our web application should appear like so: As you can see, we now have a fieldset for each room, so we can work on the template of the single room. Here, you can also see in action the previous tip about the pre field with the JSON data. With KnockoutJS 3.2 it is harder to decide when it's better to use a normal template or a component. The rule of thumb is to identify the degree of encapsulation you want to manage: Use the component when you want a self-enclosed black box, or the template if you want to manage the View Model directly. What we want to show for each room is: Room number Number of adults Number of children Age of each child We can update the Room View Model (room.js) by adding this code into the constructor: this.numberOfAdults = ko.observable(2);this.ageOfChildren = ko.observableArray([]);this.numberOfChildren = ko.computed({read: function() {   return this.ageOfChildren().length;},write: function(value) {   var previousValue = this.ageOfChildren().length;   if (value > previousValue) {     for (var i = previousValue; i < value; i++) {       this.ageOfChildren.push(ko.observable(0));     }   } else {     this.ageOfChildren().splice(value);     this.ageOfChildren.valueHasMutated();   }},owner: this});this.hasChildren = ko.computed(function() {return this.numberOfChildren() > 0;}, this); We used the same logic we have used before for the mapping between the count of the room and the count property, to have an array of age of children. We also created a hasChildren property to know whether we have to show the box for the age of children inside the View. We have to add—as we have done before for the Search View Model—a few properties to the Room prototype: Room.prototype.rangeOfAdults = ko.utils.range(1, 10);Room.prototype.rangeOfChildren = ko.utils.range(0, 10);Room.prototype.rangeOfAge = ko.utils.range(0, 17); These are the ranges we show inside the relative select. Now, as the last step, we have to put the template for the room in search.html; add this code inside the fieldset tag, after the legend tag (as you can see here, with the external markup):      <fieldset>       <legend>         Room <span data-bind="text: roomNumber"></span>       </legend>       <label> Number of adults         <select data-bind="options: rangeOfAdults,                            value: numberOfAdults"></select>       </label>       <label> Number of children         <select data-bind="options: rangeOfChildren,                             value: numberOfChildren"></select>       </label>       <fieldset data-bind="visible: hasChildren">         <legend>Age of children</legend>         <!-- ko foreach: ageOfChildren -->           <select data-bind="options: $parent.rangeOfAge,                               value: $rawData"></select>         <!-- /ko -->       </fieldset>     </fieldset>     <!-- /ko --> Here, we are using the properties we have just defined. We are using rangeOfAge from $parent because inside foreach we changed context, and the property, rangeOfAge, is inside the Room context. Why did I use $rawData to bind the value of the age of the children instead of $data? The reason is that ageOfChildren is an array of observables without any container. If you use $data, KnockoutJS will unwrap the observable, making it one-way bound; but if you use $rawData, you will skip the unwrapping and get the two-way data binding we need here. In fact, if we use the one-way data binding our model won't get updated at all. If you really don't like that the fieldset for children goes to the next row when it appears, you can change the fieldset by adding a class, like this: <fieldset class="inline" data-bind="visible: hasChildren"> Now, your application should appear as follows: Now that we have a really nice starting form, we can update the three main fields to use the jQuery UI Widgets. Realizing an Autocomplete field for the destination As soon as we start to write the code for this field we face the first problem: how can we get the data from the backend? Our team told us that we don't have to care about the backend, so we speak to the backend team to know how to get the data. After ten minutes we get three files with the code for all the calls to the backend; all we have to do is to download these files (we already got them with the Starting Package, to avoid another download), and use the function getDestinationByTerm inside the module, services/rest. Before writing the code for the field let's think about which behavior we want for it: When you put three or more letters, it will ask the server for the list of items Each recurrence of the text inside the field into each item should be bold When you select an item, a new button should appear to clear the selection If the current selected item and the text inside the field are different when the focus exits from the field, it should be cleared The data should be taken using the function, getDestinationByTerm, inside the module, services/rest The documentation of KnockoutJS also explains how to create custom binding handlers in the context of RequireJS. The what and why about binding handlers All the bindings we use inside our View are based on the KnockoutJS default binding handler. The idea behind a binding handler is that you should put all the code to manage the DOM inside a component different from the View Model. Other than this, the binding handler should be realized with reusability in mind, so it's always better not to hard-code application logic inside. The KnockoutJS documentation about standard binding is already really good, and you can find many explanations about its inner working in the Appendix, Binding Handler. When you make a custom binding handler it is important to remember that: it is your job to clean after; you should register event handling inside the init function; and you should use the update function to update the DOM depending on the change of the observables. This is the standard boilerplate code when you use RequireJS: define(function(require) {var ko = require("knockout"),     $ = require("jquery");ko.bindingHandlers.customBindingHandler = {   init: function(element, valueAccessor,                   allBindingsAccessor, data, context) {     /* Code for the initialization… */     ko.utils.domNodeDisposal.addDisposeCallback(element,       function () { /* Cleaning code … */ });   },   update: function (element, valueAccessor) {     /* Code for the update of the DOM… */   }};}); And inside the View Model module you should require this module, as follows: require('binding-handlers/customBindingHandler'); ko.utils.domNodeDisposal is a list of callbacks to be executed when the element is removed from the DOM; it's necessary because it's where you have to put the code to destroy the widgets, or remove the event handlers. Binding handler for the jQuery Autocomplete widget So, now we can write our binding handler. We will define a binding handler named autocomplete, which takes the observable to put the found value. We will also define two custom bindings, without any logic, to work as placeholders for the parameters we will send to the main binding handler. Our binding handler should: Get the value for the autoCompleteOptions and autoCompleteEvents optional data bindings. Apply the Autocomplete Widget to the item using the option of the previous step. Register all the event listeners. Register the disposal of the Widget. We also should ensure that if the observable gets cleared, the input field gets cleared too. So, this is the code of the binding handler to put inside BookingOnline/app/binding-handlers/autocomplete.js (I put comments between the code to make it easier to understand): define(function(require) {var ko = require("knockout"),     $ = require("jquery"),     autocomplete = require("ui/autocomplete");ko.bindingHandlers.autoComplete = {   init: function(element, valueAccessor, allBindingsAccessor, data, context) { Here, we are giving the name autoComplete to the new binding handler, and we are also loading the Autocomplete Widget of jQuery UI: var value = ko.utils.unwrapObservable(valueAccessor()),   allBindings = ko.utils.unwrapObservable(allBindingsAccessor()),   options = allBindings.autoCompleteOptions || {},   events = allBindings.autoCompleteEvents || {},   $element = $(element); Then, we take the data from the binding for the main parameter, and for the optional binding handler; we also put the current element into a jQuery container: autocomplete(options, $element);if (options._renderItem) {   var widget = $element.autocomplete("instance");   widget._renderItem = options._renderItem;}for (var event in events) {   ko.utils.registerEventHandler(element, event, events[event]);} Now we can apply the Autocomplete Widget to the field. If you are questioning why we used ko.utils.registerEventHandler here, the answer is: to show you this function. If you look at the source, you can see that under the wood it uses $.bind if jQuery is registered; so in our case we could simply use $.bind or $.on without any problem. But I wanted to show you this function because sometimes you use KnockoutJS without jQuery, and you can use it to support event handling of every supported browser. The source code of the function _renderItem is (looking at the file ui/autocomplete.js): _renderItem: function( ul, item ) {return $( "<li>" ).text( item.label ).appendTo( ul );}, As you can see, for security reasons, it uses the function text to avoid any possible code injection. It is important that you know that you should do data validation each time you get data from an external source and put it in the page. In this case, the source of data is already secured (because we manage it), so we override the normal behavior, to also show the HTML tag for the bold part of the text. In the last three rows we put a cycle to check for events and we register them. The standard way to register for events is with the event binding handler. The only reason you should use a custom helper is to give to the developer of the View a way to register events more than once. Then, we add to the init function the disposal code: // handle disposalko.utils.domNodeDisposal.addDisposeCallback(element, function() {$element.autocomplete("destroy");}); Here, we use the destroy function of the widget. It's really important to clean up after the use of any jQuery UI Widget or you'll create a really bad memory leak; it's not a big problem with simple applications, but it will be a really big problem if you realize an SPA. Now, we can add the update function:    },   update: function(element, valueAccessor) {     var value = valueAccessor(),         $element = $(element),         data = value();     if (!data)       $element.val("");   }};}); Here, we read the value of the observable, and clean the field if the observable is empty. The update function is executed as a computed observable, so we must be sure that we subscribe to the observables required inside. So, pay attention if you put conditional code before the subscription, because your update function could be not called anymore. Now that the binding is ready, we should require it inside our form; update the View search.html by modifying the following row:    <input type="text" placeholder="Enter a destination" /> Into this:    <input type="text" placeholder="Enter a destination"           data-bind="autoComplete: destination,                     autoCompleteEvents: destination.events,                     autoCompleteOptions: destination.options" /> If you try the application you will not see any error; the reason is that KnockoutJS ignores any data binding not registered inside the ko.bindingHandlers object, and we didn't require the binding handler autocomplete module. So, the last step to get everything working is the update of the View Model of the component; add these rows at the top of the search.js, with the other require(…) rows:      Room = require("room"),     rest = require("services/rest");require("binding-handlers/autocomplete"); We need a reference to our new binding handler, and a reference to the rest object to use it as source of data. Now, we must declare the properties we used inside our data binding; add all these properties to the constructor as shown in the following code: this.destination = ko.observable();this.destination.options = { minLength: 3,source: rest.getDestinationByTerm,select: function(event, data) {   this.destination(data.item);}.bind(this),_renderItem: function(ul, item) {   return $("<li>").append(item.label).appendTo(ul);}};this.destination.events = {blur: function(event) {   if (this.destination() && (event.currentTarget.value !==                               this.destination().value)) {     this.destination(undefined);   }}.bind(this)}; Here, we are defining the container (destination) for the data selected inside the field, an object (destination.options) with any property we want to pass to the Autocomplete Widget (you can check all the documentation at: http://api.jqueryui.com/autocomplete/), and an object (destination.events) with any event we want to apply to the field. Here, we are clearing the field if the text inside the field and the content of the saved data (inside destination) are different. Have you noticed .bind(this) in the previous code? You can check by yourself that the value of this inside these functions is the input field. As you can see, in our code we put references to the destination property of this, so we have to update the context to be the object itself; the easiest way to do this is with a simple call to the bind function. Summary In this article, we have seen all some functionalities of KnockoutJS (core). The application we realized was simple enough, but we used it to learn better how to use components and custom binding handlers. If you think we put too much code for such a small project, try to think what differences you have seen between the first and the second component: the more component and binding handler code you write, the lesser you will have to write in the future. The most important point about components and custom binding handlers is that you have to realize them looking at future reuse; more good code you write, the better it will be for you later. The core point of this article was AMD and RequireJS; how to use them inside a KnockoutJS project, and why you should do it. Resources for Article: Further resources on this subject: Components [article] Web Application Testing [article] Top features of KnockoutJS [article] e to add—as we have done before for the Search View Model—  
Read more
  • 0
  • 0
  • 2180
article-image-model-view-viewmodel
Packt
02 Mar 2015
24 min read
Save for later

Model-View-ViewModel

Packt
02 Mar 2015
24 min read
In this article, by Einar Ingebrigtsen, author of the book, SignalR Blueprints, we will focus on a different programming model for client development: Model-View-ViewModel (MVVM). It will reiterate what you have already learned about SignalR, but you will also start to see a recurring theme in how you should architect decoupled software that adheres to the SOLID principles. It will also show the benefit of thinking in single page application terms (often referred to as Single Page Application (SPA)), and how SignalR really fits well with this idea. (For more resources related to this topic, see here.) The goal – an imagined dashboard A counterpart to any application is often a part of monitoring its health. Is it running? and are there any failures?. Getting this information in real time when the failure occurs is important and also getting some statistics from it is interesting. From a SignalR perspective, we will still use the hub abstraction to do pretty much what we have been doing, but the goal is to give ideas of how and what we can use SignalR for. Another goal is to dive into the architectural patterns, making it ready for larger applications. MVVM allows better separation and is very applicable for client development in general. A question that you might ask yourself is why KnockoutJS instead of something like AngularJS? It boils down to the personal preference to a certain degree. AngularJS is described as a MVW where W stands for Whatever. I find AngularJS less focused on the same things I focus on and I also find it very verbose to get it up and running. I'm not in any way an expert in AngularJS, but I have used it on a project and I found myself writing a lot to make it work the way I wanted it to in terms of MVVM. However, I don't think it's fair to compare the two. KnockoutJS is very focused in what it's trying to solve, which is just a little piece of the puzzle, while AngularJS is a full client end-to-end framework. On this note, let's just jump straight to it. Decoupling it all MVVM is a pattern for client development that became very popular in the XAML stack, enabled by Microsoft based on Martin Fowlers presentation model. Its principle is that you have a ViewModel that holds the state and exposes behavior that can be utilized from a view. The view observes any changes of the state the ViewModel exposes, making the ViewModel totally unaware that there is a view. The ViewModel is decoupled and can be put in isolation and is perfect for automated testing. As part of the state that the ViewModel typically holds is the model part, which is something it usually gets from the server, and a SignalR hub is the perfect transport to get this. It boils down to recognizing the different concerns that make up the frontend and separating it all. This gives us the following diagram: Back to basics This time we will go back in time, going down what might be considered a more purist path; use the browser elements (HTML, JavaScript, and CSS) and don't rely on any server-side rendering. Clients today are powerful and very capable and offloading the composition of what the user sees onto the client frees up server resources. You can also rely on the infrastructure of the Web for caching with static HTML files not rendered by the server. In fact, you could actually put these resources on a content delivery network, making the files available as close as possible to the end user. This would result in better load times for the user. You might have other reasons to perform server-side rendering and not just plain HTML. Leveraging existing infrastructure or third-party party tools could be those reasons. It boils down to what's right for you. But this particular sample will focus on things that the client can do. Anyways, let's get started. Open Visual Studio and create a new project by navigating to FILE | New | Project. The following dialog box will show up: From the left-hand side menu, select Web and then ASP.NET Web Application. Enter Chapter4 in the Name textbox and select your location. Select the Empty template from the template selector and make sure you deselect the Host in the cloud option. Then, click on OK, as shown in the following screenshot: Setting up the packages First, we want Twitter bootstrap. To get this, follow these steps: Add a NuGet package reference. Right-click on References in Solution Explorer and select Manage NuGet Packages and type Bootstrap in the search dialog box. Select it and then click on Install. We want a slightly different look, so we'll download one of the many bootstrap themes out here. Add a NuGet package reference called metro-bootstrap. As jQuery is still a part of this, let's add a NuGet package reference to it as well. For the MVVM part, we will use something called KnockoutJS; add it through NuGet as well. Add a NuGet package reference, as in the previous steps, but this time, type SignalR in the search dialog box. Find the package called Microsoft ASP.NET SignalR. Making any SignalR hubs available for the client Add a file called Startup.cs file to the root of the project. Add a Configuration method that will expose any SignalR hubs, as follows: public void Configuration(IAppBuilder app) { app.MapSignalR(); } At the top of the Startup.cs file, above the namespace declaration, but right below the using statements, add the following code:  [assembly: OwinStartupAttribute(typeof(Chapter4.Startup))] Knocking it out of the park KnockoutJS is a framework that implements a lot of the principles found in MVVM and makes it easier to apply. We're going to use the following two features of KnockoutJS, and it's therefore important to understand what they are and what significance they have: Observables: In order for a view to be able to know when state change in a ViewModel occurs, KnockoutJS has something called an observable for single objects or values and observable array for arrays. BindingHandlers: In the view, the counterparts that are able to recognize the observables and know how to deal with its content are known as BindingHandlers. We create binding expression in the view that instructs the view to get its content from the properties found in the binding context. The default binding context will be the ViewModel, but there are more advanced scenarios where this changes. In fact, there is a BindingHandler that enables you to specify the context at any given time called with. Our single page Whether one should strive towards having an SPA is widely discussed on the Web these days. My opinion on the subject, in the interest of the user, is that we should really try to push things in this direction. Having not to post back and cause a full reload of the page and all its resources and getting into the correct state gives the user a better experience. Some of the arguments to perform post-backs every now and then go in the direction of fixing potential memory leaks happening in the browser. Although, the technique is sound and the result is right, it really just camouflages a problem one has in the system. However, as with everything, it really depends on the situation. At the core of an SPA is a single page (pun intended), which is usually the index.html file sitting at the root of the project. Add the new index.html file and edit it as follows: Add a new HTML file (index.html) at the root of the project by right- clicking on the Chapter4 project in Solution Explorer. Navigate to Add | New Item | Web from the left-hand side menu, and then select HTML Page and name it index.html. Finally, click on Add. Let's put in the things we've added dependencies to, starting with the style sheets. In the index.html file, you'll find the <head> tag; add the following code snippet under the <title></title> tag: <link href="Content/bootstrap.min.css" rel="stylesheet" /> <link href="Content/metro-bootstrap.min.css" rel="stylesheet" /> Next, add the following code snippet right beneath the preceding code: <script type="text/javascript" src="Scripts/jquery- 1.9.0.min.js"></script> <script type="text/javascript" src="Scripts/jquery.signalR- 2.1.1.js"></script> <script type="text/javascript" src="signalr/hubs"></script> <script type="text/javascript" src="Scripts/knockout- 3.2.0.js"></script> Another thing we will need in this is something that helps us visualize things; Google has a free, open source charting library that we will use. We will take a dependency to the JavaScript APIs from Google. To do this, add the following script tag after the others: <script type="text/javascript" src="https://www.google.com/jsapi"></script> Now, we can start filling in the view part. Inside the <body> tag, we start by putting in a header, as shown here: <div class="navbar navbar-default navbar-static-top bsnavbar">     <div class="container">         <div class="navbar-header">             <h1>My Dashboard</h1>         </div>     </div> </div> The server side of things In this little dashboard thing, we will look at web requests, both successful and failed. We will perform some minor things for us to be able to do this in a very naive way, without having to flesh out a full mechanism to deal with error situations. Let's start by enabling all requests even static resources, such as HTML files, to run through all HTTP modules. A word of warning: there are performance implications of putting all requests through the managed pipeline, so normally, you wouldn't necessarily want to do this on a production system, but for this sample, it will be fine to show the concepts. Open Web.config in the project and add the following code snippet within the <configuration> tag: <system.webServer>   <modules runAllManagedModulesForAllRequests="true" /> </system.webServer> The hub In this sample, we will only have one hub, the one that will be responsible for dealing with reporting requests and failed requests. Let's add a new class called RequestStatisticsHub. Right-click on the project in Solution Explorer, select Class from Add, name it RequestStatisticsHub.cs, and then click on Add. The new class should inherit from the hub. Add the following using statement at the top: using Microsoft.AspNet.SignalR; We're going to keep a track of the count of requests and failed requests per time with a resolution of not more than every 30 seconds in the memory on the server. Obviously, if one wants to scale across multiple servers, this is way too naive and one should choose an out-of-process shared key-value store that goes across servers. However, for our purpose, this will be fine. Let's add a using statement at the top, as shown here: using System.Collections.Generic; At the top of the class, add the two dictionaries that we will use to hold this information: static Dictionary<string, int> _requestsLog = new Dictionary<string, int>(); static Dictionary<string, int> _failedRequestsLog = new Dictionary<string, int>(); In our client, we want to access these logs at startup. So let's add two methods to do so: public Dictionary<string, int> GetRequests() {     return _requestsLog; }   public Dictionary<string, int> GetFailedRequests() {     return _failedRequestsLog; } Remember the resolution of only keeping track of number of requests per 30 seconds at a time. There is no default mechanism in the .NET Framework to do this so we need to add a few helper methods to deal with rounding of time. Let's add a class called DateTimeRounding at the root of the project. Mark the class as a public static class and put the following extension methods in the class: public static DateTime RoundUp(this DateTime dt, TimeSpan d) {     var delta = (d.Ticks - (dt.Ticks % d.Ticks)) % d.Ticks;     return new DateTime(dt.Ticks + delta); }   public static DateTime RoundDown(this DateTime dt, TimeSpan d) {     var delta = dt.Ticks % d.Ticks;     return new DateTime(dt.Ticks - delta); }   public static DateTime RoundToNearest(this DateTime dt, TimeSpan d) {     var delta = dt.Ticks % d.Ticks;     bool roundUp = delta > d.Ticks / 2;       return roundUp ? dt.RoundUp(d) : dt.RoundDown(d); } Let's go back to the RequestStatisticsHub class and add some more functionality now so that we can deal with rounding of time: static void Register(Dictionary<string, int> log, Action<dynamic, string, int> hubCallback) {     var now = DateTime.Now.RoundToNearest(TimeSpan.FromSeconds(30));     var key = now.ToString("HH:mm");       if (log.ContainsKey(key))         log[key] = log[key] + 1;     else         log[key] = 1;       var hub = GlobalHost.ConnectionManager.GetHubContext<RequestStatisticsHub>() ;     hubCallback(hub.Clients.All, key, log[key]); }   public static void Request() {     Register(_requestsLog, (hub, key, value) => hub.requestCountChanged(key, value)); }   public static void FailedRequest() {     Register(_requestsLog, (hub, key, value) => hub.failedRequestCountChanged(key, value)); } This enables us to have a place to call in order to report requests and these get published back to any clients connected to this particular hub. Note the usage of GlobalHost and its ConnectionManager property. When we want to get a hub instance and when we are not in the hub context of a method being called from a client, we use ConnectionManager to get it. It gives is a proxy for the hub and enables us to call methods on any connected client. Naively dealing with requests With all this in place, we will be able to easily and naively deal with what we consider correct and failed requests. Let's add a Global.asax file by right-clicking on the project in Solution Explorer and select the New item from the Add. Navigate to Web and find Global Application Class, then click on Add. In the new file, we want to replace the BindingHandlers method with the following code snippet: protected void Application_AuthenticateRequest(object sender, EventArgs e) {     var path = HttpContext.Current.Request.Path;     if (path == "/") path = "index.html";       if (path.ToLowerInvariant().IndexOf(".html") < 0) return;       var physicalPath = HttpContext.Current.Request.MapPath(path);     if (File.Exists(physicalPath))     {         RequestStatisticsHub.Request();     }     else     {         RequestStatisticsHub.FailedRequest();     } } Basically, with this, we are only measuring requests with .html in its path, and if it's only "/", we assume it's "index.html". Any file that does not exist, accordingly, is considered an error; typically a 404 error and we register it as a failed request. Bringing it all back to the client With the server taken care of, we can start consuming all this in the client. We will now be heading down the path of creating a ViewModel and hook everything up. ViewModel Let's start by adding a JavaScript file sitting next to our index.html file at the root level of the project, call it index.js. This file will represent our ViewModel. Also, this scenario will be responsible to set up KnockoutJS, so that the ViewModel is in fact activated and applied to the page. As we only have this one page for this sample, this will be fine. Let's start by hooking up the jQuery document that is ready: $(function() { }); Inside the function created here, we will enter our viewModel definition, which will start off being an empty one: var viewModel = function() { }; KnockoutJS has a function to apply a viewModel to the document, meaning that the document or body will be associated with the viewModel instance given. Right under the definition of viewModel, add the following line: ko.applyBindings(new viewModel()); Compiling this and running it should at the very least not give you any errors but nothing more than a header saying My Dashboard. So, we need to lighten this up a bit. Inside the viewModel function definition, add the following code snippet: var self = this; this.requests = ko.observableArray(); this.failedRequests = ko.observableArray(); We enter a reference to this as a variant called self. This will help us with scoping issues later on. The arrays we added are now KnockoutJS's observable arrays that allows the view or any BindingHandler to observe the changes that are coming in. The ko.observableArray() and ko.observable() arrays both return a new function. So, if you want to access any values in it, you must unwrap it by calling it something that might seem counterintuitive at first. You might consider your variable as just another property. However, for the observableArray(), KnockoutJS adds most of the functions found in the array type in JavaScript and they can be used directly on the function without unwrapping. If you look at a variable that is an observableArray in the console of the browser, you'll see that it looks as if it actually is just any array. This is not really true though; to get to the values, you will have to unwrap it by adding () after accessing the variable. However, all the functions you're used to having on an array are here. Let's add a function that will know how to handle an entry into the viewModel function. An entry coming in is either an existing one or a new one; the key of the entry is the giveaway to decide: function handleEntry(log, key, value) {     var result = log().forEach(function (entry) {         if (entry[0] == key) {             entry[1](value);             return true;         }     });       if (result !== true) {         log.push([key, ko.observable(value)]);     } }; Let's set up the hub and add the following code to the viewModel function: var hub = $.connection.requestStatisticsHub; var initializedCount = 0;   hub.client.requestCountChanged = function (key, value) {     if (initializedCount < 2) return;     handleEntry(self.requests, key, value); }   hub.client.failedRequestCountChanged = function (key, value) {     if (initializedCount < 2) return;     handleEntry(self.failedRequests, key, value); } You might notice the initalizedCount variable. Its purpose is not to deal with requests until completely initialized, which comes next. Add the following code snippet to the viewModel function: $.connection.hub.start().done(function () {     hub.server.getRequests().done(function (requests) {         for (var property in requests) {             handleEntry(self.requests, property, requests[property]);         }           initializedCount++;     });     hub.server.getFailedRequests().done(function (requests) {         for (var property in requests) {             handleEntry(self.failedRequests, property, requests[property]);         }           initializedCount++;     }); }); We should now have enough logic in our viewModel function to actually be able to get any requests already sitting there and also respond to new ones coming. BindingHandler The key element of KnockoutJS is its BindingHandler mechanism. In KnockoutJS, everything starts with a data-bind="" attribute on an element in the HTML view. Inside the attribute, one puts binding expressions and the BindingHandlers are a key to this. Every expression starts with the name of the handler. For instance, if you have an <input> tag and you want to get the value from the input into a property on the ViewModel, you would use the BindingHandler value. There are a few BindingHandlers out of the box to deal with the common scenarios (text, value for each, and more). All of the BindingHandlers are very well documented on the KnockoutJS site. For this sample, we will actually create our own BindingHandler. KnockoutJS is highly extensible and allows you to do just this amongst other extensibility points. Let's add a JavaScript file called googleCharts.js at the root of the project. Inside it, add the following code: google.load('visualization', '1.0', { 'packages': ['corechart'] }); This will tell the Google API to enable the charting package. The next thing we want to do is to define the BindingHandler. Any handler has the option of setting up an init function and an update function. The init function should only occur once, when it's first initialized. Actually, it's when the binding context is set. If the parent binding context of the element changes, it will be called again. The update function will be called whenever there is a change in an observable or more observables that the binding expression is referring to. For our sample, we will use the init function only and actually respond to changes manually because we have a more involved scenario than what the default mechanism would provide us with. The update function that you can add to a BindingHandler has the exact same signature as the init function; hence, it is called an update. Let's add the following code underneath the load call: ko.bindingHandlers.lineChart = {     init: function (element, valueAccessor, allValueAccessors, viewModel, bindingContext) {     } }; This is the core structure of a BindingHandler. As you can see, we've named the BindingHandler as lineChart. This is the name we will use in our view later on. The signature of init and update are the same. The first parameter represents the element that holds the binding expression, whereas the second valueAccessor parameter holds a function that enables us to access the value, which is a result of the expression. KnockoutJS deals with the expression internally and parses any expression and figures out how to expand any values, and so on. Add the following code into the init function: optionsInput = valueAccessor();   var options = {     title: optionsInput.title,     width: optionsInput.width || 300,     height: optionsInput.height || 300,     backgroundColor: 'transparent',     animation: {         duration: 1000,         easing: 'out'     } };   var dataHash = {};   var chart = new google.visualization.LineChart(element); var data = new google.visualization.DataTable(); data.addColumn('string', 'x'); data.addColumn('number', 'y');   function addRow(row, rowIndex) {     var value = row[1];     if (ko.isObservable(value)) {         value.subscribe(function (newValue) {             data.setValue(rowIndex, 1, newValue);             chart.draw(data, options);         });     }       var actualValue = ko.unwrap(value);     data.addRow([row[0], actualValue]);       dataHash[row[0]] = actualValue; };   optionsInput.data().forEach(addRow);   optionsInput.data.subscribe(function (newValue) {     newValue.forEach(function(row, rowIndex) {         if( !dataHash.hasOwnProperty(row[0])) {             addRow(row,rowIndex);         }     });       chart.draw(data, options); });         chart.draw(data, options); As you can see, observables has a function called subscribe(), which is the same for both an observable array and a regular observable. The code adds a subscription to the array itself; if there is any change to the array, we will find the change and add any new row to the chart. In addition, when we create a new row, we subscribe to any change in its value so that we can update the chart. In the ViewModel, the values were converted into observable values to accommodate this. View Go back to the index.html file; we need the UI for the two charts we're going to have. Plus, we need to get both the new BindingHandler loaded and also the ViewModel. Add the following script references after the last script reference already present, as shown here: <script type="text/javascript" src="googleCharts.js"></script> <script type="text/javascript" src="index.js"></script> Inside the <body> tag below the header, we want to add a bootstrap container and a row to hold two metro styled tiles and utilize our new BindingHandler. Also, we want a footer sitting at the bottom, as shown in the following code: <div class="container">     <div class="row">         <div class="col-sm-6 col-md-4">             <div class="thumbnail tile tile-green-sea tile-large">                 <div data-bind="lineChart: { title: 'Web Requests', width: 300, height: 300, data: requests }"></div>             </div>         </div>           <div class="col-sm-6 col-md-4">             <div class="thumbnail tile tile-pomegranate tile- large">                 <div data-bind="lineChart: { title: 'Failed Web Requests', width: 300, height: 300, data: failedRequests }"></div>             </div>         </div>     </div>       <hr />     <footer class="bs-footer" role="contentinfo">         <div class="container">             The Dashboard         </div>     </footer> </div> Note the data: requests and data: failedRequests are a part of the binding expressions. These will be handled and resolved by KnockoutJS internally and pointed to the observable arrays on the ViewModel. The other properties are options that go into the BindingHandler and something it forwards to the Google Charting APIs. Trying it all out Running the preceding code (Ctrl + F5) should yield the following result: If you open a second browser and go to the same URL, you will see the change in the chart in real time. Waiting approximately for 30 seconds and refreshing the browser should add a second point automatically and also animate the chart accordingly. Typing a URL with a file that does exist should have the same effect on the failed requests chart. Summary In this article, we had a brief encounter with MVVM as a pattern with the sole purpose of establishing good practices for your client code. We added this to a single page application setting, sprinkling on top the SignalR to communicate from the server to any connected client. Resources for Article: Further resources on this subject: Using R for Statistics Research and Graphics? [article] Aspects Data Manipulation in R [article] Learning Data Analytics R and Hadoop [article]
Read more
  • 0
  • 0
  • 1928

article-image-building-color-picker-hex-rgb-conversion
Packt
02 Mar 2015
18 min read
Save for later

Building a Color Picker with Hex RGB Conversion

Packt
02 Mar 2015
18 min read
In this article by Vijay Joshi, author of the book Mastering jQuery UI, we are going to create a color selector, or color picker, that will allow the users to change the text and background color of a page using the slider widget. We will also use the spinner widget to represent individual colors. Any change in colors using the slider will update the spinner and vice versa. The hex value of both text and background colors will also be displayed dynamically on the page. (For more resources related to this topic, see here.) This is how our page will look after we have finished building it: Setting up the folder structure To set up the folder structure, follow this simple procedure: Create a folder named Article inside the MasteringjQueryUI folder. Directly inside this folder, create an HTML file and name it index.html. Copy the js and css folder inside the Article folder as well. Now go inside the js folder and create a JavaScript file named colorpicker.js. With the folder setup complete, let's start to build the project. Writing markup for the page The index.html page will consist of two sections. The first section will be a text block with some text written inside it, and the second section will have our color picker controls. We will create separate controls for text color and background color. Inside the index.html file write the following HTML code to build the page skeleton: <html> <head> <link rel="stylesheet" href="css/ui-lightness/jquery-ui- 1.10.4.custom.min.css"> </head> <body> <div class="container"> <div class="ui-state-highlight" id="textBlock"> <p> Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. </p> <p> Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. </p> <p> Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. </p> </div> <div class="clear">&nbsp;</div> <ul class="controlsContainer"> <li class="left"> <div id="txtRed" class="red slider" data-spinner="sptxtRed" data-type="text"></div><input type="text" value="0" id="sptxtRed" data-slider="txtRed" readonly="readonly" /> <div id="txtGreen" class="green slider" dataspinner=" sptxtGreen" data-type="text"></div><input type="text" value="0" id="sptxtGreen" data-slider="txtGreen" readonly="readonly" /> <div id="txtBlue" class="blue slider" dataspinner=" sptxtBlue" data-type="text"></div><input type="text" value="0" id="sptxtBlue" data-slider="txtBlue" readonly="readonly" /> <div class="clear">&nbsp;</div> Text Color : <span>#000000</span> </li> <li class="right"> <div id="bgRed" class="red slider" data-spinner="spBgRed" data-type="bg" ></div><input type="text" value="255" id="spBgRed" data-slider="bgRed" readonly="readonly" /> <div id="bgGreen" class="green slider" dataspinner=" spBgGreen" data-type="bg" ></div><input type="text" value="255" id="spBgGreen" data-slider="bgGreen" readonly="readonly" /> <div id="bgBlue" class="blue slider" data-spinner="spBgBlue" data-type="bg" ></div><input type="text" value="255" id="spBgBlue" data-slider="bgBlue" readonly="readonly" /> <div class="clear">&nbsp;</div> Background Color : <span>#ffffff</span> </li> </ul> </div> <script src="js/jquery-1.10.2.js"></script> <script src="js/jquery-ui-1.10.4.custom.min.js"></script> <script src="js/colorpicker.js"></script> </body> </html> We started by including the jQuery UI CSS file inside the head section. Proceeding to the body section, we created a div with the container class, which will act as parent div for all the page elements. Inside this div, we created another div with id value textBlock and a ui-state-highlight class. We then put some text content inside this div. For this example, we have made three paragraph elements, each having some random text inside it. After div#textBlock, there is an unordered list with the controlsContainer class. This ul element has two list items inside it. First list item has the CSS class left applied to it and the second has CSS class right applied to it. Inside li.left, we created three div elements. Each of these three div elements will be converted to a jQuery slider and will represent the red (R), green (G), and blue (B) color code, respectively. Next to each of these divs is an input element where the current color code will be displayed. This input will be converted to a spinner as well. Let's look at the first slider div and the input element next to it. The div has id txtRed and two CSS classes red and slider applied to it. The red class will be used to style the slider and the slider class will be used in our colorpicker.js file. Note that this div also has two data attributes attached to it, the first is data-spinner, whose value is the id of the input element next to the slider div we have provided as sptxtRed, the second attribute is data-type, whose value is text. The purpose of the data-type attribute is to let us know whether this slider will be used for changing the text color or the background color. Moving on to the input element next to the slider now, we have set its id as sptxtRed, which should match the value of the data-spinner attribute on the slider div. It has another attribute named data-slider, which contains the id of the slider, which it is related to. Hence, its value is txtRed. Similarly, all the slider elements have been created inside div.left and each slider has an input next to id. The data-type attribute will have the text value for all sliders inside div.left. All input elements have also been assigned a value of 0 as the initial text color will be black. The same pattern that has been followed for elements inside div.left is also followed for elements inside div.right. The only difference is that the data-type value will be bg for slider divs. For all input elements, a value of 255 is set as the background color is white in the beginning. In this manner, all the six sliders and the six input elements have been defined. Note that each element has a unique ID. Finally, there is a span element inside both div.left and div.right. The hex color code will be displayed inside it. We have placed #000000 as the default value for the text color inside the span for the text color and #ffffff as the default value for the background color inside the span for background color. Lastly, we have included the jQuery source file, the jQuery UI source file, and the colorpicker.js file. With the markup ready, we can now write the properties for the CSS classes that we used here. Styling the content To make the page presentable and structured, we need to add CSS properties for different elements. We will do this inside the head section. Go to the head section in the index.html file and write these CSS properties for different elements: <style type="text/css">   body{     color:#025c7f;     font-family:Georgia,arial,verdana;     width:700px;     margin:0 auto;   }   .container{     margin:0 auto;     font-size:14px;     position:relative;     width:700px;     text-align:justify;    } #textBlock{     color:#000000;     background-color: #ffffff;   }   .ui-state-highlight{     padding: 10px;     background: none;   }   .controlsContainer{       border: 1px solid;       margin: 0;       padding: 0;       width: 100%;       float: left;   }   .controlsContainer li{       display: inline-block;       float: left;       padding: 0 0 0 50px;       width: 299px;   }   .controlsContainer div.ui-slider{       margin: 15px 0 0;       width: 200px;       float:left;   }   .left{     border-right: 1px solid;   }   .clear{     clear: both;   }     .red .ui-slider-range{ background: #ff0000; }   .green .ui-slider-range{ background: #00ff00; }   .blue .ui-slider-range{ background: #0000ff; }     .ui-spinner{       height: 20px;       line-height: 1px;       margin: 11px 0 0 15px;     }   input[type=text]{     margin-top: 0;     width: 30px;   } </style> First, we defined some general rules for page body and div .container. Then, we defined the initial text color and background color for the div with id textBlock. Next, we defined the CSS properties for the unordered list ul .controlsContainer and its list items. We have provided some padding and width to each list item. We have also specified the width and other properties for the slider as well. Since the class ui-slider is added by jQuery UI to a slider element after it is initialized, we have added our properties in the .controlsContainer div .ui-slider rule. To make the sliders attractive, we then defined the background colors for each of the slider bars by defining color codes for red, green, and blue classes. Lastly, CSS rules have been defined for the spinner and the input box. We can now check our progress by opening the index.html page in our browser. Loading it will display a page that resembles the following screenshot: It is obvious that sliders and spinners will not be displayed here. This is because we have not written the JavaScript code required to initialize those widgets. Our next section will take care of them. Implementing the color picker In order to implement the required functionality, we first need to initialize the sliders and spinners. Whenever a slider is changed, we need to update its corresponding spinner as well, and conversely if someone changes the value of the spinner, we need to update the slider to the correct value. In case any of the value changes, we will then recalculate the current color and update the text or background color depending on the context. Defining the object structure We will organize our code using the object literal. We will define an init method, which will be the entry point. All event handlers will also be applied inside this method. To begin with, go to the js folder and open the colorpicker.js file for editing. In this file, write the code that will define the object structure and a call to it: var colorPicker = {   init : function ()   {       },   setColor : function(slider, value)   {   },   getHexColor : function(sliderType)   {   },   convertToHex : function (val)   {   } }   $(function() {   colorPicker.init(); }); An object named colorPicker has been defined with four methods. Let's see what all these methods will do: init: This method will be the entry point where we will initialize all components and add any event handlers that are required. setColor: This method will be the main method that will take care of updating the text and background colors. It will also update the value of the spinner whenever the slider moves. This method has two parameters; the slider that was moved and its current value. getHexColor: This method will be called from within setColor and it will return the hex code based on the RGB values in the spinners. It takes a sliderType parameter based on which we will decide which color has to be changed; that is, text color or background color. The actual hex code will be calculated by the next method. convertToHex: This method will convert an RGB value for color into its corresponding hex value and return it to get a HexColor method. This was an overview of the methods we are going to use. Now we will implement these methods one by one, and you will understand them in detail. After the object definition, there is the jQuery's $(document).ready() event handler that will call the init method of our object. The init method In the init method, we will initialize the sliders and the spinners and set the default values for them as well. Write the following code for the init method in the colorpicker.js file:   init : function () {   var t = this;   $( ".slider" ).slider(   {     range: "min",     max: 255,     slide : function (event, ui)     {       t.setColor($(this), ui.value);     },     change : function (event, ui)     {       t.setColor($(this), ui.value);     }   });     $('input').spinner(   {     min :0,     max : 255,     spin : function (event, ui)     {       var sliderRef = $(this).data('slider');       $('#' + sliderRef).slider("value", ui.value);     }   });       $( "#txtRed, #txtGreen, #txtBlue" ).slider('value', 0);   $( "#bgRed, #bgGreen, #bgBlue" ).slider('value', 255); } In the first line, we stored the current scope value, this, in a local variable named t. Next, we will initialize the sliders. Since we have used the CSS class slider on each slider, we can simply use the .slider selector to select all of them. During initialization, we provide four options for sliders: range, max, slide, and change. Note the value for max, which has been set to 255. Since the value for R, G, or B can be only between 0 and 255, we have set max as 255. We do not need to specify min as it is 0 by default. The slide method has also been defined, which is invoked every time the slider handle moves. The call back for slide is calling the setColor method with an instance of the current slider and the value of the current slider. The setColor method will be explained in the next section. Besides slide, the change method is also defined, which also calls the setColor method with an instance of the current slider and its value. We use both the slide and change methods. This is because a change is called once the user has stopped sliding the slider handle and the slider value has changed. Contrary to this, the slide method is called each time the user drags the slider handle. Since we want to change colors while sliding as well, we have defined the slide as well as change methods. It is time to initialize the spinners now. The spinner widget is initialized with three properties. These are min and max, and the spin. min and max method has been set to 0 and 255, respectively. Every time the up/down button on the spinner is clicked or the up/down arrow key is used, the spin method will be called. Inside this method, $(this) refers to the current spinner. We find our related slider to this spinner by reading the data-slider attribute of this spinner. Once we get the exact slider, we set its value using the value method on the slider widget. Note that calling the value method will invoke the change method of the slider as well. This is the primary reason we have defined a callback for the change event while initializing the sliders. Lastly, we will set the default values for the sliders. For sliders inside div.left, we have set the value as 0 and for sliders inside div.right, the value is set to 255. You can now check the page on your browser. You will find that the slider and the spinner elements are initialized now, with the values we specified: You can also see that changing the spinner value using either the mouse or the keyboard will update the value of the slider as well. However, changing the slider value will not update the spinner. We will handle this in the next section where we will change colors as well. Changing colors and updating the spinner The setColor method is called each time the slider or the spinner value changes. We will now define this method to change the color based on whether the slider's or spinner's value was changed. Go to the setColor method declaration and write the following code: setColor : function(slider, value) {   var t = this;   var spinnerRef = slider.data('spinner');   $('#' + spinnerRef).spinner("value", value);     var sliderType = slider.data('type')     var hexColor = t.getHexColor(sliderType);   if(sliderType == 'text')   {       $('#textBlock').css({'color' : hexColor});       $('.left span:last').text(hexColor);                  }   else   {       $('#textBlock').css({'background-color' : hexColor});       $('.right span:last').text(hexColor);                  } } In the preceding code, we receive the current slider and its value as a parameter. First we get the related spinner to this slider using the data attribute spinner. Then we set the value of the spinner to the current value of the slider. Now we find out the type of slider for which setColor is being called and store it in the sliderType variable. The value for sliderType will either be text, in case of sliders inside div.left, or bg, in case of sliders inside div.right. In the next line, we will call the getHexColor method and pass the sliderType variable as its argument. The getHexColor method will return the hex color code for the selected color. Next, based on the sliderType value, we set the color of div#textBlock. If the sliderType is text, we set the color CSS property of div#textBlock and display the selected hex code in the span inside div.left. If the sliderType value is bg, we set the background color for div#textBlock and display the hex code for the background color in the span inside div.right. The getHexColor method In the preceding section, we called the getHexColor method with the sliderType argument. Let's define it first, and then we will go through it in detail. Write the following code to define the getHexColor method: getHexColor : function(sliderType) {   var t = this;   var allInputs;   var hexCode = '#';   if(sliderType == 'text')   {     //text color     allInputs = $('.left').find('input[type=text]');   }   else   {     //background color     allInputs = $('.right').find('input[type=text]');   }   allInputs.each(function (index, element) {     hexCode+= t.convertToHex($(element).val());   });     return hexCode; } The local variable t has stored this to point to the current scope. Another variable allInputs is declared, and lastly a variable to store the hex code has been declared, whose value has been set to # initially. Next comes the if condition, which checks the value of parameter sliderType. If the value of sliderType is text, it means we need to get all the spinner values to change the text color. Hence, we use jQuery's find selector to retrieve all input boxes inside div.left. If the value of sliderType is bg, it means we need to change the background color. Therefore, the else block will be executed and all input boxes inside div.right will be retrieved. To convert the color to hex, individual values for red, green, and blue will have to be converted to hex and then concatenated to get the full color code. Therefore, we iterate in inputs using the .each method. Another method convertToHex is called, which converts the value of a single input to hex. Inside the each method, we keep concatenating the hex value of the R, G, and B components to a variable hexCode. Once all iterations are done, we return the hexCode to the parent function where it is used. Converting to hex convertToHex is a small method that accepts a value and converts it to the hex equivalent. Here is the definition of the convertToHex method: convertToHex : function (val) {   var x  = parseInt(val, 10).toString(16);   return x.length == 1 ? "0" + x : x; } Inside the method, firstly we will convert the received value to an integer using the parseInt method and then we'll use JavaScript's toString method to convert it to hex, which has base 16. In the next line, we will check the length of the converted hex value. Since we want the 6-character dash notation for color (such as #ff00ff), we need two characters each for red, green, and blue. Hence, we check the length of the created hex value. If it is only one character, we append a 0 to the beginning to make it two characters. The hex value is then returned to the parent function. With this, our implementation is complete and we can check it on a browser. Load the page in your browser and play with the sliders and spinners. You will see the text or background color changing, based on their value: You will also see the hex code displayed below the sliders. Also note that changing the sliders will change the value of the corresponding spinner and vice versa. Improving the Colorpicker This was a very basic tool that we built. You can add many more features to it and enhance its functionality. Here are some ideas to get you started: Convert it into a widget where all the required DOM for sliders and spinners is created dynamically Instead of two sliders, incorporate the text and background changing ability into a single slider with two handles, but keep two spinners as usual Summary In this article, we created a basic color picker/changer using sliders and spinners. You can use it to view and change the colors of your pages dynamically. Resources for Article: Further resources on this subject: Testing Ui Using WebdriverJs? [article] Important Aspect Angularjs Ui Development [article] Kendo Ui Dataviz Advance Charting [article]
Read more
  • 0
  • 0
  • 5586

Packt
02 Mar 2015
19 min read
Save for later

Entity Framework DB First – Inheritance Relationships between Entities

Packt
02 Mar 2015
19 min read
This article is written by Rahul Rajat Singh, the author of Mastering Entity Framework. So far, we have seen how we can use various approaches of Entity Framework, how we can manage database table relationships, and how to perform model validations using Entity Framework. In this article, we will see how we can implement the inheritance relationship between the entities. We will see how we can change the generated conceptual model to implement the inheritance relationship, and how it will benefit us in using the entities in an object-oriented manner and the database tables in a relational manner. (For more resources related to this topic, see here.) Domain modeling using inheritance in Entity Framework One of the major challenges while using a relational database is to manage the domain logic in an object-oriented manner when the database itself is implemented in a relational manner. ORMs like Entity Framework provide the strongly typed objects, that is, entities for the relational tables. However, it might be possible that the entities generated for the database tables are logically related to each other, and they can be better modeled using inheritance relationships rather than having independent entities. Entity Framework lets us create inheritance relationships between the entities, so that we can work with the entities in an object-oriented manner, and internally, the data will get persisted in the respective tables. Entity Framework provides us three ways of object relational domain modeling using the inheritance relationship: The Table per Type (TPT) inheritance The Table per Class Hierarchy (TPH) inheritance The Table per Concrete Class (TPC) inheritance Let's now take a look at the scenarios where the generated entities are not logically related, and how we can use these inheritance relationships to create a better domain model by implementing inheritance relationships between entities using the Entity Framework Database First approach. The Table per Type inheritance The Table per Type (TPT) inheritance is useful when our database has tables that are related to each other using a one-to-one relationship. This relation is being maintained in the database by a shared primary key. To illustrate this, let's take a look at an example scenario. Let's assume a scenario where an organization maintains a database of all the people who work in a department. Some of them are employees getting a fixed salary, and some of them are vendors who are hired at an hourly rate. This is modeled in the database by having all the common data in a table called Person, and there are separate tables for the data that is specific to the employees and vendors. Let's visualize this scenario by looking at the database schema: The database schema showing the TPT inheritance database schema The ID column for the People table can be an auto-increment identity column, but it should not be an auto-increment identity column for the Employee and Vendors tables. In the preceding figure, the People table contains all the data common to both type of worker. The Employee table contains the data specific to the employees and the Vendors table contains the data specific to the vendors. These tables have a shared primary key and thus, there is a one-to-one relationship between the tables. To implement the TPT inheritance, we need to perform the following steps in our application: Generate the default Entity Data Model. Delete the default relationships. Add the inheritance relationship between the entities. Use the entities via the DBContext object. Generating the default Entity Data Model Let's add a new ADO.NET Entity Data Model to our application, and generate the conceptual Entity Model for these tables. The default generated Entity Model will look like this: The generated Entity Data Model where the TPT inheritance could be used Looking at the preceding conceptual model, we can see that Entity Framework is able to figure out the one-to-one relationship between the tables and creates the entities with the same relationship. However, if we take a look at the generated entities from our application domain perspective, it is fairly evident that these entities can be better managed if they have an inheritance relationship between them. So, let's see how we can modify the generated conceptual model to implement the inheritance relationship, and Entity Framework will take care of updating the data in the respective tables. Deleting default relationships The first thing we need to do to create the inheritance relationship is to delete the existing relationship from the Entity Model. This can be done by right-clicking on the relationship and selecting Delete from Model as follows: Deleting an existing relationship from the Entity Model Adding inheritance relationships between entities Once the relationships are deleted, we can add the new inheritance relationships in our Entity Model as follows: Adding inheritance relationships in the Entity Model When we add an inheritance relationship, the Visual Entity Designer will ask for the base class and derived class as follows: Selecting the base class and derived class participating in the inheritance relationship Once the inheritance relationship is created, the Entity Model will look like this: Inheritance relationship in the Entity Model After creating the inheritance relationship, we will get a compile error that the ID property is defined in all the entities. To resolve this problem, we need to delete the ID column from the derived classes. This will still keep the ID column that maps the derived classes as it is. So, from the application perspective, the ID column is defined in the base class but from the mapping perspective, it is mapped in both the base class and derived class, so that the data will get inserted into tables mapped in both the base and derived entities. With this inheritance relationship in place, the entities can be used in an object-oriented manner, and Entity Framework will take care of updating the respective tables for each entity. Using the entities via the DBContext object As we know, DbContext is the primary class that should be used to perform various operations on entities. Let's try to use our SampleDbContext class to create an Employee and a Vendor using this Entity Model and see how the data gets updated in the database: using (SampleDbEntities db = new SampleDbEntities()) { Employee employee = new Employee(); employee.FirstName = "Employee 1"; employee.LastName = "Employee 1"; employee.PhoneNumber = "1234567"; employee.Salary = 50000; employee.EmailID = "employee1@test.com"; Vendor vendor = new Vendor(); vendor.FirstName = "vendor 1"; vendor.LastName = "vendor 1"; vendor.PhoneNumber = "1234567"; vendor.HourlyRate = 100; vendor.EmailID = "vendor1@test.com"; db.Workers.Add(employee); db.Workers.Add(vendor); db.SaveChanges(); } In the preceding code, what we are doing is creating an object of the Employee and Vendor type, and then adding them to People using the DbContext object. What Entity Framework will do internally is that it will look at the mappings of the base entity and the derived entities, and then push the respective data into the respective tables. So, if we take a look at the data inserted in the database, it will look like the following: A database snapshot of the inserted data It is clearly visible from the preceding database snapshot that Entity Framework looks at our inheritance relationship and pushes the data into the Person, Employee, and Vendor tables. The Table per Class Hierarchy inheritance The Table per Class Hierarchy (TPH) inheritance is modeled by having a single database table for all the entity classes in the inheritance hierarchy. The TPH inheritance is useful in cases where all the information about the related entities is stored in a single table. For example, using the earlier scenario, let's try to model the database in such a way that it will only contain a single table called Workers to store the Employee and Vendor details. Let's try to visualize this table: A database schema showing the TPH inheritance database schema Now what will happen in this case is that the common fields will be populated whenever we create a type of worker. Salary will only contain a value if the worker is of type Employee. The HourlyRate field will be null in this case. If the worker is of type Vendor, then the HourlyRate field will have a value, and Salary will be null. This pattern is not very elegant from a database perspective. Since we are trying to keep unrelated data in a single table, our table is not normalized. There will always be some redundant columns that contain null values if we use this approach. We should try not to use this pattern unless it is absolutely needed. To implement the TPH inheritance relationship using the preceding table structure, we need to perform the following activities: Generate the default Entity Data Model. Add concrete classes to the Entity Data Model. Map the concrete class properties to their respective tables and columns. Make the base class entity abstract. Use the entities via the DBContext object. Let's discuss this in detail. Generating the default Entity Data Model Let's now generate the Entity Data Model for this table. The Entity Framework will create a single entity, Worker, for this table: The generated model for the table created for implementing the TPH inheritance Adding concrete classes to the Entity Data Model From the application perspective, it would be a much better solution if we have classes such as Employee and Vendor, which are derived from the Worker entity. The Worker class will contain all the common properties, and Employee and Vendor will contain their respective properties. So, let's add new entities for Employee and Vendor. While creating the entity, we can specify the base class entity as Worker, which is as follows: Adding a new entity in the Entity Data Model using a base class type Similarly, we will add the Vendor entity to our Entity Data Model, and specify the Worker entity as its base class entity. Once the entities are generated, our conceptual model will look like this: The Entity Data Model after adding the derived entities Next, we have to remove the Salary and HourlyRate properties from the Worker entity, and put them in the Employee and the Vendor entities respectively. So, once the properties are put into the respective entities, our final Entity Data model will look like this: The Entity Data Model after moving the respective properties into the derived entities Mapping the concrete class properties to the respective tables and columns After this, we have to define the column mappings in the derived classes to let the derived classes know which table and column should be used to put the data. We also need to specify the mapping condition. The Employee entity should save the Salary property's value in the Salary column of the Workers table when the Salary property is Not Null and HourlyRate is Null: Table mapping and conditions to map the Employee entity to the respective tables Once this mapping is done, we have to mark the Salary property as Nullable=false in the entity property window. This will let Entity Framework know that if someone is creating an object of the Employee type, then the Salary field is mandatory: Setting the Employee entity properties as Nullable Similarly, the Vendor entity should save the HourlyRate property's value in the HourlyRate column of the Workers table when Salary is Null and HourlyRate is Not Null: Table mapping and conditions to map the Vendor entity to the respective tables And similar to the Employee class, we also have to mark the HourlyRate property as Nullable=false in the Entity Property window. This will help Entity Framework know that if someone is creating an object of the Vendor type, then the HourlyRate field is mandatory: Setting the Vendor entity properties to Nullable Making the base class entity abstract There is one last change needed to be able to use these models. To be able to use these models, we need to mark the base class as abstract, so that Entity Framework is able to resolve the object of Employee and Vendors to the Workers table. Making the base class Workers as abstract This will also be a better model from the application perspective because the Worker entity itself has no meaning from the application domain perspective. Using the entities via the DBContext object Now we have our Entity Data Model configured to use the TPH inheritance. Let's try to create an Employee object and a Vendor object, and add them to the database using the TPH inheritance hierarchy: using (SampleDbEntities db = new SampleDbEntities()){Employee employee = new Employee();employee.FirstName = "Employee 1";employee.LastName = "Employee 1";employee.PhoneNumber = "1234567";employee.Salary = 50000;employee.EmailID = "employee1@test.com";Vendor vendor = new Vendor();vendor.FirstName = "vendor 1";vendor.LastName = "vendor 1";vendor.PhoneNumber = "1234567";vendor.HourlyRate = 100;vendor.EmailID = "vendor1@test.com";db.Workers.Add(employee);db.Workers.Add(vendor);db.SaveChanges();} In the preceding code, we created objects of the Employee and Vendor types, and then added them to the Workers collection using the DbContext object. Entity Framework will look at the mappings of the base entity and the derived entities, will check the mapping conditions and the actual values of the properties, and then push the data to the respective tables. So, let's take a look at the data inserted in the Workers table: A database snapshot after inserting the data using the Employee and Vendor entities So, we can see that for our Employee and Vendor models, the actual data is being kept in the same table using Entity Framework's TPH inheritance. The Table per Concrete Class inheritance The Table per Concrete Class (TPC) inheritance can be used when the database contains separate tables for all the logical entities, and these tables have some common fields. In our existing example, if there are two separate tables of Employee and Vendor, then the database schema would look like the following: The database schema showing the TPC inheritance database schema One of the major problems in such a database design is the duplication of columns in the tables, which is not recommended from the database normalization perspective. To implement the TPC inheritance, we need to perform the following tasks: Generate the default Entity Data Model. Create the abstract class. Modify the CDSL to cater to the change. Specify the mapping to implement the TPT inheritance. Use the entities via the DBContext object. Generating the default Entity Data Model Let's now take a look at the generated entities for this database schema: The default generated entities for the TPC inheritance database schema Entity Framework has given us separate entities for these two tables. From our application domain perspective, we can use these entities in a better way if all the common properties are moved to a common abstract class. The Employee and Vendor entities will contain the properties specific to them and inherit from this abstract class to use all the common properties. Creating the abstract class Let's add a new entity called Worker to our conceptual model and move the common properties into this entity: Adding a base class for all the common properties Next, we have to mark this class as abstract from the properties window: Marking the base class as abstract class Modifying the CDSL to cater to the change Next, we have to specify the mapping for these tables. Unfortunately, the Visual Entity Designer has no support for this type of mapping, so we need to perform this mapping ourselves in the EDMX XML file. The conceptual schema definition language (CSDL) part of the EDMX file is all set since we have already moved the common properties into the abstract class. So, now we should be able to use these properties with an abstract class handle. The problem will come in the storage schema definition language (SSDL) and mapping specification language (MSL). The first thing that we need to do is to change the SSDL to let Entity Framework know that the abstract class Worker is capable of saving the data in two tables. This can be done by setting the EntitySet name in the EntityContainer tags as follows: <EntityContainer Name="todoDbModelStoreContainer">   <EntitySet Name="Employee" EntityType="Self.Employee" Schema="dbo" store_Type="Tables" />   <EntitySet Name="Vendor" EntityType="Self.Vendor" Schema="dbo" store_Type="Tables" /></EntityContainer> Specifying the mapping to implement the TPT inheritance Next, we need to change the MSL to properly map the properties to the respective tables based on the actual type of object. For this, we have to specify EntitySetMapping. The EntitySetMapping should look like the following: <EntityContainerMapping StorageEntityContainer="todoDbModelStoreContainer" CdmEntityContainer="SampleDbEntities">    <EntitySetMapping Name="Workers">   <EntityTypeMapping TypeName="IsTypeOf(SampleDbModel.Vendor)">       <MappingFragment StoreEntitySet="Vendor">       <ScalarProperty Name="HourlyRate" ColumnName="HourlyRate" />       <ScalarProperty Name="EMailId" ColumnName="EMailId" />       <ScalarProperty Name="PhoneNumber" ColumnName="PhoneNumber" />       <ScalarProperty Name="LastName" ColumnName="LastName" />       <ScalarProperty Name="FirstName" ColumnName="FirstName" />       <ScalarProperty Name="ID" ColumnName="ID" />       </MappingFragment>   </EntityTypeMapping>      <EntityTypeMapping TypeName="IsTypeOf(SampleDbModel.Employee)">       <MappingFragment StoreEntitySet="Employee">       <ScalarProperty Name="ID" ColumnName="ID" />       <ScalarProperty Name="Salary" ColumnName="Salary" />       <ScalarProperty Name="EMailId" ColumnName="EMailId" />       <ScalarProperty Name="PhoneNumber" ColumnName="PhoneNumber" />       <ScalarProperty Name="LastName" ColumnName="LastName" />       <ScalarProperty Name="FirstName" ColumnName="FirstName" />       </MappingFragment>   </EntityTypeMapping>   </EntitySetMapping></EntityContainerMapping> In the preceding code, we specified that if the actual type of object is Vendor, then the properties should map to the columns in the Vendor table, and if the actual type of entity is Employee, the properties should map to the Employee table, as shown in the following screenshot: After EDMX modifications, the mapping are visible in Visual Entity Designer If we now open the EDMX file again, we can see the properties being mapped to the respective tables in the respective entities. Doing this mapping from Visual Entity Designer is not possible, unfortunately. Using the entities via the DBContext object Let's use these "entities from our code: using (SampleDbEntities db = new SampleDbEntities()) { Employee employee = new Employee(); employee.FirstName = "Employee 1"; employee.LastName = "Employee 1"; employee.PhoneNumber = "1234567"; employee.Salary = 50000; employee.EMailId = "employee1@test.com"; Vendor vendor = new Vendor(); vendor.FirstName = "vendor 1"; vendor.LastName = "vendor 1"; vendor.PhoneNumber = "1234567"; vendor.HourlyRate = 100; vendor.EMailId = "vendor1@test.com"; db.Workers.Add(employee); db.Workers.Add(vendor); db.SaveChanges(); } In the preceding code, we created objects of the Employee and Vendor types and saved them using the Workers entity set, which is actually an abstract class. If we take a look at the inserted database, we will see the following: Database snapshot of the inserted data using TPC inheritance From the preceding screenshot, it is clear that the data is being pushed to the respective tables. The insert operation we saw in the previous code is successful but there will be an exception in the application. This exception is because when Entity Framework tries to access the values that are in the abstract class, it finds two records with same ID, and since the ID column is specified as a primary key, two records with the same value is a problem in this scenario. This exception clearly shows that the store/database generated identity columns will not work with the TPC inheritance. If we want to use the TPC inheritance, then we either need to use GUID based IDs, or pass the ID from the application, or perhaps use some database mechanism that can maintain the uniqueness of auto-generated columns across multiple tables. Choosing the inheritance strategy Now that we know about all the inheritance strategies supported by Entity Framework, let's try to analyze these approaches. The most important thing is that there is no single strategy that will work for all the scenarios. Especially if we have a legacy database. The best option would be to analyze the application requirements and then look at the existing table structure to see which approach is best suited. The Table per Class Hierarchy inheritance tends to give us denormalized tables and have redundant columns. We should only use it when the number of properties in the derived classes is very less, so that the number of redundant columns is also less, and this denormalized structure will not create problems over a period of time. Contrary to TPH, if we have a lot of properties specific to derived classes and only a few common properties, we can use the Table per Concrete Class inheritance. However, in this approach, we will end up with some properties being repeated in all the tables. Also, this approach imposes some limitations such as we cannot use auto-increment identity columns in the database. If we have a lot of common properties that could go into a base class and a lot of properties specific to derived classes, then perhaps Table per Type is the best option to go with. In any case, complex inheritance relationships that become unmanageable in the long run should be avoided. One alternative could be to have separate domain models to implement the application logic in an object-oriented manner, and then use mappers to map these domain models to Entity Framework's generated entity models. Summary In this article, we looked at the various types of inheritance relationship using Entity Framework. We saw how these inheritance relationships can be implemented, and some guidelines on which should be used in which scenario. Resources for Article: Further resources on this subject: Working with Zend Framework 2.0 [article] Hosting the service in IIS using the TCP protocol [article] Applying LINQ to Entities to a WCF Service [article]
Read more
  • 0
  • 0
  • 15753
article-image-dealing-interrupts
Packt
02 Mar 2015
19 min read
Save for later

Dealing with Interrupts

Packt
02 Mar 2015
19 min read
This article is written by Francis Perea, the author of the book Arduino Essentials. In all our previous projects, we have been constantly looking for events to occur. We have been polling, but looking for events to occur supposes a relatively big effort and a waste of CPU cycles to only notice that nothing happened. In this article, we will learn about interrupts as a totally new way to deal with events, being notified about them instead of looking for them constantly. Interrupts may be really helpful when developing projects in which fast or unknown events may occur, and thus we will see a very interesting project which will lead us to develop a digital tachograph for a computer-controlled motor. Are you ready? Here we go! (For more resources related to this topic, see here.) The concept of an interruption As you may have intuited, an interrupt is a special mechanism the CPU incorporates to have a direct channel to be noticed when some event occurs. Most Arduino microcontrollers have two of these: Interrupt 0 on digital pin 2 Interrupt 1 on digital pin 3 But some models, such as the Mega2560, come with up to five interrupt pins. Once an interrupt has been notified, the CPU completely stops what it was doing and goes on to look at it, by running a special dedicated function in our code called Interrupt Service Routine (ISR). When I say that the CPU completely stops, I mean that even functions such as delay() or millis() won't be updated while the ISR is being executed. Interrupts can be programmed to respond on different changes of the signal connected to the corresponding pin and thus the Arduino language has four predefined constants to represent each of these four modes: LOW: It will trigger the interrupt whenever the pin gets a LOW value CHANGE: The interrupt will be triggered when the pins change their values from HIGH to LOW or vice versa RISING: It will trigger the interrupt when signal goes from LOW to HIGH FALLING: It is just the opposite of RISING; the interrupt will be triggered when the signal goes from HIGH to LOW The ISR The function that the CPU will call whenever an interrupt occurs is so important to the micro that it has to accomplish a pair of rules: They can't have any parameter They can't return anything The interrupts can be executed only one at a time Regarding the first two points, they mean that we can neither pass nor receive any data from the ISR directly, but we have other means to achieve this communication with the function. We will use global variables for it. We can set and read from a global variable inside an ISR, but even so, these variables have to be declared in a special way. We have to declare them as volatile as we will see this later on in the code. The third point, which specifies that only one ISR can be attended at a time, is what makes the function millis() not being able to be updated. The millis() function relies on an interrupt to be updated, and this doesn't happen if another interrupt is already being served. As you may understand, ISR is critical to the correct code execution in a microcontroller. As a rule of thumb, we will try to keep our ISRs as simple as possible and leave all heavy weight processing that occurs outside of it, in the main loop of our code. The tachograph project To understand and manage interrupts in our projects, I would like to offer you a very particular one, a tachograph, a device that is present in all our cars and whose mission is to account for revolutions, normally the engine revolutions, but also in brake systems such as Anti-lock Brake System (ABS) and others. Mechanical considerations Well, calling it mechanical perhaps is too much, but let's make some considerations regarding how we are going to make our project account for revolutions. For this example project, I have used a small DC motor driven through a small transistor and, like in lots of industrial applications, an encoded wheel is a perfect mechanism to read the number of revolutions. By simply attaching a small disc of cardboard perpendicularly to your motor shaft, it is very easy to achieve it. By using our old friend, the optocoupler, we can sense something between its two parts, even with just a piece of cardboard with a small slot in just one side of its surface. Here, you can see the template I elaborated for such a disc, the cross in the middle will help you position the disc as perfectly as possible, that is, the cross may be as close as possible to the motor shaft. The slot has to be cut off of the black rectangle as shown in the following image: The template for the motor encoder Once I printed it, I glued it to another piece of cardboard to make it more resistant and glued it all to the crown already attached to my motor shaft. If yours doesn't have a surface big enough to glue the encoder disc to its shaft, then perhaps you can find a solution by using just a small piece of dough or similar to it. Once the encoder disc is fixed to the motor and spins attached to the motor shaft, we have to find a way to place the optocoupler in a way that makes it able to read through the encoder disc slot. In my case, just a pair of drops of glue did the trick, but if your optocoupler or motor doesn't allow you to apply this solution, I'm sure that a pair of zip ties or a small piece of dough can give you another way to fix it to the motor too. In the following image, you can see my final assembled motor with its encoder disc and optocoupler ready to be connected to the breadboard through alligator clips: The complete assembly for the motor encoder Once we have prepared our motor encoder, let's perform some tests to see it working and begin to write code to deal with interruptions. A simple interrupt tester Before going deep inside the whole code project, let's perform some tests to confirm that our encoder assembly is working fine and that we can correctly trigger an interrupt whenever the motor spins and the cardboard slot passes just through the optocoupler. The only thing you have to connect to your Arduino at the moment is the optocoupler; we will now operate our motor by hand and in a later section, we will control its speed from the computer. The test's circuit schematic is as follows: A simple circuit to test the encoder Nothing new in this circuit, it is almost the same as the one used in the optical coin detector, with the only important and necessary difference of connecting the wire coming from the detector side of the optocoupler to pin 2 of our Arduino board, because, as said in the preceding text, the interrupt 0 is available only through that pin. For this first test, we will make the encoder disc spin by hand, which allows us to clearly perceive when the interrupt triggers. For the rest of this example, we will use the LED included with the Arduino board connected to pin 13 as a way to visually indicate that the interrupts have been triggered. Our first interrupt and its ISR Once we have connected the optocoupler to the Arduino and prepared things to trigger some interrupts, let's see the code that we will use to test our assembly. The objective of this simple sketch is to commute the status of an LED every time an interrupt occurs. In the proposed tester circuit, the LED status variable will be changed every time the slot passes through the optocoupler: /*  Chapter 09 - Dealing with interrupts  A simple tester  By Francis Perea for Packt Publishing */   // A LED will be used to notify the change #define ledPin 13   // Global variables we will use // A variable to be used inside ISR volatile int status = LOW;   // A function to be called when the interrupt occurs void revolution(){   // Invert LED status   status=!status; }   // Configuration of the board: just one output void setup() {   pinMode(ledPin, OUTPUT);   // Assign the revolution() function as an ISR of interrupt 0   // Interrupt will be triggered when the signal goes from   // LOW to HIGH   attachInterrupt(0, revolution, RISING); }   // Sketch execution loop void loop(){    // Set LED status   digitalWrite(ledPin, status); } Let's take a look at its most important aspects. The LED pin apart, we declare a variable to account for changes occurring. It will be updated in the ISR of our interrupt; so, as I told you earlier, we declare it as follows: volatile int status = LOW; Following which we declare the ISR function, revolution(), which as we already know doesn't receive any parameter nor return any value. And as we said earlier, it must be as simple as possible. In our test case, the ISR simply inverts the value of the global volatile variable to its opposite value, that is, from LOW to HIGH and from HIGH to LOW. To allow our ISR to be called whenever an interrupt 0 occurs, in the setup() function, we make a call to the attachInterrupt() function by passing three parameters to it: Interrupt: The interrupt number to assign the ISR to ISR: The name without the parentheses of the function that will act as the ISR for this interrupt Mode: One of the following already explained modes that define when exactly the interrupt will be triggered In our case, the concrete sentence is as follows: attachInterrupt(0, revolution, RISING); This makes the function revolution() be the ISR of interrupt 0 that will be triggered when the signal goes from LOW to HIGH. Finally, in our main loop there is little to do. Simply update the LED based on the current value of the status variable that is going to be updated inside the ISR. If everything went right, you should see the LED commute every time the slot passes through the optocoupler as a consequence of the interrupt being triggered and the revolution() function inverting the value of the status variable that is used in the main loop to set the LED accordingly. A dial tachograph For a more complete example in this section, we will build a tachograph, a device that will present the current revolutions per minute of the motor in a visual manner by using a dial. The motor speed will be commanded serially from our computer by reusing some of the codes in our previous projects. It is not going to be very complicated if we include some way to inform about an excessive number of revolutions and even cut the engine in an extreme case to protect it, is it? The complete schematic of such a big circuit is shown in the following image. Don't get scared about the number of components as we have already seen them all in action before: The tachograph circuit As you may see, we will use a total of five pins of our Arduino board to sense and command such a set of peripherals: Pin 2: This is the interrupt 0 pin and thus it will be used to connect the output of the optocoupler. Pin 3: It will be used to deal with the servo to move the dial. Pin 4: We will use this pin to activate sound alarm once the engine current has been cut off to prevent overcharge. Pin 6: This pin will be used to deal with the motor transistor that allows us to vary the motor speed based on the commands we receive serially. Remember to use a PWM pin if you choose to use another one. Pin 13: Used to indicate with an LED an excessive number of revolutions per minute prior to cutting the engine off. There are also two more pins which, although not physically connected, will be used, pins 0 and 1, given that we are going to talk to the device serially from the computer. Breadboard connections diagram There are some wires crossed in the previous schematic, and perhaps you can see the connections better in the following breadboard connection image: Breadboard connection diagram for the tachograph The complete tachograph code This is going to be a project full of features and that is why it has such a number of devices to interact with. Let's resume the functioning features of the dial tachograph: The motor speed is commanded from the computer via a serial communication with up to five commands: Increase motor speed (+) Decrease motor speed (-) Totally stop the motor (0) Put the motor at full throttle (*) Reset the motor after a stall (R) Motor revolutions will be detected and accounted by using an encoder and an optocoupler Current revolutions per minute will be visually presented with a dial operated with a servomotor It gives visual indication via an LED of a high number of revolutions In case a maximum number of revolutions is reached, the motor current will be cut off and an acoustic alarm will sound With such a number of features, it is normal that the code for this project is going to be a bit longer than our previous sketches. Here is the code: /*  Chapter 09 - Dealing with interrupt  Complete tachograph system  By Francis Perea for Packt Publishing */   #include <Servo.h>   //The pins that will be used #define ledPin 13 #define motorPin 6 #define buzzerPin 4 #define servoPin 3   #define NOTE_A4 440 // Milliseconds between every sample #define sampleTime 500 // Motor speed increment #define motorIncrement 10 // Range of valir RPMs, alarm and stop #define minRPM  0 #define maxRPM 10000 #define alarmRPM 8000 #define stopRPM 9000   // Global variables we will use // A variable to be used inside ISR volatile unsigned long revolutions = 0; // Total number of revolutions in every sample long lastSampleRevolutions = 0; // A variable to convert revolutions per sample to RPM int rpm = 0; // LED Status int ledStatus = LOW; // An instace on the Servo class Servo myServo; // A flag to know if the motor has been stalled boolean motorStalled = false; // Thr current dial angle int dialAngle = 0; // A variable to store serial data int dataReceived; // The current motor speed int speed = 0; // A time variable to compare in every sample unsigned long lastCheckTime;   // A function to be called when the interrupt occurs void revolution(){   // Increment the total number of   // revolutions in the current sample   revolutions++; }   // Configuration of the board void setup() {   // Set output pins   pinMode(motorPin, OUTPUT);   pinMode(ledPin, OUTPUT);   pinMode(buzzerPin, OUTPUT);   // Set revolution() as ISR of interrupt 0   attachInterrupt(0, revolution, CHANGE);   // Init serial communication   Serial.begin(9600);   // Initialize the servo   myServo.attach(servoPin);   //Set the dial   myServo.write(dialAngle);   // Initialize the counter for sample time   lastCheckTime = millis(); }   // Sketch execution loop void loop(){    // If we have received serial data   if (Serial.available()) {     // read the next char      dataReceived = Serial.read();      // Act depending on it      switch (dataReceived){        // Increment speed        case '+':          if (speed<250) {            speed += motorIncrement;          }          break;        // Decrement speed        case '-':          if (speed>5) {            speed -= motorIncrement;          }          break;                // Stop motor        case '0':          speed = 0;          break;            // Full throttle           case '*':          speed = 255;          break;        // Reactivate motor after stall        case 'R':          speed = 0;          motorStalled = false;          break;      }     //Only if motor is active set new motor speed     if (motorStalled == false){       // Set the speed motor speed       analogWrite(motorPin, speed);     }   }   // If a sample time has passed   // We have to take another sample   if (millis() - lastCheckTime > sampleTime){     // Store current revolutions     lastSampleRevolutions = revolutions;     // Reset the global variable     // So the ISR can begin to count again     revolutions = 0;     // Calculate revolution per minute     rpm = lastSampleRevolutions * (1000 / sampleTime) * 60;     // Update last sample time     lastCheckTime = millis();     // Set the dial according new reading     dialAngle = map(rpm,minRPM,maxRPM,180,0);     myServo.write(dialAngle);   }   // If the motor is running in the red zone   if (rpm > alarmRPM){     // Turn on LED     digitalWrite(ledPin, HIGH);   }   else{     // Otherwise turn it off     digitalWrite(ledPin, LOW);   }   // If the motor has exceed maximum RPM   if (rpm > stopRPM){     // Stop the motor     speed = 0;     analogWrite(motorPin, speed);     // Disable it until a 'R' command is received     motorStalled = true;     // Make alarm sound     tone(buzzerPin, NOTE_A4, 1000);   }   // Send data back to the computer   Serial.print("RPM: ");   Serial.print(rpm);   Serial.print(" SPEED: ");   Serial.print(speed);   Serial.print(" STALL: ");   Serial.println(motorStalled); } It is the first time in this article that I think I have nothing to explain regarding the code that hasn't been already explained before. I have commented everything so that the code can be easily read and understood. In general lines, the code declares both constants and global variables that will be used and the ISR for the interrupt. In the setup section, all initializations of different subsystems that need to be set up before use are made: pins, interrupts, serials, and servos. The main loop begins by looking for serial commands and basically updates the speed value and the stall flag if command R is received. The final motor speed setting only occurs in case the stall flag is not on, which will occur in case the motor reaches the stopRPM value. Following with the main loop, the code looks if it has passed a sample time, in which case the revolutions are stored to compute real revolutions per minute (rpm), and the global revolutions counter incremented inside the ISR is set to 0 to begin again. The current rpm value is mapped to an angle to be presented by the dial and thus the servo is set accordingly. Next, a pair of controls is made: One to see if the motor is getting into the red zone by exceeding the max alarmRPM value and thus turning the alarm LED on And another to check if the stopRPM value has been reached, in which case the motor will be automatically cut off, the motorStalled flag is set to true, and the acoustic alarm is triggered When the motor has been stalled, it won't accept changes in its speed until it has been reset by issuing an R command via serial communication. In the last action, the code sends back some info to the Serial Monitor as another way of feedback with the operator at the computer and this should look something like the following screenshot: Serial Monitor showing the tachograph in action Modular development It has been quite a complex project in that it incorporates up to six different subsystems: optocoupler, motor, LED, buzzer, servo, and serial, but it has also helped us to understand that projects need to be developed by using a modular approach. We have worked and tested every one of these subsystems before, and that is the way it should usually be done. By developing your projects in such a submodular way, it will be easy to assemble and program the whole of the system. As you may see in the following screenshot, only by using such a modular way of working will you be able to connect and understand such a mess of wires: A working desktop may get a bit messy Summary I'm sure you have got the point regarding interrupts with all the things we have seen in this article. We have met and understood what an interrupt is and how does the CPU attend to it by running an ISR, and we have even learned about their special characteristics and restrictions and that we should keep them as little as possible. On the programming side, the only thing necessary to work with interrupts is to correctly attach the ISR with a call to the attachInterrupt() function. From the point of view of hardware, we have assembled an encoder that has been attached to a spinning motor to account for its revolutions. Finally, we have the code. We have seen a relatively long sketch, which is a sign that we are beginning to master the platform, are able to deal with a bigger number of peripherals, and that our projects require more complex software every time we have to deal with these peripherals and to accomplish all the other necessary tasks to meet what is specified in the project specifications. Resources for Article: Further resources on this subject: The Arduino Mobile Robot? [article] Using the Leap Motion Controller with Arduino [article] Android and Udoo Home Automation [article]
Read more
  • 0
  • 0
  • 28248

article-image-quick-start-guide-flume
Packt
02 Mar 2015
15 min read
Save for later

A Quick Start Guide to Flume

Packt
02 Mar 2015
15 min read
In this article by Steve Hoffman, the author of the book, Apache Flume: Distributed Log Collection for Hadoop - Second Edition, we will learn about the basics that are required to be known before we start working with Apache Flume. This article will help you get started with Flume. So, let's start with the first step: downloading and configuring Flume. (For more resources related to this topic, see here.) Downloading Flume Let's download Flume from http://flume.apache.org/. Look for the download link in the side navigation. You'll see two compressed .tar archives available along with the checksum and GPG signature files used to verify the archives. Instructions to verify the download are on the website, so I won't cover them here. Checking the checksum file contents against the actual checksum verifies that the download was not corrupted. Checking the signature file validates that all the files you are downloading (including the checksum and signature) came from Apache and not some nefarious location. Do you really need to verify your downloads? In general, it is a good idea and it is recommended by Apache that you do so. If you choose not to, I won't tell. The binary distribution archive has bin in the name, and the source archive is marked with src. The source archive contains just the Flume source code. The binary distribution is much larger because it contains not only the Flume source and the compiled Flume components (jars, javadocs, and so on), but also all the dependent Java libraries. The binary package contains the same Maven POM file as the source archive, so you can always recompile the code even if you start with the binary distribution. Go ahead, download and verify the binary distribution to save us some time in getting started. Flume in Hadoop distributions Flume is available with some Hadoop distributions. The distributions supposedly provide bundles of Hadoop's core components and satellite projects (such as Flume) in a way that ensures things such as version compatibility and additional bug fixes are taken into account. These distributions aren't better or worse; they're just different. There are benefits to using a distribution. Someone else has already done the work of pulling together all the version-compatible components. Today, this is less of an issue since the Apache BigTop project started (http://bigtop.apache.org/). Nevertheless, having prebuilt standard OS packages, such as RPMs and DEBs, ease installation as well as provide startup/shutdown scripts. Each distribution has different levels of free and paid options, including paid professional services if you really get into a situation you just can't handle. There are downsides, of course. The version of Flume bundled in a distribution will often lag quite a bit behind the Apache releases. If there is a new or bleeding-edge feature you are interested in using, you'll either be waiting for your distribution's provider to backport it for you, or you'll be stuck patching it yourself. Furthermore, while the distribution providers do a fair amount of testing, such as any general-purpose platform, you will most likely encounter something that their testing didn't cover, in which case, you are still on the hook to come up with a workaround or dive into the code, fix it, and hopefully, submit that patch back to the open source community (where, at a future point, it'll make it into an update of your distribution or the next version). So, things move slower in a Hadoop distribution world. You can see that as good or bad. Usually, large companies don't like the instability of bleeding-edge technology or making changes often, as change can be the most common cause of unplanned outages. You'd be hard pressed to find such a company using the bleeding-edge Linux kernel rather than something like Red Hat Enterprise Linux (RHEL), CentOS, Ubuntu LTS, or any of the other distributions whose target is stability and compatibility. If you are a startup building the next Internet fad, you might need that bleeding-edge feature to get a leg up on the established competition. If you are considering a distribution, do the research and see what you are getting (or not getting) with each. Remember that each of these offerings is hoping that you'll eventually want and/or need their Enterprise offering, which usually doesn't come cheap. Do your homework. Here's a short, nondefinitive list of some of the more established players. For more information, refer to the following links: Cloudera: http://cloudera.com/ Hortonworks: http://hortonworks.com/ MapR: http://mapr.com/ An overview of the Flume configuration file Now that we've downloaded Flume, let's spend some time going over how to configure an agent. A Flume agent's default configuration provider uses a simple Java property file of key/value pairs that you pass as an argument to the agent upon startup. As you can configure more than one agent in a single file, you will need to additionally pass an agent identifier (called a name) so that it knows which configurations to use. In my examples where I'm only specifying one agent, I'm going to use the name agent. By default, the configuration property file is monitored for changes every 30 seconds. If a change is detected, Flume will attempt to reconfigure itself. In practice, many of the configuration settings cannot be changed after the agent has started. Save yourself some trouble and pass the undocumented --no-reload-conf argument when starting the agent (except in development situations perhaps). If you use the Cloudera distribution, the passing of this flag is currently not possible. I've opened a ticket to fix that at https://issues.cloudera.org/browse/DISTRO-648. If this is important to you, please vote it up. Each agent is configured, starting with three parameters: agent.sources=<list of sources>agent.channels=<list of channels>agent.sinks=<list of sinks> Each source, channel, and sink also has a unique name within the context of that agent. For example, if I'm going to transport my Apache access logs, I might define a channel named access. The configurations for this channel would all start with the agent.channels.access prefix. Each configuration item has a type property that tells Flume what kind of source, channel, or sink it is. In this case, we are going to use an in-memory channel whose type is memory. The complete configuration for the channel named access in the agent named agent would be: agent.channels.access.type=memory Any arguments to a source, channel, or sink are added as additional properties using the same prefix. The memory channel has a capacity parameter to indicate the maximum number of Flume events it can hold. Let's say we didn't want to use the default value of 100; our configuration would now look like this: agent.channels.access.type=memoryagent.channels.access.capacity=200 Finally, we need to add the access channel name to the agent.channels property so that the agent knows to load it: agent.channels=access Let's look at a complete example using the canonical "Hello, World!" example. Starting up with "Hello, World!" No technical article would be complete without a "Hello, World!" example. Here is the configuration file we'll be using: agent.sources=s1agent.channels=c1agent.sinks=k agent.sources.s1.type=netcatagent.sources.s1.channels=c1agent.sources.s1.bind=0.0.0.0agent.sources.s1.port=1234 agent.channels.c1.type=memory agent.sinks.k1.type=loggeragent.sinks.k1.channel=c1 Here, I've defined one agent (called agent) who has a source named s1, a channel named c1, and a sink named k1. The s1 source's type is netcat, which simply opens a socket listening for events (one line of text per event). It requires two parameters: a bind IP and a port number. In this example, we are using 0.0.0.0 for a bind address (the Java convention to specify listen on any address) and port 12345. The source configuration also has a parameter called channels (plural), which is the name of the channel(s) the source will append events to, in this case, c1. It is plural, because you can configure a source to write to more than one channel; we just aren't doing that in this simple example. The channel named c1 is a memory channel with a default configuration. The sink named k1 is of the logger type. This is a sink that is mostly used for debugging and testing. It will log all events at the INFO level using Log4j, which it receives from the configured channel, in this case, c1. Here, the channel keyword is singular because a sink can only be fed data from one channel. Using this configuration, let's run the agent and connect to it using the Linux netcat utility to send an event. First, explode the .tar archive of the binary distribution we downloaded earlier: $ tar -zxf apache-flume-1.5.2-bin.tar.gz$ cd apache-flume-1.5.2-bin Next, let's briefly look at the help. Run the flume-ng command with the help command: $ ./bin/flume-ng helpUsage: ./bin/flume-ng <command> [options]... commands:help                 display this help textagent                run a Flume agentavro-client           run an avro Flume clientversion               show Flume version info global options:--conf,-c <conf>     use configs in <conf> directory--classpath,-C <cp>   append to the classpath--dryrun,-d          do not actually start Flume, just print the command--plugins-path <dirs> colon-separated list of plugins.d directories. See the                       plugins.d section in the user guide for more details.                       Default: $FLUME_HOME/plugins.d-Dproperty=value     sets a Java system property value-Xproperty=value     sets a Java -X option agent options:--conf-file,-f <file> specify a config file (required)--name,-n <name>     the name of this agent (required)--help,-h             display help text avro-client options:--rpcProps,-P <file>   RPC client properties file with server connection params--host,-H <host>       hostname to which events will be sent--port,-p <port>       port of the avro source--dirname <dir>       directory to stream to avro source--filename,-F <file>   text file to stream to avro source (default: std input)--headerFile,-R <file> File containing event headers as key/value pairs on each new line--help,-h             display help text Either --rpcProps or both --host and --port must be specified. Note that if <conf> directory is specified, then it is always included first in the classpath. As you can see, there are two ways with which you can invoke the command (other than the simple help and version commands). We will be using the agent command. The use of avro-client will be covered later. The agent command has two required parameters: a configuration file to use and the agent name (in case your configuration contains multiple agents). Let's take our sample configuration and open an editor (vi in my case, but use whatever you like): $ vi conf/hw.conf Next, place the contents of the preceding configuration into the editor, save, and exit back to the shell. Now you can start the agent: $ ./bin/flume-ng agent -n agent -c conf -f conf/hw.conf -Dflume.root.logger=INFO,console The -Dflume.root.logger property overrides the root logger in conf/log4j.properties to use the console appender. If we didn't override the root logger, everything would still work, but the output would go to the log/flume.log file instead of being based on the contents of the default configuration file. Of course, you can edit the conf/log4j.properties file and change the flume.root.logger property (or anything else you like). To change just the path or filename, you can set the flume.log.dir and flume.log.file properties in the configuration file or pass additional flags on the command line as follows: $ ./bin/flume-ng agent -n agent -c conf -f conf/hw.conf -Dflume.root.logger=INFO,console -Dflume.log.dir=/tmp -Dflume.log.file=flume-agent.log You might ask why you need to specify the -c parameter, as the -f parameter contains the complete relative path to the configuration. The reason for this is that the Log4j configuration file should be included on the class path. If you left the -c parameter off the command, you'll see this error: Warning: No configuration directory set! Use --conf <dir> to override.log4j:WARN No appenders could be found for logger (org.apache.flume.lifecycle.LifecycleSupervisor).log4j:WARN Please initialize the log4j system properly.log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info But you didn't do that so you should see these key log lines: 2014-10-05 15:39:06,109 (conf-file-poller-0) [INFO - org.apache.flume.conf.FlumeConfiguration.validateConfiguration(FlumeConfiguration.java:140)] Post-validation flume configuration contains configuration foragents: [agent] This line tells you that your agent starts with the name agent. Usually you'd look for this line only to be sure you started the right configuration when you have multiple configurations defined in your configuration file. 2014-10-05 15:39:06,076 (conf-file-poller-0) [INFO - org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:133)] Reloadingconfiguration file:conf/hw.conf This is another sanity check to make sure you are loading the correct file, in this case our hw.conf file. 2014-10-05 15:39:06,221 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.startAllComponents(Application.java:138)]Starting new configuration:{ sourceRunners:{s1=EventDrivenSourceRunner: { source:org.apache.flume.source.NetcatSource{name:s1,state:IDLE} }} sinkRunners:{k1=SinkRunner: { policy:org.apache.flume.sink.DefaultSinkProcessor@442fbe47 counterGroup:{ name:null counters:{} } }}channels:{c1=org.apache.flume.channel.MemoryChannel{name: c1}} } Once all the configurations have been parsed, you will see this message, which shows you everything that was configured. You can see s1, c1, and k1, and which Java classes are actually doing the work. As you probably guessed, netcat is a convenience for org.apache.flume.source.NetcatSource. We could have used the class name if we wanted. In fact, if I had my own custom source written, I would use its class name for the source's type parameter. You cannot define your own short names without patching the Flume distribution. 2014-10-05 15:39:06,427 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.source.NetcatSource.start(NetcatSource.java:164)] CreatedserverSocket:sun.nio.ch.ServerSocketChannelImpl[/0.0.0.0:12345] Here, we see that our source is now listening on port 12345 for the input. So, let's send some data to it. Finally, open a second terminal. We'll use the nc command (you can use Telnet or anything else similar) to send the Hello World string and press the Return (Enter) key to mark the end of the event: % nc localhost 12345Hello WorldOK The OK message came from the agent after we pressed the Return key, signifying that it accepted the line of text as a single Flume event. If you look at the agent log, you will see the following: 2014-10-05 15:44:11,215 (SinkRunner-PollingRunner-DefaultSinkProcessor)[INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:70)] Event: { headers:{} body: 48 65 6C 6C 6F 20 57 6F 72 6C 64Hello World } This log message shows you that the Flume event contains no headers (NetcatSource doesn't add any itself). The body is shown in hexadecimal along with a string representation (for us humans to read, in this case, our Hello World message). If I send the following line and then press the Enter key, you'll get an OK message: The quick brown fox jumped over the lazy dog. You'll see this in the agent's log: 2014-10-05 15:44:57,232 (SinkRunner-PollingRunner-DefaultSinkProcessor)[INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:70)]Event: { headers:{} body: 54 68 65 20 71 75 69 63 6B 20 62 72 6F 77 6E 20The quick brown } The event appears to have been truncated. The logger sink, by design, limits the body content to 16 bytes to keep your screen from being filled with more than what you'd need in a debugging context. If you need to see the full contents for debugging, you should use a different sink, perhaps the file_roll sink, which would write to the local filesystem. Summary In this article, we covered how to download the Flume binary distribution. We created a simple configuration file that included one source writing to one channel, feeding one sink. The source listened on a socket for network clients to connect to and to send it event data. These events were written to an in-memory channel and then fed to a Log4j sink to become the output. We then connected to our listening agent using the Linux netcat utility and sent some string events to our Flume agent's source. Finally, we verified that our Log4j-based sink wrote the events out. Resources for Article: Further resources on this subject: About Cassandra [article] Introducing Kafka [article] Transformation [article]
Read more
  • 0
  • 0
  • 7160
Modal Close icon
Modal Close icon