How-To Tutorials

7019 Articles
Packt
03 Mar 2015
11 min read

Getting Started with PostgreSQL

In this article by Ibrar Ahmed, Asif Fayyaz, and Amjad Shahzad, authors of the book PostgreSQL Developer's Guide, we will come across the basic features and functions of PostgreSQL, such as writing queries using psql, data definition in tables, and data manipulation from tables.

PostgreSQL is widely considered to be one of the most stable database servers available today, with multiple features that include:

  • A wide range of built-in types
  • MVCC
  • New SQL enhancements, including foreign keys, primary keys, and constraints
  • Open source code, maintained by a team of developers
  • Trigger and procedure support with multiple procedural languages
  • Extensibility, in the sense of adding new data types and client languages

Since the early releases of PostgreSQL (from version 6.0, that is), many changes have been made, with each new major version adding new and more advanced features. At the time of writing, the current version is PostgreSQL 9.4, which is available from several sources and in various binary formats.

Writing queries using psql

Throughout this article, we will use a warehouse database called warehouse_db. In this section, I will show you how to create such a database, with sample code for assistance.

We are assuming here that you have successfully installed PostgreSQL and faced no issues. Now, you will need to connect with the default database that is created by the PostgreSQL installer.
To do this, navigate from your command line to the default installation path, which is /opt/PostgreSQL/9.4/bin, and execute the following command. It will prompt for the postgres user password that you provided during the installation:

    /opt/PostgreSQL/9.4/bin$ ./psql -U postgres
    Password for user postgres:

Once you have logged in to the default database as the user postgres, you will see the following on your command line:

    psql (9.4beta1)
    Type "help" for help
    postgres=#

You can then create a new database called warehouse_db using the following statement in the terminal:

    postgres=# CREATE DATABASE warehouse_db;

You can then connect with the warehouse_db database using the following psql meta-command (note the leading backslash):

    postgres=# \c warehouse_db

You are now connected to the warehouse_db database as the user postgres, and you will have the following warehouse_db shell:

    warehouse_db=#

Let's summarize what we have achieved so far. We are now able to connect with the default database postgres and have created a warehouse_db database successfully. It's now time to actually write queries using psql and perform some Data Definition Language (DDL) and Data Manipulation Language (DML) operations, which we will cover in the following sections.

In PostgreSQL, we can have multiple databases. Inside the databases, we can have multiple extensions and schemas. Inside each schema, we can have database objects such as tables, views, sequences, procedures, and functions.

We are first going to create a schema named record and then we will create some tables in this schema. To create a schema named record in the warehouse_db database, use the following statement:

    warehouse_db=# CREATE SCHEMA record;

Creating, altering, and truncating a table

In this section, we will learn about creating a table, altering the table definition, and truncating the table.

Creating tables

Now, let's perform some DDL operations, starting with creating tables.
To create a table named warehouse_tbl, execute the following statement:

    warehouse_db=# CREATE TABLE warehouse_tbl
    (
      warehouse_id INTEGER NOT NULL,
      warehouse_name TEXT NOT NULL,
      year_created INTEGER,
      street_address TEXT,
      city CHARACTER VARYING(100),
      state CHARACTER VARYING(2),
      zip CHARACTER VARYING(10),
      CONSTRAINT "PRIM_KEY" PRIMARY KEY (warehouse_id)
    );

The preceding statement created the table warehouse_tbl, which has the primary key warehouse_id.

Now that you are familiar with the table creation syntax, let's create a sequence and use it in a table. You can create the hist_id_seq sequence using the following statement:

    warehouse_db=# CREATE SEQUENCE hist_id_seq;

The preceding CREATE SEQUENCE command creates a new sequence number generator. This involves creating and initializing a new special single-row table with the name hist_id_seq. The user issuing the command will own the generator.

You can now create the table that uses the hist_id_seq sequence with the following statement:

    warehouse_db=# CREATE TABLE history
    (
      history_id INTEGER NOT NULL DEFAULT nextval('hist_id_seq'),
      date TIMESTAMP WITHOUT TIME ZONE,
      amount INTEGER,
      data TEXT,
      customer_id INTEGER,
      warehouse_id INTEGER,
      CONSTRAINT "PRM_KEY" PRIMARY KEY (history_id),
      CONSTRAINT "FORN_KEY" FOREIGN KEY (warehouse_id)
        REFERENCES warehouse_tbl(warehouse_id)
    );

The preceding query will create a history table in the warehouse_db database, and the history_id column uses the sequence as its default input value.

In this section, we learned how to create a table and how to use a sequence inside the table creation syntax.

Altering tables

Now that we have learned how to create multiple tables, we can practice some ALTER TABLE commands by following this section. With the ALTER TABLE command, we can add, remove, or rename table columns.
Firstly, with the help of the following example, we will add the phone_no column to the previously created table warehouse_tbl:

    warehouse_db=# ALTER TABLE warehouse_tbl ADD COLUMN phone_no INTEGER;

We can then verify that the column has been added by describing the table with the \d meta-command as follows:

    warehouse_db=# \d warehouse_tbl
              Table "public.warehouse_tbl"
         Column     |          Type          | Modifiers
    ----------------+------------------------+-----------
     warehouse_id   | integer                | not null
     warehouse_name | text                   | not null
     year_created   | integer                |
     street_address | text                   |
     city           | character varying(100) |
     state          | character varying(2)   |
     zip            | character varying(10)  |
     phone_no       | integer                |
    Indexes:
        "PRIM_KEY" PRIMARY KEY, btree (warehouse_id)
    Referenced by:
        TABLE "history" CONSTRAINT "FORN_KEY" FOREIGN KEY (warehouse_id) REFERENCES warehouse_tbl(warehouse_id)

To drop a column from a table, we can use the following statement:

    warehouse_db=# ALTER TABLE warehouse_tbl DROP COLUMN phone_no;

We can then verify that the column has been removed from the table by describing the table again as follows:

    warehouse_db=# \d warehouse_tbl
              Table "public.warehouse_tbl"
         Column     |          Type          | Modifiers
    ----------------+------------------------+-----------
     warehouse_id   | integer                | not null
     warehouse_name | text                   | not null
     year_created   | integer                |
     street_address | text                   |
     city           | character varying(100) |
     state          | character varying(2)   |
     zip            | character varying(10)  |
    Indexes:
        "PRIM_KEY" PRIMARY KEY, btree (warehouse_id)
    Referenced by:
        TABLE "history" CONSTRAINT "FORN_KEY" FOREIGN KEY (warehouse_id) REFERENCES warehouse_tbl(warehouse_id)

Truncating tables

The TRUNCATE command is used to remove all rows from a table without providing any criteria. The DELETE command, by contrast, is normally used with a WHERE clause that specifies which rows to remove.

Because the history table references warehouse_tbl through the FORN_KEY foreign key, PostgreSQL rejects a plain TRUNCATE TABLE warehouse_tbl. Adding CASCADE truncates the referencing table as well, so to truncate data from the table, we can use the following statement:

    warehouse_db=# TRUNCATE TABLE warehouse_tbl CASCADE;

We can then verify that the warehouse_tbl table has been truncated by performing a SELECT COUNT(*) query on it using the following statement:

    warehouse_db=# SELECT COUNT(*) FROM warehouse_tbl;
     count
    -------
         0
    (1 row)

Inserting, updating, and deleting data from tables

In this section, we will play around with data and learn how to insert, update, and delete data from a table.

Inserting data

So far, we have learned how to create and alter a table. Now it's time to work with some data. Let's start by inserting a record in the warehouse_tbl table using the following statement:

    warehouse_db=# INSERT INTO warehouse_tbl
    (
      warehouse_id,
      warehouse_name,
      year_created,
      street_address,
      city,
      state,
      zip
    )
    VALUES
    (
      1,
      'Mark Corp',
      2009,
      '207-F Main Service Road East',
      'New London',
      'CT',
      4321
    );

We can then verify that the record has been inserted by performing a SELECT query on the warehouse_tbl table as follows:

    warehouse_db=# SELECT warehouse_id, warehouse_name, street_address
                   FROM warehouse_tbl;
     warehouse_id | warehouse_name |        street_address
    --------------+----------------+------------------------------
                1 | Mark Corp      | 207-F Main Service Road East
    (1 row)

Updating data

Once we have inserted data in our table, we should know how to update it.
This can be done using the following statement:

    warehouse_db=# UPDATE warehouse_tbl
                   SET year_created=2010
                   WHERE year_created=2009;

To verify that the record is updated, let's perform a SELECT query on the warehouse_tbl table as follows:

    warehouse_db=# SELECT warehouse_id, year_created FROM warehouse_tbl;
     warehouse_id | year_created
    --------------+--------------
                1 |         2010
    (1 row)

Deleting data

To delete data from a table, we can use the DELETE command. Let's add a few records to the table and then delete data on the basis of certain conditions:

    warehouse_db=# INSERT INTO warehouse_tbl
    (
      warehouse_id,
      warehouse_name,
      year_created,
      street_address,
      city,
      state,
      zip
    )
    VALUES
    (
      2,
      'Bill & Co',
      2014,
      'Lilly Road',
      'New London',
      'CT',
      4321
    );

    warehouse_db=# INSERT INTO warehouse_tbl
    (
      warehouse_id,
      warehouse_name,
      year_created,
      street_address,
      city,
      state,
      zip
    )
    VALUES
    (
      3,
      'West point',
      2013,
      'Down Town',
      'New London',
      'CT',
      4321
    );

We can then delete the rows of the warehouse_tbl table where warehouse_name is Bill & Co by executing the following statement:

    warehouse_db=# DELETE FROM warehouse_tbl
                   WHERE warehouse_name='Bill & Co';

To verify that the record has been deleted, we will execute the following SELECT query:

    warehouse_db=# SELECT warehouse_id, warehouse_name
                   FROM warehouse_tbl
                   WHERE warehouse_name='Bill & Co';
     warehouse_id | warehouse_name
    --------------+----------------
    (0 rows)

The DELETE command is used to remove rows from a table, whereas the DROP command is used to drop a complete table. The TRUNCATE command is used to empty the whole table.

Summary

In this article, we learned how to use SQL for a collection of everyday DBMS exercises in an easy-to-use, practical way. We also saw how to build a complete database using DDL (create, alter, and truncate) and DML (insert, update, and delete) operations.
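The distinction drawn in the Deleting data section between DELETE, TRUNCATE, and DROP can also be tried without a PostgreSQL server. The following is a minimal sketch in Python that uses the standard-library sqlite3 module as a stand-in. SQLite is an assumption made here for portability, not part of the article, and it has no TRUNCATE command, so an unqualified DELETE plays that role. The function name delete_vs_drop_demo is hypothetical; the table and rows mirror a reduced warehouse_tbl.

```python
import sqlite3


def delete_vs_drop_demo():
    """Show row-level DELETE, emptying a table, and DROP on an in-memory database."""
    conn = sqlite3.connect(":memory:")  # SQLite stand-in, not PostgreSQL
    cur = conn.cursor()
    cur.execute(
        "CREATE TABLE warehouse_tbl ("
        " warehouse_id INTEGER NOT NULL PRIMARY KEY,"
        " warehouse_name TEXT NOT NULL,"
        " year_created INTEGER)"
    )
    cur.executemany(
        "INSERT INTO warehouse_tbl VALUES (?, ?, ?)",
        [(1, "Mark Corp", 2010), (2, "Bill & Co", 2014), (3, "West point", 2013)],
    )

    result = {}

    # DELETE with a WHERE clause removes only the matching rows.
    cur.execute("DELETE FROM warehouse_tbl WHERE warehouse_name = ?", ("Bill & Co",))
    result["rows_after_delete"] = cur.execute(
        "SELECT COUNT(*) FROM warehouse_tbl"
    ).fetchone()[0]

    # DELETE without WHERE empties the table (the role TRUNCATE plays in PostgreSQL).
    cur.execute("DELETE FROM warehouse_tbl")
    result["rows_after_emptying"] = cur.execute(
        "SELECT COUNT(*) FROM warehouse_tbl"
    ).fetchone()[0]

    # DROP removes the table definition itself, not just its rows.
    cur.execute("DROP TABLE warehouse_tbl")
    result["table_exists"] = cur.execute(
        "SELECT COUNT(*) FROM sqlite_master"
        " WHERE type = 'table' AND name = 'warehouse_tbl'"
    ).fetchone()[0] == 1

    conn.close()
    return result


if __name__ == "__main__":
    print(delete_vs_drop_demo())
```

After the conditional DELETE two rows remain, after the unqualified DELETE the table is empty but still exists, and after DROP the table is gone from the catalog.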

Packt
03 Mar 2015
13 min read

Performance Considerations

In this article by Dayong Du, the author of Apache Hive Essentials, we will look at the different performance considerations when using Hive.

Although Hive is built to deal with big data, we still cannot ignore the importance of performance. Most of the time, a Hive query can rely on the smart query optimizer to find the best execution strategy, and on the default best-practice settings from vendor packages. However, as experienced users, we should learn more about the theory and practice of performance tuning in Hive, especially when working in a performance-sensitive project or environment. We will start with the utilities available in Hive to find potential issues causing poor performance. Then, we introduce the best practices for performance in the areas of queries and jobs.

Performance utilities

Hive provides the EXPLAIN and ANALYZE statements, which can be used as utilities to check and identify the performance of queries.

The EXPLAIN statement

Hive provides an EXPLAIN command to return a query execution plan without running the query. We can use an EXPLAIN command for queries if we have a doubt or a concern about performance. The EXPLAIN command also helps to see the difference between two or more queries written for the same purpose. The syntax for EXPLAIN is as follows:

    EXPLAIN [EXTENDED|DEPENDENCY|AUTHORIZATION] hive_query

The following keywords can be used:

  • EXTENDED: This provides additional information for the operators in the plan, such as file pathname and abstract syntax tree.
  • DEPENDENCY: This provides a JSON format output that contains a list of tables and partitions that the query depends on. It is available since Hive 0.10.0.
  • AUTHORIZATION: This lists all entities that need to be authorized, including input and output, to run the Hive query, as well as authorization failures, if any. It is available since Hive 0.14.0.

A typical query plan contains the following three sections.
We will also have a look at an example later:

  • Abstract syntax tree (AST): Hive uses a parser generator called ANTLR (see http://www.antlr.org/) to automatically generate a tree of syntax for HQL. We can usually ignore this.
  • Stage dependencies: This lists all dependencies and the number of stages used to run the query.
  • Stage plans: This contains important information, such as operators and sort orders, for running the job.

The following is what a typical query plan looks like. In this example, the AST section is not shown since the EXTENDED keyword is not used with EXPLAIN. In the STAGE DEPENDENCIES section, both Stage-0 and Stage-1 are independent root stages. In the STAGE PLANS section, Stage-1 has one map and one reduce, referred to by Map Operator Tree and Reduce Operator Tree. Inside each Map/Reduce Operator Tree section, all operators corresponding to Hive query keywords, as well as expressions and aggregations, are listed. The Stage-0 stage does not have map and reduce. It is just a Fetch operation.

    jdbc:hive2://> EXPLAIN SELECT sex_age.sex, count(*)
    . . . . . . .> FROM employee_partitioned
    . . . . . . .> WHERE year=2014 GROUP BY sex_age.sex LIMIT 2;
    +-----------------------------------------------------------------------------+
    | Explain |
    +-----------------------------------------------------------------------------+
    | STAGE DEPENDENCIES: |
    |   Stage-1 is a root stage |
    |   Stage-0 is a root stage |
    | |
    | STAGE PLANS: |
    |   Stage: Stage-1 |
    |     Map Reduce |
    |       Map Operator Tree: |
    |           TableScan |
    |             alias: employee_partitioned |
    |             Statistics: Num rows: 0 Data size: 227 Basic stats: PARTIAL Column stats: NONE |
    |             Select Operator |
    |               expressions: sex_age (type: struct<sex:string,age:int>) |
    |               outputColumnNames: sex_age |
    |               Statistics: Num rows: 0 Data size: 227 Basic stats: PARTIAL Column stats: NONE |
    |               Group By Operator |
    |                 aggregations: count() |
    |                 keys: sex_age.sex (type: string) |
    |                 mode: hash |
    |                 outputColumnNames: _col0, _col1 |
    |                 Statistics: Num rows: 0 Data size: 227 Basic stats: PARTIAL Column stats: NONE |
    |                 Reduce Output Operator |
    |                   key expressions: _col0 (type: string) |
    |                   sort order: + |
    |                   Map-reduce partition columns: _col0 (type: string) |
    |                   Statistics: Num rows: 0 Data size: 227 Basic stats: PARTIAL Column stats: NONE |
    |                   value expressions: _col1 (type: bigint) |
    |       Reduce Operator Tree: |
    |         Group By Operator |
    |           aggregations: count(VALUE._col0) |
    |           keys: KEY._col0 (type: string) |
    |           mode: mergepartial |
    |           outputColumnNames: _col0, _col1 |
    |           Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE |
    |           Select Operator |
    |             expressions: _col0 (type: string), _col1 (type: bigint) |
    |             outputColumnNames: _col0, _col1 |
    |             Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE |
    |             Limit |
    |               Number of rows: 2 |
    |               Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE |
    |               File Output Operator |
    |                 compressed: false |
    |                 Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE |
    |                 table: |
    |                     input format: org.apache.hadoop.mapred.TextInputFormat |
    |                     output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
    |                     serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
    | |
    |   Stage: Stage-0 |
    |     Fetch Operator |
    |       limit: 2 |
    +-----------------------------------------------------------------------------+
    53 rows selected (0.26 seconds)

The ANALYZE statement

Hive statistics are a collection of data that describe the objects in the Hive database in more detail, such as the number of rows, the number of files, and the raw data size. Statistics are metadata about Hive data. Hive supports statistics at the table, partition, and column level. These statistics serve as an input to the Hive Cost-Based Optimizer (CBO), which is an optimizer that picks the query plan with the lowest cost in terms of the system resources required to complete the query.

Since Hive 0.10.0, statistics are gathered through the ANALYZE statement on tables, partitions, and columns, as given in the following examples:

    jdbc:hive2://> ANALYZE TABLE employee COMPUTE STATISTICS;
    No rows affected (27.979 seconds)

    jdbc:hive2://> ANALYZE TABLE employee_partitioned
    . . . . . . .> PARTITION(year=2014, month=12) COMPUTE STATISTICS;
    No rows affected (45.054 seconds)

    jdbc:hive2://> ANALYZE TABLE employee_id COMPUTE STATISTICS
    . . . . . . .> FOR COLUMNS employee_id;
    No rows affected (41.074 seconds)

Once the statistics are built, we can check them with the DESCRIBE EXTENDED/FORMATTED statement. In the table/partition output, we can find the statistics information inside the parameters, such as parameters:{numFiles=1, COLUMN_STATS_ACCURATE=true, transient_lastDdlTime=1417726247, numRows=4, totalSize=227, rawDataSize=223}). The following is an example:

    jdbc:hive2://> DESCRIBE EXTENDED employee_partitioned
    . . . . . . .> PARTITION(year=2014, month=12);

    jdbc:hive2://> DESCRIBE EXTENDED employee;
    …
    parameters:{numFiles=1, COLUMN_STATS_ACCURATE=true, transient_lastDdlTime=1417726247, numRows=4, totalSize=227, rawDataSize=223}).

    jdbc:hive2://> DESCRIBE FORMATTED employee.name;
    +--------+---------+---+---+---------+--------------+-----------+-----------+
    |col_name|data_type|min|max|num_nulls|distinct_count|avg_col_len|max_col_len|
    +--------+---------+---+---+---------+--------------+-----------+-----------+
    | name   | string  |   |   | 0       | 5            | 5.6       | 7         |
    +--------+---------+---+---+---------+--------------+-----------+-----------+
    +---------+----------+-----------------+
    |num_trues|num_falses| comment         |
    +---------+----------+-----------------+
    |         |          |from deserializer|
    +---------+----------+-----------------+
    3 rows selected (0.116 seconds)

Hive statistics are persisted in the metastore to avoid computing them every time. For newly created tables and/or partitions, statistics are automatically computed by default if we enable the following setting:

    jdbc:hive2://> SET hive.stats.autogather=true;

Hive logs

Logs provide useful information to find out how a Hive query/job runs. By checking the Hive logs, we can identify runtime problems and issues that may cause bad performance. There are two types of logs available in Hive: system log and job log.

The system log contains the Hive running status and issues. It is configured in {HIVE_HOME}/conf/hive-log4j.properties. The following three lines for the Hive log can be found there:

    hive.root.logger=WARN,DRFA
    hive.log.dir=/tmp/${user.name}
    hive.log.file=hive.log

To change the logging level, we can either modify the preceding lines in hive-log4j.properties (applies to all users) or set it from the Hive CLI (only applies to the current user and current session) as follows:

    hive --hiveconf hive.root.logger=DEBUG,console

The job log contains Hive query information and is saved at the same place, /tmp/${user.name}, by default as one file for each Hive user session.
We can override it in hive-site.xml with the hive.querylog.location property. If a Hive query generates MapReduce jobs, those logs can also be viewed through the Hadoop JobTracker Web UI.

Job and query optimization

Job and query optimization covers techniques to improve performance in the areas of job-running mode, JVM reuse, parallel job execution, and query optimization for JOIN.

Local mode

Hadoop can run in standalone, pseudo-distributed, and fully distributed mode. Most of the time, we need to configure Hadoop to run in fully distributed mode. When the data to process is small, starting distributed data processing is an overhead, since launching a job in fully distributed mode takes more time than processing it. Since Hive 0.7.0, Hive supports automatic conversion of a job to run in local mode with the following settings:

    jdbc:hive2://> SET hive.exec.mode.local.auto=true; --default false
    jdbc:hive2://> SET hive.exec.mode.local.auto.inputbytes.max=50000000;
    jdbc:hive2://> SET hive.exec.mode.local.auto.input.files.max=5; --default 4

A job must satisfy the following conditions to run in the local mode:

  • The total input size of the job is lower than hive.exec.mode.local.auto.inputbytes.max
  • The total number of map tasks is less than hive.exec.mode.local.auto.input.files.max
  • The total number of reduce tasks required is 1 or 0

JVM reuse

By default, Hadoop launches a new JVM for each map or reduce task and runs the map or reduce tasks in parallel. When a map or reduce task is lightweight and runs only for a few seconds, the JVM startup process could be a significant overhead. The MapReduce framework (version 1 only, not YARN) has an option to reuse a JVM by sharing it to run mappers/reducers serially instead of in parallel. JVM reuse applies to map or reduce tasks in the same job. Tasks from different jobs will always run in a separate JVM.
To enable the reuse, we can set the maximum number of tasks for a single job for JVM reuse using the mapred.job.reuse.jvm.num.tasks property. Its default value is 1:

    jdbc:hive2://> SET mapred.job.reuse.jvm.num.tasks=5;

We can also set the value to -1 to indicate that all the tasks for a job will run in the same JVM.

Parallel execution

Hive queries are commonly translated into a number of stages that are executed in sequence by default. These stages are not always dependent on each other. When they are independent, they can run in parallel to reduce the overall job running time. We can enable this feature with the following settings:

    jdbc:hive2://> SET hive.exec.parallel=true; -- default false
    jdbc:hive2://> SET hive.exec.parallel.thread.number=16;
    -- default 8, it defines the max number for running in parallel

Parallel execution will increase the cluster utilization. If the utilization of a cluster is already very high, parallel execution will not help much in terms of overall performance.

Join optimization

Here, we'll briefly review the key settings for join improvement.

Common join

The common join is also called reduce side join. It is the basic join in Hive and works most of the time. For common joins, we need to make sure the big table is on the right-most side or is specified by a hint, as follows:

    /*+ STREAMTABLE(stream_table_name) */

Map join

Map join is used when one of the join tables is small enough to fit in memory, so it is very fast but limited. Since Hive 0.7.0, Hive can convert to a map join automatically with the following settings:

    jdbc:hive2://> SET hive.auto.convert.join=true; --default false
    jdbc:hive2://> SET hive.mapjoin.smalltable.filesize=600000000;
    --default 25M
    jdbc:hive2://> SET hive.auto.convert.join.noconditionaltask=true;
    --default false. Set to true so that map join hint is not needed
    jdbc:hive2://> SET hive.auto.convert.join.noconditionaltask.size=10000000;
    --The default value controls the size of table to fit in memory

Once autoconvert is enabled, Hive automatically checks the file size of the smaller table. If it is bigger than the value specified by hive.mapjoin.smalltable.filesize, Hive uses a common join. If the file size is smaller than this threshold, Hive tries to convert the common join into a map join. Once autoconvert join is enabled, there is no need to provide map join hints in the query.

Bucket map join

Bucket map join is a special type of map join applied on bucket tables. To enable bucket map join, we need to enable the following settings:

    jdbc:hive2://> SET hive.auto.convert.join=true; --default false
    jdbc:hive2://> SET hive.optimize.bucketmapjoin=true; --default false

In bucket map join, all the join tables must be bucket tables and must join on the bucket columns. In addition, the number of buckets in the bigger table must be a multiple of the number of buckets in the small tables.

Sort merge bucket (SMB) join

SMB is the join performed on bucket tables that have the same sorted, bucket, and join condition columns. It reads data from both bucket tables and performs common joins (map and reduce triggered) on the bucket tables. We need to enable the following properties to use SMB:

    jdbc:hive2://> SET hive.input.format=
    . . . . . . .> org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
    jdbc:hive2://> SET hive.auto.convert.sortmerge.join=true;
    jdbc:hive2://> SET hive.optimize.bucketmapjoin=true;
    jdbc:hive2://> SET hive.optimize.bucketmapjoin.sortedmerge=true;
    jdbc:hive2://> SET hive.auto.convert.sortmerge.join.noconditionaltask=true;

Sort merge bucket map (SMBM) join

SMBM join is a special bucket join that triggers a map-side join only. Unlike map join, it avoids caching all rows in memory.
To perform SMBM joins, the join tables must have the same bucket, sort, and join condition columns. To enable such joins, we need to enable the following settings:

    jdbc:hive2://> SET hive.auto.convert.join=true;
    jdbc:hive2://> SET hive.auto.convert.sortmerge.join=true;
    jdbc:hive2://> SET hive.optimize.bucketmapjoin=true;
    jdbc:hive2://> SET hive.optimize.bucketmapjoin.sortedmerge=true;
    jdbc:hive2://> SET hive.auto.convert.sortmerge.join.noconditionaltask=true;
    jdbc:hive2://> SET hive.auto.convert.sortmerge.join.bigtable.selection.policy=org.apache.hadoop.hive.ql.optimizer.TableSizeBasedBigTableSelectorForAutoSMJ;

Skew join

When working with data that has a highly uneven distribution, data skew can occur in such a way that a small number of compute nodes must handle the bulk of the computation. The following settings tell Hive to optimize properly if data skew happens:

    jdbc:hive2://> SET hive.optimize.skewjoin=true;
    --If there is data skew in join, set it to true. Default is false.
    jdbc:hive2://> SET hive.skewjoin.key=100000;
    --This is the default value. If the number of keys is bigger than
    --this, the new keys will be sent to the other unused reducers.

Skewed data can occur in GROUP BY data too. To optimize it, we need the following setting to enable skew data optimization in the GROUP BY result:

    SET hive.groupby.skewindata=true;

Once configured, Hive will first trigger an additional MapReduce job whose map output is randomly distributed to the reducers to avoid data skew.

For more information about Hive join optimization, please refer to the Apache Hive wiki available at https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization and https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization.

Summary

In this article, we first covered how to identify performance bottlenecks using the EXPLAIN and ANALYZE statements. Then, we discussed job and query optimization in Hive.
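Two of the rules stated earlier in this article lend themselves to a small worked check: the three conditions a job must satisfy to run in local mode, and the bucket-count requirement for a bucket map join. The following Python sketch simply restates those rules as predicates. It is not Hive's implementation, and the names JobProfile, can_run_locally, and buckets_compatible are hypothetical. The default thresholds are the example values set in the Local mode section (50000000 bytes and 5 files), not Hive's built-in defaults.

```python
from dataclasses import dataclass


@dataclass
class JobProfile:
    """Hypothetical summary of a job, used only for this illustration."""
    input_bytes: int
    map_tasks: int
    reduce_tasks: int


def can_run_locally(job, inputbytes_max=50_000_000, input_files_max=5):
    """Restate the article's three local-mode conditions as one predicate."""
    return (
        job.input_bytes < inputbytes_max      # total input size below the byte limit
        and job.map_tasks < input_files_max   # number of map tasks below the file limit
        and job.reduce_tasks <= 1             # reduce tasks required is 1 or 0
    )


def buckets_compatible(big_table_buckets, small_table_buckets):
    """Bucket map join rule: the bigger table's bucket count must be a
    multiple of the smaller table's bucket count."""
    return small_table_buckets > 0 and big_table_buckets % small_table_buckets == 0


if __name__ == "__main__":
    small_job = JobProfile(input_bytes=227, map_tasks=1, reduce_tasks=1)
    large_job = JobProfile(input_bytes=60_000_000, map_tasks=1, reduce_tasks=1)
    print(can_run_locally(small_job))   # True
    print(can_run_locally(large_job))   # False
    print(buckets_compatible(32, 8))    # True
    print(buckets_compatible(30, 8))    # False
```

Writing the conditions out this way makes the strict comparisons visible: a job with exactly as many map tasks as the file limit does not qualify under the wording used in this article.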

Packt
03 Mar 2015
14 min read

Introducing Splunk

In this article by Betsy Page Sigman, author of the book Splunk Essentials, we introduce Splunk. Splunk, whose name was inspired by the process of exploring caves, or spelunking, helps analysts, operators, programmers, and many others explore data from their organizations by obtaining, analyzing, and reporting on it. This multinational company, cofounded by Michael Baum, Rob Das, and Erik Swan, has a core product called Splunk Enterprise. This manages searches, inserts, deletes, and filters, and analyzes big data that is generated by machines, as well as other types of data. They also have a free version that has most of the capabilities of Splunk Enterprise and is an excellent learning tool.

Understanding events, event types, and fields in Splunk

An understanding of events and event types is important before going further.

Events

In Splunk, an event is not just one of the many local user meetings that are set up between developers to help each other out (although those can be very useful), but also refers to a record of one activity that is recorded in a log file. Each event usually has:

  • A timestamp indicating the date and exact time the event was created
  • Information about what happened on the system that is being tracked

Event types

An event type is a way to allow users to categorize similar events. It is a field defined by the user. You can define an event type in several ways, and the easiest way is by using the SplunkWeb interface.

One common reason for setting up an event type is to examine why a system has failed. Logins are often problematic for systems, and a search for failed logins can help pinpoint problems. For an interesting example of how to save a search on failed logins as an event type, visit http://docs.splunk.com/Documentation/Splunk/6.1.3/Knowledge/ClassifyAndGroupSimilarEvents#Save_a_search_as_a_new_event_type.

Why are events and event types so important in Splunk?
Because without events, there would be nothing to search, of course. And event types allow us to make meaningful searches easily and quickly according to our needs, as we'll see later.

Sourcetypes

Sourcetypes are also important to understand, as they help define the rules for an event. A sourcetype is one of the default fields that Splunk assigns to data as it comes into the system. It determines what type of data it is so that Splunk can format it appropriately as it indexes it. This also allows the user who wants to search the data to easily categorize it.

Some of the common sourcetypes are listed as follows:

  • access_combined, for NCSA combined format HTTP web server logs
  • apache_error, for standard Apache web server error logs
  • cisco_syslog, for the standard syslog produced by Cisco network devices (including PIX firewalls, routers, and ACS), usually via remote syslog to a central log host
  • websphere_core, a core file export from WebSphere

(Source: http://docs.splunk.com/Documentation/Splunk/latest/Data/Whysourcetypesmatter)

Fields

Each event in Splunk is associated with a number of fields. The core fields of host, source, sourcetype, and timestamp are key to Splunk. These fields are extracted from events at multiple points in the data processing pipeline that Splunk uses, and each of these fields includes a name and a value. The name describes the field (such as the userid) and the value says what that field's value is (susansmith, for example). Some of these fields are default fields that are given because of where the event came from or what it is. When data is processed by Splunk, and when it is indexed or searched, it uses these fields. For indexing, the default fields added include those of host, source, and sourcetype. When searching, Splunk is able to select from a bevy of fields that can either be defined by the user or are very basic, such as an action that results in a purchase (for a website event).
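The idea that every field is a name paired with a value, with host, source, and sourcetype added by default, can be pictured with a toy model. The following Python sketch is only an illustration of that idea and says nothing about how Splunk extracts fields internally. The function name extract_fields, the key=value event format, and the sample host and source values are all assumptions made for this example; userid=susansmith and the access_combined sourcetype are taken from the text above.

```python
import re


def extract_fields(raw_event, host, source, sourcetype):
    """Toy model: start from the default fields, then add any name=value
    pairs found in the raw event text."""
    fields = {"host": host, "source": source, "sourcetype": sourcetype}
    for name, value in re.findall(r"(\w+)=(\S+)", raw_event):
        # Keep the default fields if the raw text reuses one of their names.
        fields.setdefault(name, value)
    return fields


if __name__ == "__main__":
    event = extract_fields(
        "userid=susansmith action=purchase",
        host="www1",                    # hypothetical host name
        source="/var/log/access.log",   # hypothetical source path
        sourcetype="access_combined",
    )
    print(event["userid"])      # susansmith
    print(event["sourcetype"])  # access_combined
```

A search for a particular user then amounts to filtering events on the value stored under the userid name, which is the sense in which fields drive searching.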
Fields are essential for doing the basic work of Splunk, that is, indexing and searching.

Getting data into Splunk

It's time to spring into action now and input some data into Splunk. Adding data is simple and quick. In this section, we will use some data and tutorials created by Splunk to learn how to add data:

First, to obtain your data, visit the tutorial data page at http://docs.splunk.com/Documentation/Splunk/6.1.5/SearchTutorial/GetthetutorialdataintoSplunk. There, download the file tutorialdata.zip. Note that this will be a fresh dataset that has been collected over the last 7 days. Download it, but don't extract the data from it just yet.

Next, log in to Splunk, using admin as the username along with your password.

Once logged in, you will notice the Add Data button toward the upper-right corner of your screen, as shown in the following screenshot. Click on this button:

Button to Add Data

Once you have clicked on this button, you'll see a screen similar to the following screenshot:

Add Data to Splunk by Choosing a Data Type or Data Source

Notice here the different types of data that you can select, as well as the different data sources. Since the data we're going to use is a file, under Or Choose a Data Source, click on From files and directories.

Once you have clicked on this, click on the radio button next to Skip preview, as indicated in the following screenshot, since you don't need to preview the data now. Then click on Continue:

Preview data

As shown in the next screenshot, click on Upload and index a file, find the tutorialdata.zip file you just downloaded (it is probably in your Downloads folder), and then click on More settings, filling it in as shown in the following screenshot.
(Note that you will need to select Segment in path under Host and type 1 under Segment Number.) Click on Save when you are done:

Can specify source, additional settings, and source type

Following this, you should see a screen similar to the following screenshot. Click on Start Searching; we will look at the data now:

You should see this if your data has been successfully indexed into Splunk.

You will now see a screen similar to the following screenshot. Notice that the number of events you have will be different, as will the time of the earliest event. At this point, click on Data Summary:

The Search screen

You should see the Data Summary screen, as in the following screenshot. However, note that the hosts shown here will not be the same as the ones you get. Take a quick look at what is on the Sources tab and the Sourcetypes tab. Then find the most recent data (in this case 127.0.0.1) and click on it.

Data Summary, where you can see Hosts, Sources, and Sourcetypes

After clicking on the most recent data, which in this case is bps-T341s, look at the events contained there. Later, when we use streaming data, we can see how the events at the top of this list change rapidly. Here, you will see a listing of events, similar to those shown in the following screenshot:

Events lists for the host value

You can click on the Splunk logo in the upper-left corner of the web page to return to the home page. Under Administrator at the top-right of the page, click on Logout.

Searching Twitter data

We will start here by doing a simple search of our Twitter index, which is automatically created by the app once you have enabled Twitter input (as explained previously). In our earlier searches, we used the default index (which the tutorial data was downloaded to), so we didn't have to specify the index we wanted to use. Here, we will use just the Twitter index, so we need to specify that in the search.
A simple search

Imagine that we wanted to search for tweets containing the word coffee. We could place the following code in the search bar:

    index=twitter text=*coffee*

The preceding code searches only your Twitter index and finds all the places where the word coffee is mentioned. The asterisks are included before and after the text "coffee" because otherwise we would only get events where just "coffee" was tweeted, a rather rare occurrence, we expect. In fact, when we searched our indexed Twitter data without the asterisks around coffee, we got no results. (Note that the search on the text field is not case sensitive, so tweets with either "coffee" or "Coffee" will be included in the search results.)

Examining the Twitter event

Before going further, it is useful to stop and closely examine the events that are collected as part of the search. The sample tweet shown in the following screenshot shows the large number of fields that are part of each tweet. The > was clicked to expand the event:

A Twitter event

There are several items to look closely at here:
• _time: Splunk assigns a timestamp for every event. This is done in UTC (Coordinated Universal Time) time format.
• contributors: The value for this field is null, as are the values of many Twitter fields.
• retweeted_status: Notice the {+} here; in the following event list, you will see there are a number of fields associated with this, which can be seen when the + is selected and the list is expanded. This is the case wherever you see a {+} in a list of fields:

Various retweet fields

In addition to those shown previously, there are many other fields associated with a tweet. The 140 character (maximum) text field that most people consider to be the tweet is actually a small part of the actual data collected.

The implied AND

If you want to search on more than one term, there is no need to add AND as it is already implied.
If, for example, you want to search for all tweets that include both the text "coffee" and the text "morning", then use: index=twitter text=*coffee* text=*morning* If you don't specify text= for the second term and just put *morning*, Splunk assumes that you want to search for *morning* in any field. Therefore, you could get that word in another field in an event. This isn't very likely in this case, although coffee could conceivably be part of a user's name, such as "coffeelover". But if you were searching for other text strings, such as a computer term like log or error, such terms could be found in a number of fields. So specifying the field you are interested in would be very important. The need to specify OR Unlike AND, you must always specify the word OR. For example, to obtain all events that mention either coffee or morning, enter: index=twitter text=*coffee* OR text=*morning* Finding other words used Sometimes you might want to find out what other words are used in tweets about coffee. You can do that with the following search: index=twitter text=*coffee* | makemv text | mvexpand text | top 30 text This search first searches for the word "coffee" in a text field, then creates a multivalued field from the tweet, and then expands it so that each word is treated as a separate piece of text. Then it takes the top 30 words that it finds. You might be asking yourself how you would use this kind of information. This type of analysis would be of interest to a marketer, who might want to use words that appear to be associated with coffee in composing the script for an advertisement. The following screenshot shows the results that appear (1 of 2 pages). From this search, we can see that the words love, good, and cold might be words worth considering: Search of top 30 text fields found with *coffee* When you do a search like this, you will notice that there are a lot of filler words (a, to, for, and so on) that appear. You can do two things to remedy this. 
You can increase the limit for top words so that you can see more of the words that come up, or you can rerun the search so that it excludes some of the filler words.

Note also that "Coffee" (with a capital C) is listed (on the unshown second page) separately from "coffee". The reason for this is that while the search is not case sensitive (thus both "coffee" and "Coffee" are picked up when you search on "coffee"), putting the text field through the makemv and mvexpand commands ends up distinguishing on the basis of case.

We could rerun the search, excluding some of the filler words, using the code shown here:

    index=twitter text=*coffee* | makemv text | mvexpand text | search NOT text="RT" AND NOT text="a" AND NOT text="to" AND NOT text="the" | top 30 text

Using a lookup table

Sometimes it is useful to use a lookup file to avoid having to use repetitive code. It would help us to have a list of all the small words that might be found often in a tweet just by the nature of each word's frequent use in language, so that we might eliminate them from our quest to find words that would be relevant for use in the creation of advertising. If we had a file of such small words, we could use a command indicating not to use any of these more common, irrelevant words when listing the top 30 words associated with our search topic of interest. Thus, for our search for words associated with the text "coffee", we would be interested in words like "dark", "flavorful", and "strong", but not words like "a", "the", and "then".

We can do this using a lookup command. There are three types of lookup commands, which are presented in the following table:

Command: lookup
Description: Matches a value of one field with a value of another, based on a .csv file with the two fields. Consider a lookup table named lutable that contains fields for machine_name and owner. Consider what happens when the following code snippet is used after a preceding search (indicated by . . . |):
    . . . | lookup lutable owner
Splunk will use the lookup table to match the owner's name with its machine_name and add the machine_name to each event.

Command: inputlookup
Description: All fields in the .csv file are returned as results. If the following code snippet is used, both machine_name and owner would be searched:
    . . . | inputlookup lutable

Command: outputlookup
Description: This command outputs search results to a lookup table. The following code saves the results of the preceding search directly into a table it creates:
    . . . | outputlookup newtable.csv

The command we will use here is inputlookup, because we want to reference a .csv file that we can create and that will include the words we want to filter out as we seek to find possible advertising words associated with coffee. Let's call the .csv file filtered_words.csv, and give it just a single text field, containing words like "is", "the", and "then". Let's rewrite the search to look like the following code:

    index=twitter text=*coffee* | makemv text | mvexpand text | search NOT [| inputlookup filtered_words.csv | fields text] | top 30 text

Note the leading pipe inside the square brackets: a subsearch implicitly begins with the search command, so a generating command such as inputlookup has to be introduced with a pipe there.

Using the preceding code, Splunk will search our Twitter index for *coffee*, and then expand the text field so that individual words are separated out. Then it will look for words that do NOT match any of the words in our filtered_words.csv file, and finally output the top 30 most frequently found words among those. As you can see, the lookup table can be very useful. To learn more about Splunk lookup tables, go to http://docs.splunk.com/Documentation/Splunk/6.1.5/SearchReference/Lookup.

Summary

In this article, we have learned more about how Splunk handles events and fields, how to get data into Splunk, and how to search it. Splunk Enterprise Software, or Splunk, is an extremely powerful tool for searching, exploring, and visualizing data of all types. Splunk is becoming increasingly popular, as more and more businesses, both large and small, discover its ease and usefulness.
Analysts, managers, students, and others can quickly learn how to use the data from their systems, networks, web traffic, and social media to make attractive and informative reports. This is a straightforward, practical, and quick introduction to Splunk that should have you making reports and gaining insights from your data in no time. Resources for Article: Further resources on this subject: Lookups [article] Working with Apps in Splunk [article] Loading data, creating an app, and adding dashboards and reports in Splunk [article]

Packt
03 Mar 2015
18 min read

Time Travelling with Spring

This article by Sujoy Acharya, the author of the book Mockito for Spring, delves into the details of the Spring 4 release. Spring 4.0 is the latest, Java 8-enabled release of the Spring Framework. In this article, we'll discover the major changes in the Spring 4.x release and four important features of the Spring 4 framework. We will cover the following topics in depth:
• @RestController
• AsyncRestTemplate
• Async tasks
• Caching

Discovering the new Spring release

This section deals with the new features and enhancements in Spring Framework 4.0. The following are the features:
• Spring 4 supports Java 8 features such as lambda expressions and java.time.
• Spring 4 supports JDK 6 as the minimum.
• Deprecated packages and many deprecated classes and methods have been removed.
• Java Enterprise Edition 6 or 7 is the base of Spring 4, which is based on JPA 2 and Servlet 3.0.
• Bean configuration using the Groovy DSL is supported in Spring Framework 4.0.
• Hibernate 4.3 is supported by Spring 4.
• Custom annotations are supported in Spring 4.
• Autowired lists and arrays can be ordered. The @Order annotation and the Ordered interface are supported.
• The @Lazy annotation can now be used on injection points as well as on the @Bean definitions.
• For REST applications, Spring 4 provides a new @RestController annotation. We will discuss this in detail in the following section.
• The AsyncRestTemplate class is added for asynchronous REST client development.
• Different time zones are supported in Spring 4.0.
• New spring-websocket and spring-messaging modules have been added.
• The SocketUtils class is added to find free TCP and UDP server ports on localhost.
• All the mocks under the org.springframework.mock.web package are now based on the Servlet 3.0 specification.
• Spring supports JCache annotations and new improvements have been made in caching.
• The @Conditional annotation has been added to conditionally enable or disable an @Configuration class or even individual @Bean methods.
• In the test module, SQL script execution can now be configured declaratively via the new @Sql and @SqlConfig annotations on a per-class or per-method basis.

You can visit the Spring Framework reference at http://docs.spring.io/spring/docs/4.1.2.BUILD-SNAPSHOT/spring-framework-reference/htmlsingle/#spring-whats-new for more details. Also, you can watch a video at http://zeroturnaround.com/rebellabs/spring-4-on-java-8-geekout-2013-video/ for more details on the changes in Spring 4.

Working with asynchronous tasks

Java has had a feature called Future since Java 5, as part of the java.util.concurrent package. Futures let you retrieve the result of an asynchronous operation at a later time. A FutureTask can be run on a separate thread, which allows the caller to carry on with other work while the task executes. Spring provides an @Async annotation to make this easier to use. We'll explore Java's Future feature and Spring's @Async declarative approach:

Create a project, TimeTravellingWithSpring, and add a package, com.packt.async. We'll exercise a bank's use case, where an automated job will run and settle loan accounts. It will also find all the defaulters who haven't paid the loan EMI for a month and then send an SMS to their number. The job takes time to process thousands of accounts, so it will be good if we can send SMSes asynchronously to minimize the burden of the job.

We'll create a service class to represent the job, as shown in the following code snippet:

    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.Future;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.stereotype.Service;

    @Service
    public class AccountJob {
        @Autowired
        private SMSTask smsTask;

        public void process() throws InterruptedException, ExecutionException {
            System.out.println("Going to find defaulters... ");
            // Returns immediately; the SMS task runs on another thread
            Future<Boolean> asyncResult = smsTask.send("1", "2", "3");
            System.out.println("Defaulter Job Complete. SMS will be sent to all defaulter");
            // Blocks until the asynchronous task has finished
            Boolean result = asyncResult.get();
            System.out.println("Was SMS sent? " + result);
        }
    }

The job class autowires an SMSTask class and invokes the send method with phone numbers. The send method is executed asynchronously and a Future is returned. When the job calls the get() method on the Future, the call blocks until the asynchronous task has finished and then returns the result. If the task itself failed with an exception, get() throws an ExecutionException that wraps the cause. We can use the timeout version of the get() method to avoid waiting indefinitely.

Create the SMSTask class in the com.packt.async package with the following details:

    import java.util.concurrent.Future;
    import org.springframework.scheduling.annotation.Async;
    import org.springframework.scheduling.annotation.AsyncResult;
    import org.springframework.stereotype.Component;

    @Component
    public class SMSTask {
        @Async
        public Future<Boolean> send(String... numbers) {
            System.out.println("Selecting SMS format ");
            try {
                Thread.sleep(2000); // simulates the time taken to send the messages
            } catch (InterruptedException e) {
                e.printStackTrace();
                return new AsyncResult<>(false);
            }
            System.out.println("Async SMS send task is Complete!!!");
            return new AsyncResult<>(true);
        }
    }

Note that the method returns Future, and the method is annotated with @Async to signify asynchronous processing. For the annotation to take effect, asynchronous execution must be enabled in the application context, for example with <task:annotation-driven/> or @EnableAsync.

Create a JUnit test to verify asynchronous processing:

    @RunWith(SpringJUnit4ClassRunner.class)
    @ContextConfiguration(locations="classpath:com/packt/async/applicationContext.xml")
    public class AsyncTaskExecutionTest {
        @Autowired
        ApplicationContext context;

        @Test
        public void jobTest() throws Exception {
            AccountJob job = context.getBean(AccountJob.class);
            job.process();
        }
    }

The job bean is retrieved from the application context and then the process method is called. When we execute the test, the following output is displayed:

    Going to find defaulters...
    Defaulter Job Complete. SMS will be sent to all defaulter
    Selecting SMS format
    Async SMS send task is Complete!!!
    Was SMS sent? true

During execution, you will notice that the async task completes after a delay of 2 seconds, as the SMSTask class sleeps for 2 seconds.

Exploring @RestController

JAX-RS provides the functionality for Representational State Transfer (RESTful) web services. REST is well-suited for basic, ad hoc integration scenarios.
Spring MVC offers controllers to create RESTful web services. In Spring MVC 3.0, we need to explicitly annotate a class with the @Controller annotation in order to specify a controller and annotate each and every method with @ResponseBody to serve JSON, XML, or a custom media type. With the advent of the Spring 4.0 @RestController stereotype annotation, we can combine @ResponseBody and @Controller.

The following example will demonstrate the usage of @RestController:

Create a dynamic web project, RESTfulWeb.

Modify the web.xml file and add a configuration to intercept requests with a Spring DispatcherServlet:

    <web-app xmlns="http://java.sun.com/xml/ns/javaee"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/web-app_3_0.xsd"
             id="WebApp_ID" version="3.0">
      <display-name>RESTfulWeb</display-name>
      <servlet>
        <servlet-name>dispatcher</servlet-name>
        <servlet-class>org.springframework.web.servlet.DispatcherServlet</servlet-class>
        <load-on-startup>1</load-on-startup>
      </servlet>
      <servlet-mapping>
        <servlet-name>dispatcher</servlet-name>
        <url-pattern>/</url-pattern>
      </servlet-mapping>
      <context-param>
        <param-name>contextConfigLocation</param-name>
        <param-value>/WEB-INF/dispatcher-servlet.xml</param-value>
      </context-param>
    </web-app>

The DispatcherServlet expects a configuration file with the naming convention [servlet-name]-servlet.xml. Create an application context XML, dispatcher-servlet.xml. We'll use annotations to configure Spring beans, so we need to tell the Spring container to scan the Java package in order to create the beans. Add the following lines to the application context in order to instruct the container to scan the com.packt.controller package:

    <context:component-scan base-package="com.packt.controller" />
    <mvc:annotation-driven />

We need a REST controller class to handle the requests and generate a JSON output. Go to the com.packt.controller package and add a SpringService controller class.
To configure the class as a REST controller, we need to annotate it with the @RestController annotation. The following code snippet represents the class:

    import java.util.HashSet;
    import java.util.Set;
    import org.springframework.web.bind.annotation.PathVariable;
    import org.springframework.web.bind.annotation.RequestMapping;
    import org.springframework.web.bind.annotation.RequestMethod;
    import org.springframework.web.bind.annotation.RestController;

    @RestController
    @RequestMapping("/hello")
    public class SpringService {
        // Names greeted so far
        private Set<String> names = new HashSet<String>();

        @RequestMapping(value = "/{name}", method = RequestMethod.GET)
        public String displayMsg(@PathVariable String name) {
            String result = "Welcome " + name;
            names.add(name);
            return result;
        }

        @RequestMapping(value = "/all/", method = RequestMethod.GET)
        public String anotherMsg() {
            StringBuilder result = new StringBuilder("We greeted so far ");
            for (String name : names) {
                result.append(name).append(", ");
            }
            return result.toString();
        }
    }

We annotated the class with @RequestMapping("/hello"). This means that the SpringService class will cater for the requests with the http://{site}/{context}/hello URL pattern, or since we are running the app in localhost, the URL can be http://localhost:8080/RESTfulWeb/hello.

The displayMsg method is annotated with @RequestMapping(value = "/{name}", method = RequestMethod.GET). So, the method will handle all HTTP GET requests with the URL pattern /hello/{name}. The name can be any String, such as /hello/xyz or /hello/john. In turn, the method stores the name in a Set for later use and returns a greeting message, Welcome {name}.

The anotherMsg method is annotated with @RequestMapping(value = "/all/", method = RequestMethod.GET), which means that the method accepts all the requests with the http://{site}/{context}/hello/all/ URL pattern. This method builds a list of all users who visited the /hello/{name} URL. Remember, the displayMsg method stores the names in a Set; this method iterates over the Set and builds the list of names that visited the /hello/{name} URL.

There is some room for confusion though: what will happen if you enter the /hello/all URL in the browser?
When we pass only a String literal after /hello/, the displayMsg method handles it, so you will be greeted with Welcome all. However, if you type /hello/all/ instead (note that we added a slash after all), the more specific /all/ mapping applies, so the second method will handle the request and show you the list of users who visited the first URL.

When we run the application and access the /hello/{name} URL, the greeting for that name is displayed. When we access http://localhost:8080/RESTfulWeb/hello/all/, the list of names greeted so far is displayed.

Therefore, our RESTful application is ready for use, but just remember that in the real world, you need to secure the URLs against unauthorized access. In web service development, security plays a key role. You can read the Spring Security reference manual for additional information.

Learning AsyncRestTemplate

We live in a small, wonderful world where everybody is interconnected and impatient! We are interconnected through technology and applications, such as social networks, Internet banking, telephones, chats, and so on. Likewise, our applications are interconnected; often, an application housed in India may need to query an external service hosted in Philadelphia to get some significant information. We are impatient as we expect everything to be done in seconds; we get frustrated when we make an HTTP call to a remote service, and this blocks the processing until the remote response is back. We cannot finish everything in milliseconds or nanoseconds, but we can process long-running tasks asynchronously or in a separate thread, allowing the user to work on something else.

To handle RESTful web service calls asynchronously, Spring offers two useful classes: AsyncRestTemplate and ListenableFuture. We can make an async call using the template and get a Future back, then continue with other processing, and finally ask the Future for the result.
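The call, continue, and collect pattern described above can be sketched with plain JDK classes. This is a minimal illustration under stated assumptions, not Spring's implementation: AsyncCallSketch, slowRemoteCall, and callAsync are hypothetical names, the 200 ms sleep stands in for a remote HTTP call, and CompletableFuture is used in place of AsyncRestTemplate and ListenableFuture so that the example runs without a server.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

// Hypothetical illustration only: slowRemoteCall stands in for a remote REST call.
class AsyncCallSketch {

    // Simulates a slow remote service; the name and the 200 ms delay are assumptions.
    static String slowRemoteCall(String name) {
        try {
            Thread.sleep(200);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return "Welcome " + name;
    }

    // Starts the call on another thread and returns immediately with a future.
    static CompletableFuture<String> callAsync(String name) {
        return CompletableFuture.supplyAsync(() -> slowRemoteCall(name));
    }

    public static void main(String[] args) throws InterruptedException, ExecutionException {
        Future<String> future = callAsync("reader");
        // The caller is free to do other work while the call is in flight
        System.out.println("Request sent, doing other work...");
        // get() blocks until the result is available
        System.out.println(future.get());
    }
}
```

The caller thread only blocks at the point where it actually needs the value, which is the same idea the template-based client in this section relies on.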
This section builds an asynchronous RESTful client to query the RESTful web service we developed in the preceding section. The AsyncRestTemplate class defines an array of overloaded methods to access RESTful web services asynchronously. We'll explore the exchange and execute methods. The following are the steps to explore the template:

Create a package, com.packt.rest.template.

Add an AsyncRestTemplateTest JUnit test.

Create an exchange() test method and add the following lines:

    @Test
    public void exchange() {
        AsyncRestTemplate asyncRestTemplate = new AsyncRestTemplate();
        String url = "http://localhost:8080/RESTfulWeb/hello/all/";
        HttpMethod method = HttpMethod.GET;
        Class<String> responseType = String.class;
        HttpHeaders headers = new HttpHeaders();
        headers.setContentType(MediaType.TEXT_PLAIN);
        HttpEntity<String> requestEntity = new HttpEntity<String>("params", headers);
        ListenableFuture<ResponseEntity<String>> future =
            asyncRestTemplate.exchange(url, method, requestEntity, responseType);
        try {
            // waits for the result
            ResponseEntity<String> entity = future.get();
            // prints the body returned by the given URL
            System.out.println(entity.getBody());
        } catch (InterruptedException e) {
            e.printStackTrace();
        } catch (ExecutionException e) {
            e.printStackTrace();
        }
    }

The exchange() method has six overloaded versions. We used the one that takes a URL, an HttpMethod such as GET or POST, an HttpEntity that carries the request headers, and finally a response type class. We called the exchange method, which in turn called the execute method and returned a ListenableFuture. The ListenableFuture is the handle to our output; we invoked the get() method on the ListenableFuture to get the RESTful service call response. The ResponseEntity has the getBody, getClass, getHeaders, and getStatusCode methods for extracting the web service call response.
We invoked the http://localhost:8080/RESTfulWeb/hello/all/ URL and got back the list of names greeted so far as the response body.

Now, create an execute test method and add the following lines:

    @Test
    public void execute() {
        AsyncRestTemplate asyncTemp = new AsyncRestTemplate();
        String url = "http://localhost:8080/RESTfulWeb/hello/reader";
        HttpMethod method = HttpMethod.GET;
        HttpHeaders headers = new HttpHeaders();
        headers.setContentType(MediaType.TEXT_PLAIN);
        // Invoked just before the request is executed
        AsyncRequestCallback requestCallback = new AsyncRequestCallback() {
            @Override
            public void doWithRequest(AsyncClientHttpRequest request) throws IOException {
                System.out.println(request.getURI());
            }
        };
        // Turns the raw HTTP response into the value held by the future
        ResponseExtractor<String> responseExtractor = new ResponseExtractor<String>() {
            @Override
            public String extractData(ClientHttpResponse response) throws IOException {
                return response.getStatusText();
            }
        };
        Map<String, String> urlVariable = new HashMap<String, String>();
        ListenableFuture<String> future =
            asyncTemp.execute(url, method, requestCallback, responseExtractor, urlVariable);
        try {
            // waits for the result
            String result = future.get();
            System.out.println("Status =" + result);
        } catch (InterruptedException e) {
            e.printStackTrace();
        } catch (ExecutionException e) {
            e.printStackTrace();
        }
    }

The execute method has several variants. We invoke the one that takes a URL, an HttpMethod such as GET or POST, an AsyncRequestCallback that is invoked from the execute method just before the request is executed asynchronously, a ResponseExtractor to extract part of the response (such as the response body, status code, or headers), and a map of URL variables for URLs that take parameters. We invoked the execute method and received a future; since our ResponseExtractor extracts the status text, the future returns the response status text, which is OK for a 200 response, when we ask it for the result. In the AsyncRequestCallback, we printed the request URI; hence, the output first displays the request URI and then prints the response status.
The output shows the request URI followed by the status line.

Caching objects

Scalability is a major concern in web application development. Generally, most web traffic is focused on a small set of information, so only those records are queried very often. If we can cache these records, then the performance and scalability of the system can increase considerably. The Spring Framework provides support for adding caching into an existing Spring application. In this section, we'll work with EhCache, a widely used caching solution. Download the latest EhCache JAR from the Maven repository; the URL to download version 2.7.2 is http://mvnrepository.com/artifact/net.sf.ehcache/ehcache/2.7.2.

Spring provides two main annotations for caching: @Cacheable and @CacheEvict. These annotations allow methods to trigger cache population or cache eviction, respectively. The @Cacheable annotation is used to identify a cacheable method, which means that for an annotated method the result is stored in the cache. Therefore, on subsequent invocations (with the same arguments), the value in the cache is returned without actually executing the method. The cache abstraction also allows the eviction of entries to remove stale or unused data from the cache. The @CacheEvict annotation demarcates the methods that perform cache eviction, that is, methods that act as triggers to remove data from the cache.

The following are the steps to build a cacheable application with EhCache:

Create a serializable Employee POJO class in the com.packt.cache package to store the employee ID and name. The following is the class definition:

    import java.io.Serializable;

    public class Employee implements Serializable {
        private static final long serialVersionUID = 1L;
        private final String firstName, lastName, empId;

        public Employee(String empId, String fName, String lName) {
            this.firstName = fName;
            this.lastName = lName;
            this.empId = empId;
        }

        // Getter methods
        public String getEmpId() { return empId; }
        public String getFirstName() { return firstName; }
        public String getLastName() { return lastName; }
    }

Out of the box, Spring's cache abstraction includes storage implementations backed by the JDK ConcurrentMap and by the ehcache library.
To configure caching, we need to configure a manager in the application context. The org.springframework.cache.ehcache.EhCacheCacheManager class manages ehcache. Then, we need to define an ehcache bean with a configLocation attribute. The configLocation attribute defines the configuration resource. The ehcache-specific configuration is read from the resource ehcache.xml.

    <beans xmlns="http://www.springframework.org/schema/beans"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xmlns:context="http://www.springframework.org/schema/context"
           xmlns:cache="http://www.springframework.org/schema/cache"
           xmlns:p="http://www.springframework.org/schema/p"
           xsi:schemaLocation="
             http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans-4.1.xsd
             http://www.springframework.org/schema/cache http://www.springframework.org/schema/cache/spring-cache-4.1.xsd
             http://www.springframework.org/schema/context http://www.springframework.org/schema/context/spring-context-4.1.xsd">
      <context:component-scan base-package="com.packt.cache" />
      <cache:annotation-driven/>
      <bean id="cacheManager"
            class="org.springframework.cache.ehcache.EhCacheCacheManager"
            p:cacheManager-ref="ehcache"/>
      <bean id="ehcache"
            class="org.springframework.cache.ehcache.EhCacheManagerFactoryBean"
            p:configLocation="classpath:com/packt/cache/ehcache.xml"/>
    </beans>

The <cache:annotation-driven/> tag informs the Spring container that caching and eviction are driven by annotated methods. We defined a cacheManager bean and then defined an ehcache bean. The ehcache bean's configLocation points to an ehcache.xml file. We'll create the file next.

Create an XML file, ehcache.xml, under the com.packt.cache package and add the following cache configuration data:

    <ehcache>
      <diskStore path="java.io.tmpdir"/>
      <cache name="employee"
             maxElementsInMemory="100"
             eternal="false"
             timeToIdleSeconds="120"
             timeToLiveSeconds="120"
             overflowToDisk="true"
             maxElementsOnDisk="10000000"
             diskPersistent="false"
             diskExpiryThreadIntervalSeconds="120"
             memoryStoreEvictionPolicy="LRU"/>
    </ehcache>

The XML configures many things.
The cache is stored in memory, but memory has a limit, so we need to define maxElementsInMemory. EhCache needs to store data to disk when the number of elements in memory reaches that limit, and we provide diskStore for this purpose. The eviction policy is set to LRU (least recently used), but the most important thing is the cache name. The name employee will be used to access the cache configuration.

Now, create a service to store the Employee objects in a map (a ConcurrentHashMap). The following is the service:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import javax.annotation.PostConstruct;
    import org.springframework.cache.annotation.CacheEvict;
    import org.springframework.cache.annotation.Cacheable;
    import org.springframework.stereotype.Service;

    @Service
    public class EmployeeService {
        private final Map<String, Employee> employees = new ConcurrentHashMap<String, Employee>();

        @PostConstruct
        public void init() {
            saveEmployee(new Employee("101", "John", "Doe"));
            saveEmployee(new Employee("102", "Jack", "Russell"));
        }

        // The result is stored in the "employee" cache, keyed by employeeId
        @Cacheable("employee")
        public Employee getEmployee(final String employeeId) {
            System.out.println(String.format("Loading a employee with id of : %s", employeeId));
            return employees.get(employeeId);
        }

        // Removes the cached entry for this employee so that stale data is not served
        @CacheEvict(value = "employee", key = "#emp.empId")
        public void saveEmployee(final Employee emp) {
            System.out.println(String.format("Saving a emp with id of : %s", emp.getEmpId()));
            employees.put(emp.getEmpId(), emp);
        }
    }

The getEmployee method is a cacheable method; it uses the cache employee. When the getEmployee method is invoked more than once with the same employee ID, the object is returned from the cache instead of the original method being invoked. The saveEmployee method is a CacheEvict method; it removes the cached entry for the saved employee's ID.

Now, we'll examine caching. We'll call the getEmployee method twice; the first call will populate the cache and the subsequent call will be answered by the cache.
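The caching behaviour described above can be sketched without Spring or EhCache. The following is a minimal illustration of the idea only, not how Spring implements its annotations: CacheSketch, its two maps, and the loads counter are hypothetical names introduced here, get plays the role of a @Cacheable method, and save plays the role of a @CacheEvict method. Unlike a real cache, this sketch never expires entries and does not cache lookups for unknown IDs.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of what @Cacheable and @CacheEvict do conceptually.
class CacheSketch {
    // Stands in for the slow backing store
    private final Map<String, String> store = new ConcurrentHashMap<>();
    // Plays the role of the "employee" cache
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    // Counts how often the real lookup is executed
    final AtomicInteger loads = new AtomicInteger();

    // Like @CacheEvict: drop the cached entry for the key, then save the new value.
    void save(String id, String name) {
        cache.remove(id);
        store.put(id, name);
    }

    // Like @Cacheable: run the real lookup only on a cache miss.
    String get(String id) {
        return cache.computeIfAbsent(id, key -> {
            loads.incrementAndGet();
            return store.get(key);
        });
    }

    public static void main(String[] args) {
        CacheSketch service = new CacheSketch();
        service.save("101", "John Doe");
        service.get("101"); // miss: executes the lookup
        service.get("101"); // hit: served from the cache
        System.out.println("lookups executed = " + service.loads.get()); // prints 1
        service.save("101", "Johnny Doe"); // evicts the stale entry
        System.out.println(service.get("101")); // executes the lookup again
    }
}
```

Evicting on save is what keeps the second read after an update from returning the old value, which is why the service above pairs @Cacheable on the read path with @CacheEvict on the write path.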
Create a JUnit test, CacheConfiguration, and add the following lines:

    import org.junit.Test;
    import org.junit.runner.RunWith;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.context.ApplicationContext;
    import org.springframework.test.context.ContextConfiguration;
    import org.springframework.test.context.junit4.SpringJUnit4ClassRunner;

    @RunWith(SpringJUnit4ClassRunner.class)
    @ContextConfiguration(locations = "classpath:com/packt/cache/applicationContext.xml")
    public class CacheConfiguration {
        @Autowired
        ApplicationContext context;

        @Test
        public void jobTest() throws Exception {
            EmployeeService employeeService = context.getBean(EmployeeService.class);

            long time = System.currentTimeMillis();
            employeeService.getEmployee("101");
            System.out.println("time taken =" + (System.currentTimeMillis() - time));

            time = System.currentTimeMillis();
            employeeService.getEmployee("101");
            System.out.println("time taken to read from cache =" + (System.currentTimeMillis() - time));

            time = System.currentTimeMillis();
            employeeService.getEmployee("102");
            System.out.println("time taken =" + (System.currentTimeMillis() - time));

            time = System.currentTimeMillis();
            employeeService.getEmployee("102");
            System.out.println("time taken to read from cache =" + (System.currentTimeMillis() - time));

            employeeService.saveEmployee(new Employee("1000", "Sujoy", "Acharya"));

            time = System.currentTimeMillis();
            employeeService.getEmployee("1000");
            System.out.println("time taken =" + (System.currentTimeMillis() - time));

            time = System.currentTimeMillis();
            employeeService.getEmployee("1000");
            System.out.println("time taken to read from cache =" + (System.currentTimeMillis() - time));
        }
    }

Note that the getEmployee method is invoked twice for each employee, and we recorded the method execution time in milliseconds. You will find from the output that every second call is answered by the cache: the first call prints Loading a employee with id of : 101, while the next call doesn't print that message and only prints the time taken to execute. You will also find that the time taken for the cached objects is zero or less than the method invocation time.
[Screenshot: output of the CacheConfiguration test]

Summary

This article started with discovering the features of the new major Spring release 4.0, such as Java 8 support. Then, we picked four Spring 4 topics and explored them one by one. The @Async section showcased the execution of long-running methods asynchronously and provided an example of how to handle asynchronous processing. The @RestController section showed how the @RestController annotation eases RESTful web service development. The AsyncRestTemplate section explained the RESTful client code used to invoke a RESTful web service asynchronously. Caching is inevitable for a high-performance, scalable web application. The caching section explained the EhCache and Spring integration to achieve a high-availability caching solution.

Resources for Article:

Further resources on this subject:
Getting Started with Mockito [article]
Progressive Mockito [article]
Understanding outside-in [article]

Packt
03 Mar 2015
28 min read

Elasticsearch Administration

In this article by Rafał Kuć and Marek Rogoziński, authors of the book Mastering Elasticsearch, Second Edition, we will talk more about the Elasticsearch configuration and the new features introduced in Elasticsearch 1.0 and higher. By the end of this article, you will have learned about:

Configuring the discovery and recovery modules
Using the Cat API that allows a human-readable insight into the cluster status
The backup and restore functionality
Federated search

Discovery and recovery modules

When starting your Elasticsearch node, one of the first things that Elasticsearch does is look for a master node that has the same cluster name and is visible in the network. If a master node is found, the starting node joins the already formed cluster. If no master is found, then the node itself is selected as a master (of course, if the configuration allows such behavior). The process of forming a cluster and finding nodes is called discovery. The module responsible for discovery has two main purposes: electing a master and discovering new nodes within a cluster.

After the cluster is formed, a process called recovery is started. During the recovery process, Elasticsearch reads the metadata and the indices from the gateway, and prepares the shards that are stored there to be used. After the recovery of the primary shards is done, Elasticsearch should be ready for work and should continue with the recovery of all the replicas (if they are present).

In this section, we will take a deeper look at these two modules and discuss the configuration possibilities Elasticsearch gives us and the consequences of changing them.

Note that the information provided in the Discovery and recovery modules section is an extension of what we already wrote in Elasticsearch Server Second Edition, published by Packt Publishing.
Discovery configuration

As we have already mentioned multiple times, Elasticsearch was designed to work in a distributed environment. This is the main difference when comparing Elasticsearch to other open source search and analytics solutions available. Because of this design, Elasticsearch is very easy to set up in a distributed environment, and we are not forced to set up additional software to make it work like this. By default, Elasticsearch assumes that the cluster is automatically formed by the nodes that declare the same cluster.name setting and can communicate with each other using multicast requests. This allows us to have several independent clusters in the same network.

There are a few implementations of the discovery module that we can use, so let's see what the options are.

Zen discovery

Zen discovery is the default mechanism responsible for discovery in Elasticsearch. The default Zen discovery configuration uses multicast to find other nodes. This is a very convenient solution: just start a new Elasticsearch node and everything works. The node will join the cluster if it has the same cluster name and is visible to other nodes in that cluster. This discovery method is perfectly suited for development time, because you don't need to care about the configuration; however, it is not advised that you use it in production environments. Relying only on the cluster name is handy but can also lead to potential problems and mistakes, such as the accidental joining of nodes. Sometimes, multicast is not available for various reasons, or you don't want to use it because of the problems just mentioned. In the case of bigger clusters, multicast discovery may generate too much unnecessary traffic, and this is another valid reason why it shouldn't be used for production.

For these cases, Zen discovery allows us to use the unicast mode.
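As a minimal sketch of the switch just described, the following elasticsearch.yml fragment turns multicast off and points a node at a fixed set of unicast hosts. The cluster name and host names are hypothetical placeholders; the setting names are the Zen discovery properties covered in the lists that follow.

```yaml
# Hypothetical cluster name; every node in the cluster must declare the same value
cluster.name: my-production-cluster
# Turn off multicast pinging so the node relies only on the explicit host list
discovery.zen.ping.multicast.enabled: false
# Initial hosts to ping (placeholders); this need not list every node in the cluster
discovery.zen.ping.unicast.hosts: ["master1", "master2:9300"]
```

Listing only a few stable, master eligible hosts is usually enough here, because a node learns the rest of the cluster topology once it has joined.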
When using the unicast Zen discovery, a node that is not a part of the cluster will send a ping request to all the addresses specified in the configuration. By doing this, it informs all the specified nodes that it is ready to be a part of the cluster and can either join an existing cluster or form a new one. Of course, after the node joins the cluster, it gets the cluster topology information, but the initial connection is only made to the specified list of hosts. Remember that even when using unicast Zen discovery, the Elasticsearch node still needs to have the same cluster name as the other nodes.

If you want to know more about the differences between multicast and unicast ping methods, refer to these URLs: http://en.wikipedia.org/wiki/Multicast and http://en.wikipedia.org/wiki/Unicast.

If you still want to learn about the configuration properties of multicast Zen discovery, let's look at them.

Multicast Zen discovery configuration

The multicast part of the Zen discovery module exposes the following settings:

discovery.zen.ping.multicast.address (the default: all available interfaces): This is the interface used for the communication, given as an address or an interface name.
discovery.zen.ping.multicast.port (the default: 54328): This port is used for communication.
discovery.zen.ping.multicast.group (the default: 224.2.2.4): This is the multicast address to send messages to.
discovery.zen.ping.multicast.buffer_size (the default: 2048): This is the size of the buffer used for multicast messages.
discovery.zen.ping.multicast.ttl (the default: 3): This is the time to live of a multicast message. Every time a packet crosses a router, the TTL is decreased, which limits the area in which the transmission can be received. Note that routers can have threshold values assigned that are compared to the TTL, so the TTL value may not exactly match the number of routers that a packet can cross.
discovery.zen.ping.multicast.enabled (the default: true): Setting this property to false turns off the multicast. You should disable multicast if you are planning to use the unicast discovery method.

The unicast Zen discovery configuration

The unicast part of Zen discovery provides the following configuration options:

discovery.zen.ping.unicast.hosts: This is the initial list of nodes in the cluster. The list can be defined as a list or as an array of hosts. Every host can be given as a name (or an IP address) and can have a port or port range added. For example, the value of this property can look like this: ["master1", "master2:8181", "master3[9300-9400]"]. The hosts list for the unicast discovery doesn't need to be a complete list of Elasticsearch nodes in your cluster, because once the node is connected to one of the mentioned nodes, it will be informed about all the others that form the cluster.
discovery.zen.ping.unicast.concurrent_connects (the default: 10): This is the maximum number of concurrent connections unicast discovery will use. If you have a lot of nodes that the initial connection should be made to, it is advised that you increase the default value.

Master node

One of the main purposes of discovery, apart from connecting to other nodes, is to choose a master node: a node that will take care of and manage all the other nodes. This process is called master election and is a part of the discovery module. No matter how many master eligible nodes there are, each cluster will only have a single master node active at a given time. If there is more than one master eligible node present in the cluster, one of the others can be elected as the master when the original master fails and is removed from the cluster.

Configuring master and data nodes

By default, Elasticsearch allows every node to be a master node and a data node.
However, in certain situations, you may want to have worker nodes, which will only hold the data or process the queries, and master nodes that will only be used for cluster management. One of these situations is handling a massive amount of data, where data nodes should be as performant as possible, and there shouldn't be any delay in master nodes' responses.

Configuring data-only nodes

To set the node to only hold data, we need to instruct Elasticsearch that we don't want such a node to be a master node. In order to do this, we add the following properties to the elasticsearch.yml configuration file:

    node.master: false
    node.data: true

Configuring master-only nodes

To set the node not to hold data and only to be a master node, we need to instruct Elasticsearch that we don't want such a node to hold data. In order to do that, we add the following properties to the elasticsearch.yml configuration file:

    node.master: true
    node.data: false

Configuring the query processing-only nodes

For large enough deployments, it is also wise to have nodes that are only responsible for aggregating query results from other nodes. Such nodes should be configured as nonmaster and nondata, so they should have the following properties in the elasticsearch.yml configuration file:

    node.master: false
    node.data: false

Please note that the node.master and the node.data properties are set to true by default, but we tend to include them for configuration clarity.

The master election configuration

We already wrote about the master election configuration in Elasticsearch Server Second Edition, but this topic is very important, so we decided to refresh our knowledge about it.

Imagine that you have a cluster that is built of 10 nodes. Everything is working fine until, one day, your network fails and three of your nodes are disconnected from the cluster, but they still see each other.
Because of the Zen discovery and the master election process, the nodes that got disconnected elect a new master and you end up with two clusters with the same name and two master nodes. Such a situation is called a split-brain and you must avoid it as much as possible. When a split-brain happens, you end up with two (or more) clusters that won't join each other until the network (or any other) problems are fixed. If you index your data during this time, you may end up with data loss and unrecoverable situations when the nodes get joined together after the network split.

In order to prevent split-brain situations, or at least minimize the possibility of their occurrence, Elasticsearch provides a discovery.zen.minimum_master_nodes property. This property defines the minimum number of master eligible nodes that should be connected to each other in order to form a cluster. So now, let's get back to our cluster; if we set the discovery.zen.minimum_master_nodes property to 50 percent of the total nodes available plus one (which is six, in our case), we would end up with a single cluster. Why is that? Before the network failure, we would have 10 nodes, which is more than six nodes, and these nodes would form a cluster. After the disconnection of the three nodes, the remaining seven nodes still number at least six, so the first cluster would stay up and running. However, because only three nodes disconnected and three is less than six, these three nodes wouldn't be allowed to elect a new master and they would wait for reconnection with the original cluster.

Zen discovery fault detection and configuration

Elasticsearch runs two detection processes while it is working. The first process is to send ping requests from the current master node to all the other nodes in the cluster to check whether they are operational. The second process is a reverse of that: each of the nodes sends ping requests to the master in order to verify that it is still up and running and performing its duties.
However, if we have a slow network or our nodes are in different hosting locations, the default configuration may not be sufficient. Because of this, the Elasticsearch discovery module exposes three properties that we can change:

discovery.zen.fd.ping_interval: This defaults to 1s and specifies how often the node will send ping requests to the target node.
discovery.zen.fd.ping_timeout: This defaults to 30s and specifies how long the node will wait for a response to the sent ping request. If your nodes are 100 percent utilized or your network is slow, you may consider increasing that property value.
discovery.zen.fd.ping_retries: This defaults to 3 and specifies the number of ping request retries before the target node will be considered not operational. You can increase this value if your network has a high number of lost packets (or you can fix your network).

There is one more thing that we would like to mention. The master node is the only node that can change the state of the cluster. To achieve a proper sequence of cluster state updates, the Elasticsearch master node processes cluster state update requests one at a time, makes the changes locally, and sends the request to all the other nodes so that they can synchronize their state. The master node waits a given time for the nodes to respond, and if the time passes or all the nodes have responded with an acknowledgment, it proceeds with processing the next cluster state update request. To change the time the master node waits for all the other nodes to respond, modify the default of 30 seconds by setting the discovery.zen.publish_timeout property. Increasing the value may be needed for huge clusters working in an overloaded network.

The Amazon EC2 discovery

Amazon, in addition to selling goods, has a few popular services such as selling storage or computing power in a pay-as-you-go model.
The Amazon Elastic Compute Cloud (EC2) provides server instances and, of course, they can be used to install and run Elasticsearch clusters (among many other things, as these are normal Linux machines). This is convenient: you pay for the instances that are needed in order to handle the current traffic or to speed up calculations, and you shut down unnecessary instances when the traffic is lower. Elasticsearch works well on EC2, but due to the nature of the environment, some features may work slightly differently. One of the features that works differently is discovery, because Amazon EC2 doesn't support multicast discovery. Of course, we can switch to unicast discovery, but sometimes, we want to be able to automatically discover nodes and, with unicast, we need to at least provide the initial list of hosts. However, there is an alternative: we can use the Amazon EC2 plugin, a plugin that combines the multicast and unicast discovery methods using the Amazon EC2 API.

Make sure that during the setup of EC2 instances, you set up communication between them (on ports 9200 and 9300 by default). This is crucial for Elasticsearch nodes to communicate with each other and, thus, for the cluster to function. Of course, this communication depends on the network.bind_host and network.publish_host (or network.host) settings.

The EC2 plugin installation

The installation of this plugin is as simple as with most of the plugins.
In order to install it, we should run the following command:

    bin/plugin install elasticsearch/elasticsearch-cloud-aws/2.4.0

The EC2 plugin's generic configuration

This plugin provides several configuration settings that we need to provide in order for the EC2 discovery to work:

cloud.aws.access_key: Amazon access key, one of the credential values you can find in the Amazon configuration panel
cloud.aws.secret_key: Amazon secret key; similar to the previously mentioned access_key setting, it can be found in the EC2 configuration panel

The last thing is to inform Elasticsearch that we want to use a new discovery type by setting the discovery.type property to the ec2 value and turning off multicast.

Optional EC2 discovery configuration options

The previously mentioned settings are sufficient to run the EC2 discovery, but in order to control the EC2 discovery plugin behavior, Elasticsearch exposes additional settings:

cloud.aws.region: This region will be used to connect with Amazon EC2 web services. You can choose the region where your instance resides, for example, eu-west-1 for Ireland. The possible values can be eu-west, sa-east, us-east, us-west-1, us-west-2, ap-southeast-1, and ap-southeast-2.
cloud.aws.ec2.endpoint: If you are using EC2 API services, instead of defining a region, you can provide the address of the AWS endpoint, for example, ec2.eu-west-1.amazonaws.com.
cloud.aws.protocol: This is the protocol that should be used by the plugin to connect to Amazon Web Services endpoints. By default, Elasticsearch will use the HTTPS protocol (which means setting the value of the property to https). We can also change this behavior and set the property to http for the plugin to use HTTP without encryption. We are also allowed to override the cloud.aws.protocol setting for each service by using the cloud.aws.ec2.protocol and cloud.aws.s3.protocol properties (the possible values are the same: https and http).
cloud.aws.proxy_host: Elasticsearch allows us to define a proxy that will be used to connect to AWS endpoints. The cloud.aws.proxy_host property should be set to the address of the proxy that should be used.
cloud.aws.proxy_port: The second property related to the AWS endpoints proxy allows us to specify the port on which the proxy is listening. The cloud.aws.proxy_port property should be set to the port on which the proxy listens.
discovery.ec2.ping_timeout (the default: 3s): This is the time to wait for the response to the ping message sent to the other node. After this time, the nonresponsive node will be considered dead and removed from the cluster. Increasing this value makes sense when dealing with network issues or when we have a lot of EC2 nodes.

The EC2 nodes scanning configuration

The last group of settings we want to mention allows us to configure a very important thing when building a cluster working inside the EC2 environment: the ability to filter the available Elasticsearch nodes in our Amazon Elastic Compute Cloud network. The Elasticsearch EC2 plugin exposes the following properties that can help us configure its behavior:

discovery.ec2.host_type: This allows us to choose the host type that will be used to communicate with other nodes in the cluster. The values we can use are private_ip (the default one; the private IP address will be used for communication), public_ip (the public IP address will be used for communication), private_dns (the private hostname will be used for communication), and public_dns (the public hostname will be used for communication).
discovery.ec2.groups: This is a comma-separated list of security groups. Only nodes that fall within these groups can be discovered and included in the cluster.
discovery.ec2.availability_zones: This is an array or comma-separated list of availability zones. Only nodes in the specified availability zones will be discovered and included in the cluster.
discovery.ec2.any_group (this defaults to true): Setting this property to false will force the EC2 discovery plugin to discover only those nodes that reside in an Amazon instance that falls into all of the defined security groups. The default value requires only a single group to be matched.
discovery.ec2.tag: This is a prefix for a group of EC2-related settings. When you launch your Amazon EC2 instances, you can define tags, which can describe the purpose of the instance, such as the customer name or environment type. You can then use these tags to limit which nodes are discovered. Let's say you define a tag named environment with a value of qa. In the configuration, you can now specify discovery.ec2.tag.environment: qa, and only nodes running on instances with this tag will be considered for discovery.
cloud.node.auto_attributes: When this is set to true, Elasticsearch will add EC2-related node attributes (such as the availability zone or group) to the node properties and will allow us to use them to adjust the Elasticsearch shard allocation and configure the shard placement.

Other discovery implementations

The Zen discovery and EC2 discovery are not the only discovery types that are available. There are two more discovery types that are developed and maintained by the Elasticsearch team, and these are:

Azure discovery: https://github.com/elasticsearch/elasticsearch-cloud-azure
Google Compute Engine discovery: https://github.com/elasticsearch/elasticsearch-cloud-gce

In addition to these, there are a few discovery implementations provided by the community, such as the ZooKeeper discovery for older versions of Elasticsearch (https://github.com/sonian/elasticsearch-zookeeper).

The gateway and recovery configuration

The gateway module allows us to store all the data that is needed for Elasticsearch to work properly.
This means that not only is the data in Apache Lucene indices stored, but also all the metadata (for example, index allocation settings), along with the mappings configuration for each index. Whenever the cluster state is changed, for example, when the allocation properties are changed, the cluster state will be persisted by using the gateway module. When the cluster is started up, its state will be loaded using the gateway module and applied.

One should remember that when configuring different nodes and different gateway types, indices will use the gateway type configuration present on the given node. If an index state should not be stored using the gateway module, one should explicitly set the index gateway type to none.

The gateway recovery process

To be explicit, the recovery process is what Elasticsearch uses to load the data stored through the gateway module so that Elasticsearch can work. Whenever a full cluster restart occurs, the gateway process kicks in to load all the relevant information we've mentioned: the metadata, the mappings, and of course, all the indices. When the recovery process starts, the primary shards are initialized first. Then, depending on the replica state, the replicas are initialized using the gateway data, or the data is copied from the primary shards if the replicas are out of sync.

Elasticsearch allows us to configure when the cluster data should be recovered using the gateway module. We can tell Elasticsearch to wait for a certain number of master eligible or data nodes to be present in the cluster before starting the recovery process. However, one should remember that until the cluster is recovered, no operations on it are allowed. This is done in order to avoid modification conflicts.

Configuration properties

Before we continue with the configuration, we would like to say one more thing.
As you know, Elasticsearch nodes can play different roles: they can be data nodes (the ones that hold data), they can have a master role, or they can be used only for request handling, which means not holding data and not being master eligible. Remembering all this, let's now look at the gateway configuration properties that we are allowed to modify:

gateway.recover_after_nodes: This is an integer number that specifies how many nodes should be present in the cluster for the recovery to happen. For example, when set to 5, at least 5 nodes (it doesn't matter whether they are data or master eligible nodes) must be present for the recovery process to start.
gateway.recover_after_data_nodes: This is an integer number that allows us to set how many data nodes should be present in the cluster for the recovery process to start.
gateway.recover_after_master_nodes: This is another gateway configuration option that allows us to set how many master eligible nodes should be present in the cluster for the recovery to start.
gateway.recover_after_time: This allows us to set how much time to wait before the recovery process starts after the conditions defined by the preceding properties are met. If we set this property to 5m, we tell Elasticsearch to start the recovery process 5 minutes after all the defined conditions are met. The default value for this property is 5m, starting from Elasticsearch 1.3.0.

Let's imagine that we have six nodes in our cluster, out of which four are data eligible. We also have an index that is built of three shards, which are spread across the cluster. The last two nodes are master eligible and they don't hold the data. What we would like to configure is the recovery process to be delayed for 3 minutes after the four data nodes are present.
Our gateway configuration could look like this:

    gateway.recover_after_data_nodes: 4
    gateway.recover_after_time: 3m

Expectations on nodes

In addition to the already mentioned properties, we can also specify properties that will force the recovery process of Elasticsearch to start. These properties are:

gateway.expected_nodes: This is the number of nodes expected to be present in the cluster for the recovery to start immediately. If you don't need the recovery to be delayed, it is advised that you set this property to the number of nodes (or at least most of them) from which the cluster will be formed, because that will guarantee that the latest cluster state will be recovered.
gateway.expected_data_nodes: This is the number of data eligible nodes expected to be present in the cluster for the recovery process to start immediately.
gateway.expected_master_nodes: This is the number of master eligible nodes expected to be present in the cluster for the recovery process to start immediately.

Now, let's get back to our previous example. We know that when all six nodes are connected and are in the cluster, we want the recovery to start. So, in addition to the preceding configuration, we would add the following property:

    gateway.expected_nodes: 6

So the whole configuration would look like this:

    gateway.recover_after_data_nodes: 4
    gateway.recover_after_time: 3m
    gateway.expected_nodes: 6

The preceding configuration says that the recovery process will be delayed for 3 minutes once four data nodes join the cluster and will begin immediately after six nodes are in the cluster (it doesn't matter whether they are data nodes or master eligible nodes).

The local gateway

With the release of Elasticsearch 0.20 (and some of the releases from 0.19 versions), all the gateway types, apart from the default local gateway type, were deprecated. It is advised that you do not use them, because they will be removed in future versions of Elasticsearch.
They have not been removed yet, but if you want to avoid full data reindexing, you should only use the local gateway type, and this is why we won't discuss all the other types.

The local gateway type uses the local storage available on a node to store the metadata, mappings, and indices. In order to use this gateway type and the local storage available on the node, there needs to be enough disk space to hold the data with no memory caching.

The persistence to the local gateway is different from the other gateways that are currently present (but deprecated). The writes to this gateway are done in a synchronous manner in order to ensure that no data will be lost during the write process.

In order to set the type of gateway that should be used, one should use the gateway.type property, which is set to local by default.

There is one additional thing regarding the local gateway of Elasticsearch that we didn't talk about: dangling indices. When a node joins a cluster, all the shards and indices that are present on the node, but are not present in the cluster, will be included in the cluster state. Such indices are called dangling indices, and we are allowed to choose how Elasticsearch should treat them.

Elasticsearch exposes the gateway.local.auto_import_dangling property, which can take the value of yes (the default value, which results in importing all dangling indices into the cluster), close (results in importing the dangling indices into the cluster state but keeps them closed by default), and no (results in removing the dangling indices). When setting the gateway.local.auto_import_dangling property to no, we can also set the gateway.local.dangling_timeout property (defaults to 2h) to specify how long Elasticsearch will wait before deleting the dangling indices. The dangling indices feature can be nice when we restart old Elasticsearch nodes and we don't want old indices to be included in the cluster.
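Putting the gateway settings from this section together, an elasticsearch.yml fragment for the six-node example might look like the following sketch. The node counts and the delay repeat the example above; choosing close for dangling indices is an assumption made here for illustration, not a recommendation from the text.

```yaml
# Local gateway (the default type), stated explicitly for clarity
gateway.type: local
# Once the four data nodes are present, delay recovery by 3 minutes...
gateway.recover_after_data_nodes: 4
gateway.recover_after_time: 3m
# ...unless all six nodes have joined, in which case recovery starts immediately
gateway.expected_nodes: 6
# Illustrative choice: import dangling indices into the cluster state but keep them closed
gateway.local.auto_import_dangling: close
```

Keeping dangling indices closed rather than importing them open leaves the decision to an operator, who can inspect them before they start serving requests.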
Low-level recovery configuration

We discussed that we can use the gateway to configure the behavior of the Elasticsearch recovery process, but in addition to that, Elasticsearch allows us to configure the recovery process itself. We decided that the section dedicated to gateway and recovery is a good place to mention the properties we can use.

Cluster-level recovery configuration

The recovery configuration is specified mostly on the cluster level and allows us to set general rules for the recovery module to work with. These settings are:

indices.recovery.concurrent_streams: This defaults to 3 and specifies the number of concurrent streams that are allowed to be opened in order to recover a shard from its source. The higher the value of this property, the more pressure will be put on the networking layer; however, the recovery may be faster, depending on your network usage and throughput.
indices.recovery.max_bytes_per_sec: By default, this is set to 20MB and specifies the maximum amount of data that can be transferred per second during shard recovery. In order to disable data transfer limiting, one should set this property to 0. Similar to the number of concurrent streams, this property allows us to control the network usage of the recovery process. Setting this property to higher values may result in higher network utilization and a faster recovery process.
indices.recovery.compress: This is set to true by default and allows us to define whether Elasticsearch should compress the data that is transferred during the recovery process. Setting this to false may lower the pressure on the CPU, but it will also result in more data being transferred over the network.
indices.recovery.file_chunk_size: This is the chunk size used to copy the shard data from the source shard. By default, it is set to 512KB and is compressed if the indices.recovery.compress property is set to true.
indices.recovery.translog_ops: This defaults to 1000 and specifies how many transaction log lines should be transferred between shards in a single request during the recovery process.
indices.recovery.translog_size: This is the chunk size used to copy the shard transaction log data from the source shard. By default, it is set to 512KB and is compressed if the indices.recovery.compress property is set to true.

In the versions prior to Elasticsearch 0.90.0, there was the indices.recovery.max_size_per_sec property that could be used, but it was deprecated, and it is suggested that you use the indices.recovery.max_bytes_per_sec property instead. However, if you are using an Elasticsearch version older than 0.90.0, it may be worth remembering this.

All the previously mentioned settings can be updated using the Cluster Update API, or they can be set in the elasticsearch.yml file.

Index-level recovery settings

In addition to the values mentioned previously, there is a single property that can be set on a per-index basis. The property can be set both in the elasticsearch.yml file and using the indices Update Settings API, and it is called index.recovery.initial_shards. In general, Elasticsearch will only recover a particular shard when there is a quorum of shards present and if that quorum can be allocated. A quorum is 50 percent of the shards for the given index plus one. By using the index.recovery.initial_shards property, we can change what Elasticsearch will take as a quorum. This property can be set to one of the following values:

quorum: 50 percent of the shards for the given index, plus one, need to be present and be allocable. This is the default value.
quorum-1: 50 percent of the shards for a given index need to be present and be allocable.
full: All of the shards for the given index need to be present and be allocable.
full-1: 100 percent minus one shards for the given index need to be present and be allocable.
integer value: Any integer such as 1, 2, or 5 specifies the number of shards that need to be present and allocable. For example, setting this value to 2 means that at least two shards need to be present and allocable.

It is good to know about this property, but the default value will be sufficient for most deployments.

Summary

In this article, we focused on the Elasticsearch configuration and the new features that were introduced in Elasticsearch 1.0. We configured discovery and recovery, and we used the human-friendly Cat API. In addition to that, we used the backup and restore functionality, which allowed easy backup and recovery of our indices. Finally, we looked at what federated search is and how to search and index data to multiple clusters, while still using all the functionalities of Elasticsearch and being connected to a single node.

If you want to dig deeper, buy the book Mastering Elasticsearch, Second Edition and follow its simple step-by-step approach to using Elasticsearch to enhance your knowledge further.

Resources for Article:

Further resources on this subject:
Downloading and Setting Up ElasticSearch [Article]
Indexing the Data [Article]
Driving Visual Analyses with Automobile Data (Python) [Article]

Packt
03 Mar 2015
13 min read

Speeding Vagrant Development With Docker

In this article by Chad Thompson, author of Vagrant Virtual Development Environment Cookbook, we look at using Vagrant together with Docker. Many software developers are familiar with using Vagrant (http://vagrantup.com) to distribute and maintain development environments. In most cases, Vagrant is used to manage virtual machines running in desktop hypervisor software such as VirtualBox or the VMware desktop products (VMware Fusion for OS X and VMware Workstation for Linux and Windows environments).

More recently, Docker (http://docker.io) has become increasingly popular for deploying containers: Linux processes that can run in a single operating system environment yet be isolated from one another. In practice, this means that a container includes the runtime environment for an application, down to the operating system level. While containers have been popular for deploying applications, we can also use them for desktop development. Vagrant can use Docker in a couple of ways:

As a target for running a process defined by Vagrant, with Vagrant's Docker provider.
As a complete development environment for building and testing containers within the context of a virtual machine. This allows you to build a complete production-like container deployment environment with Vagrant's Docker provisioner.

In this example, we'll take a look at how we can use the Docker provider to build and run a web server. Running our web server with Docker will allow us to build and test our web application without the added overhead of booting and provisioning a virtual machine.

Introducing the Vagrant Provider

The Vagrant Docker provider will build and deploy containers to a Docker runtime. There are a couple of cases to consider when using Vagrant with Docker:

On a Linux host machine, Vagrant will use a native (locally installed) Docker environment to deploy containers. Make sure that Docker is installed before using Vagrant.
Docker itself is a technology built on top of Linux Containers (LXC) technology, so Docker requires an operating system with a recent version of the Linux kernel (newer than Linux 3.8, which was released in February 2013). Most recent Linux distributions should support the ability to run Docker.
On non-Linux environments (namely OS X and Windows), the provider will require a local Linux runtime to be present for deploying containers. When running the Docker provider in these environments, Vagrant will download and boot a version of the boot2docker (http://boot2docker.io) environment; in this case, a repackaging of boot2docker in Vagrant box format.

Let's take a look at two scenarios for using the Docker provider. In each of these examples, we'll start these environments from an OS X environment, so we will see some tasks that are required for using the boot2docker environment.

Installing a Docker image from a repository

We'll start with a simple case: installing a Docker container from a repository (a MySQL container) and connecting it to an external tool for development (the MySQL Workbench or a client tool of your choice). We'll need to initialize the boot2docker environment and use some Vagrant tools to interact with the environment and the deployed containers.

Before we can start, we'll need to find a suitable Docker image to launch. One of the unique advantages of using Docker as a development environment is the ability to select a base Docker image, then add successive build steps on top of the base image. In this simple example, we can find a base MySQL image on the Docker Hub registry (https://registry.hub.docker.com). The MySQL project provides an official Docker image that we can build from. We'll note from the repository the command for using the image, docker pull mysql, and note that the image name is mysql.
Start with a Vagrantfile that defines the Docker provider:

    # -*- mode: ruby -*-
    # vi: set ft=ruby :

    VAGRANTFILE_API_VERSION = "2"
    ENV['VAGRANT_DEFAULT_PROVIDER'] = 'vmware_fusion'

    Vagrant.configure(VAGRANTFILE_API_VERSION) do |config|
      # A machine named "database", backed by the Docker provider
      config.vm.define "database" do |db|
        db.vm.provider "docker" do |d|
          # Image name as listed on the Docker Hub registry
          d.image = "mysql"
        end
      end
    end

An important thing to note immediately is that when we define the database machine with the Docker provider, we do not specify a box file. The Docker provider will start and launch containers into a boot2docker environment, negating the need for a Vagrant box or virtual machine definition. This will introduce a bit of a complication in interacting with the Vagrant environment in later steps. Also note the mysql image taken from the Docker Hub Registry.

We'll need to launch the image with a few basic parameters. Add the following to the Docker provider block:

    db.vm.provider "docker" do |d|
      d.image = "mysql"
      # Environment variables read by the MySQL image
      d.env = {
        :MYSQL_ROOT_PASSWORD => "root",
        :MYSQL_DATABASE      => "dockertest",
        :MYSQL_USER          => "dockertest",
        :MYSQL_PASSWORD      => "d0cker"
      }
      d.ports = ["3306:3306"]
      d.remains_running = true
    end

The environment variables (d.env) are taken from the documentation on the MySQL Docker image page (https://registry.hub.docker.com/_/mysql/). This is how the image expects to set certain parameters. In this case, our parameters will set the database root password (for the root user) and create a database with a new user that has full permissions to that database. The d.ports parameter is an array of port listings that will be forwarded from the container (the default MySQL port of 3306) to the host operating system, in this case also 3306. The contained application will, thus, behave like a natively installed MySQL installation.
The port forwarding here is from the container to the operating system that hosts the container (in this case, the container host is our boot2docker image). If we are developing and hosting containers natively with Vagrant on a Linux distribution, the port forwarding will be to localhost, but boot2docker introduces something of a wrinkle in doing Docker development on Windows or OS X. We'll either need to refer to our software installation by the IP of the boot2docker container or configure a second port forwarding configuration that allows a Docker contained application to be available to the host operating system as localhost.

The final parameter (d.remains_running = true) is a flag that tells Vagrant to mark the run as failed if the Docker container exits on start. In the case of software that runs as a daemon process (such as the MySQL database), a Docker container that exits immediately is an error condition.

Start the container using the vagrant up --provider=docker command. A few things will happen here:

If this is the first time you have started the project, you'll see some messages about booting a box named mitchellh/boot2docker. This is a Vagrant-packaged version of the boot2docker project. Once the machine boots, it becomes a host for all Docker containers managed with Vagrant. Keep in mind that boot2docker is necessary only for non-Linux operating systems that are running Docker through a virtual machine. On a Linux system running Docker natively, you will not see information about boot2docker.
After the container is booted (or if it is already running), Vagrant will display notifications about rsyncing a folder (if we are using boot2docker) and launching the image. Docker generates unique identifiers for containers and notes any port mapping information.

Let's take a look at some details on the containers that are running in the Docker host.
We'll need to find a way to gain access to the Vagrant boot2docker image (and only if we are using boot2docker and not a native Linux environment), which is not quite as straightforward as a vagrant ssh; we'll need to identify the Vagrant machine to access.

First, identify the Docker Vagrant machine from the global Vagrant status. Vagrant keeps track of running instances that can be accessed from Vagrant itself. In this case, we are only interested in the Vagrant instance named docker-host. The instance we're interested in can be found with the vagrant global-status command. In this case, Vagrant identifies the instance as d381331 (a unique value for every Vagrant machine launched). We can access this instance with a vagrant ssh command:

    vagrant ssh d381331

This will display an ASCII-art boot2docker logo and a command prompt for the boot2docker instance. Let's take a look at the Docker containers running on the system with the docker ps command. The docker ps command will provide information about the running Docker containers on the system; in this case, the unique ID of the container (output during the Vagrant startup) and other information about the container.

Find the IP address of the boot2docker instance (only if we're using boot2docker) to connect to the MySQL instance. In this case, execute the ifconfig command:

    docker@boot2docker:~$ ifconfig

This will output information about the network interfaces on the machine; we are interested in the eth0 entry. In particular, we can note the IP address of the machine on the eth0 interface. Make a note of the IP address listed as the inet addr; in this case 192.168.30.129.

Connect a MySQL client to the running Docker container. In this case, we'll need to note some information for the connection:

The IP address of the boot2docker virtual machine (if using boot2docker). In this case, we'll note 192.168.30.129.
The port that the MySQL instance will respond to on the Docker host.
In this case, the Docker container is forwarding port 3306 in the container to port 3306 on the host.
Information noted in the Vagrantfile for the username or password on the MySQL instance.

With this information in hand, we can configure a MySQL client. The MySQL project provides a supported GUI client named MySQL Workbench (http://www.mysql.com/products/workbench/). With the client installed on our host operating system, we can create a new connection in the Workbench client (consult the documentation for your version of Workbench, or use a MySQL client of your choice). In this case, we're connecting to the boot2docker instance. If you are running Docker natively on a Linux instance, the connection should simply forward to localhost. If the connection is successful, the Workbench client will display an empty database.

Once we've connected, we can use the MySQL database as we would any other MySQL instance; this time, it is hosted in a Docker container, without our having to install and configure the MySQL package itself.

Building a Docker image with Vagrant

While launching packaged Docker applications can be useful (particularly in the case where launching a Docker container is simpler than native installation steps), Vagrant becomes even more useful when used to launch containers that are being developed. On OS X and Windows machines, the use of Vagrant can make managing the container deployment somewhat simpler through the boot2docker containers, while on Linux, using the native Docker tools could be somewhat simpler. In this example, we'll use a simple Dockerfile to modify a base image.

First, start with a simple Vagrantfile. In this case, we'll specify a build directory rather than an image file:

    # -*- mode: ruby -*-
    # vi: set ft=ruby :

    # Vagrantfile API/syntax version. Don't touch unless you know what you're doing!
    VAGRANTFILE_API_VERSION = "2"
    ENV['VAGRANT_DEFAULT_PROVIDER'] = 'vmware_fusion'

    Vagrant.configure(VAGRANTFILE_API_VERSION) do |config|
      config.vm.define "nginx" do |nginx|
        nginx.vm.provider "docker" do |d|
          d.build_dir = "build"
          d.ports = ["49153:80"]
        end
      end
    end

This Vagrantfile specifies a build directory as well as the ports forwarded to the host from the container. In this case, the standard HTTP port (80) forwards to port 49153 on the host machine, which in this case is the boot2docker instance.

Create our build directory in the same directory as the Vagrantfile. In the build directory, create a Dockerfile. A Dockerfile is a set of instructions on how to build a Docker container. See https://docs.docker.com/reference/builder/ or James Turnbull's The Docker Book for more information on how to construct a Dockerfile. In this example, we'll use a simple Dockerfile to copy a working HTML directory to a base NGINX image:

    FROM nginx
    COPY content /usr/share/nginx/html

Create a directory in our build directory named content. In the directory, place a simple index.html file that will be served from the new container:

    <html>
    <body>
       <div style="text-align:center;padding-top:40px;border:dashed 2px;">
         This is an NGINX build.
       </div>
    </body>
    </html>

Once all the pieces are in place, our working directory will have the following structure:

    .
    ├── Vagrantfile
    └── build
        ├── Dockerfile
        └── content
            └── index.html

Start the container in the working directory with the command:

    vagrant up nginx --provider=docker

This will start the container build and deploy process. Once the container is launched, the web server can be accessed using the IP address of the boot2docker instance (see the previous section for more information on obtaining this address) and the forwarded port.
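To make the "host:container" form of the d.ports entries concrete, here is a minimal Python sketch (Python is used purely for illustration; it is not part of the Vagrant or Docker tooling). The helper names published_url and port_is_open are hypothetical, and 192.168.30.129 is assumed to be the boot2docker address noted in the previous section; substitute your own value.

```python
import socket


def published_url(docker_host, mapping, scheme="http"):
    """Build a URL from a Docker host address and a "host:container" port mapping."""
    # The first number is the port published on the Docker host,
    # the second is the port inside the container.
    host_port, _container_port = mapping.split(":")
    return "%s://%s:%d/" % (scheme, docker_host, int(host_port))


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Assumption: the address below is the boot2docker IP found with ifconfig.
url = published_url("192.168.30.129", "49153:80")
print(url)
# port_is_open("192.168.30.129", 49153) can then confirm that the
# forwarded port is reachable before opening the URL in a browser.
```

The same check applies to the MySQL example by passing port 3306 instead of 49153.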
One other item to note, especially if you have completed both steps in this section without halting or destroying the Vagrant project, is that when using the Docker provider, containers are deployed to a single shared virtual machine. If the boot2docker instance is accessed and the docker ps command is executed, it can be noted that two separate Vagrant projects deploy containers to a single host. When using the Docker provider, the single instance has a few effects:

The single virtual machine can use fewer resources on your development workstation
Deploying and rebuilding containers is a process that is much faster than booting and shutting down entire operating systems

Docker development with the Docker provider can be a useful technique to create and test Docker containers, although Vagrant might not be of particular help in packaging and distributing Docker containers. If you wish to publish containers, consult the documentation or The Docker Book on getting started with packaging and distributing Docker containers.

See also

Docker: http://docker.io
boot2docker: http://boot2docker.io
The Docker Book: http://www.dockerbook.com
The Docker repository: https://registry.hub.docker.com

Summary

In this article, we learned how to use the Docker provider with Vagrant, both to run a container from a published image and to build and run a container from a Dockerfile.

Resources for Article:

Further resources on this subject:
Going Beyond the Basics [article]
Module, Facts, Types and Reporting tools in Puppet [article]
Setting Up a Development Environment [article]
Packt
03 Mar 2015
14 min read

SciPy for Signal Processing

In this article by Sergio J. Rojas G. and Erik A Christensen, authors of the book Learning SciPy for Numerical and Scientific Computing - Second Edition, we will focus on the usage of some of the most commonly used routines included in three SciPy modules: scipy.signal, scipy.ndimage, and scipy.fftpack, which are used for signal processing, multidimensional image processing, and computing Fourier transforms, respectively.

We define a signal as data that measures either a time-varying or spatially varying phenomenon. Sound or electrocardiograms are excellent examples of time-varying quantities, while images embody the quintessential spatially varying cases. Moving images are treated with the techniques of both types of signals. The field of signal processing treats four aspects of this kind of data: its acquisition, quality improvement, compression, and feature extraction. SciPy has many routines to deal effectively with tasks in any of the four fields. All these are included in two low-level modules (scipy.signal being the main module, with an emphasis on time-varying data, and scipy.ndimage, for images). Many of the routines in these two modules are based on the Discrete Fourier Transform of the data. SciPy has an extensive package of applications and definitions of these background algorithms, scipy.fftpack, which we will cover first.

Discrete Fourier Transforms

The Discrete Fourier Transform (DFT from now on) transforms any signal from its time/space domain into a related signal in the frequency domain. This allows us not only to analyze the different frequencies of the data, but also to perform faster filtering operations, when used properly. It is possible to turn a signal in the frequency domain back to its time/spatial domain, thanks to the Inverse Fourier Transform. We will not go into detail of the mathematics behind these operators, since we assume familiarity at some level with this theory.
We will focus on syntax and applications instead. The basic routines in the scipy.fftpack module compute the DFT and its inverse for discrete signals in any dimension: fft and ifft (one dimension), fft2 and ifft2 (two dimensions), and fftn and ifftn (any number of dimensions). All of these routines assume that the data is complex valued. If we know beforehand that a particular dataset is actually real valued, and we want a real-valued representation of its frequencies, we use rfft and irfft instead, for a faster algorithm. All these routines are designed so that composition with their inverses always yields the identity. The syntax is the same in all cases, as follows:

    fft(x[, n, axis, overwrite_x])

The first parameter, x, is always the signal in any array-like form. Note that fft performs one-dimensional transforms. This means, in particular, that if x happens to be two-dimensional, for example, fft will output another two-dimensional array where each row is the transform of the corresponding row of the original. We can change it to columns instead, with the optional parameter, axis. The rest of the parameters are also optional; n indicates the length of the transform, and overwrite_x allows the routine to overwrite the original data to save memory and resources. We usually play with the integer n when we need to pad the signal with zeros, or truncate it. For higher dimensions, n is substituted by shape (a tuple), and axis by axes (another tuple). To better understand the output, it is often useful to shift the zero frequencies to the center of the output arrays with fftshift. The inverse of this operation, ifftshift, is also included in the module.
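As a small numeric sketch of the calling conventions just described (not taken from the book), the following script checks the round trip through ifft and irfft, the effect of n on padding and truncation, the axis parameter, and the fftshift/ifftshift pair. The sample values and variable names are made up for illustration.

```python
import numpy
from scipy.fftpack import fft, ifft, rfft, irfft, fftshift, ifftshift

# Illustrative real-valued signal (the values are arbitrary).
x = numpy.array([1.0, 2.0, 0.0, -1.0, 0.5, 3.0, -2.0, 1.5])

# Composition with the inverse yields the original signal
# (up to floating-point rounding; the result has a complex dtype).
X = fft(x)
roundtrip = ifft(X)

# rfft and irfft are the real-valued counterparts and also invert each other.
real_roundtrip = irfft(rfft(x))

# n pads with zeros (n > len(x)) or truncates (n < len(x)).
padded = fft(x, n=16)
truncated = fft(x, n=4)

# On a two-dimensional input, fft transforms each row by default;
# axis=0 transforms each column instead.
rows = numpy.vstack([x, 2 * x])
by_rows = fft(rows)
by_cols = fft(rows, axis=0)

# fftshift moves the zero-frequency term to the center; ifftshift undoes it.
centered = fftshift(X)
restored = ifftshift(centered)

print(numpy.allclose(roundtrip, x), padded.shape, truncated.shape)
```

Because the transform is linear, the second row of by_rows is simply twice the transform of x, which is a quick way to convince yourself that rows are handled independently.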
The following code shows some of these routines in action, when applied to a checkerboard image:

    >>> import numpy
    >>> from scipy.fftpack import fft,fft2, fftshift
    >>> import matplotlib.pyplot as plt
    >>> B=numpy.ones((4,4)); W=numpy.zeros((4,4))
    >>> signal = numpy.bmat("B,W;W,B")
    >>> onedimfft = fft(signal,n=16)
    >>> twodimfft = fft2(signal,shape=(16,16))
    >>> plt.figure()
    >>> plt.gray()
    >>> plt.subplot(121,aspect='equal')
    >>> plt.pcolormesh(onedimfft.real)
    >>> plt.colorbar(orientation='horizontal')
    >>> plt.subplot(122,aspect='equal')
    >>> plt.pcolormesh(fftshift(twodimfft.real))
    >>> plt.colorbar(orientation='horizontal')
    >>> plt.show()

Note how the first four rows of the one-dimensional transform are equal (and so are the last four), while the two-dimensional transform (once shifted) presents a peak at the origin, and nice symmetries in the frequency domain. In the plot obtained from the preceding code, the left-hand side image is fft and the right-hand side image is fft2 of a 2 x 2 checkerboard signal.

The scipy.fftpack module also offers the Discrete Cosine Transform with its inverse (dct, idct) as well as many differential and pseudo-differential operators defined in terms of all these transforms: diff (for derivative/integral), hilbert and ihilbert (for the Hilbert transform), tilbert and itilbert (for the h-Tilbert transform of periodic sequences), and so on.

Signal construction

To aid in the construction of signals with predetermined properties, the scipy.signal module has a nice collection of the most frequent one-dimensional waveforms in the literature: chirp and sweep_poly (for the frequency-swept cosine generator), gausspulse (a Gaussian modulated sinusoid) and sawtooth and square (for the waveforms with those names). They all take as their main parameter a one-dimensional ndarray representing the times at which the signal is to be evaluated. Other parameters control the design of the signal, according to frequency or time constraints.
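Before plotting, it can help to evaluate these generators at a handful of time points and look at the raw numbers. The sketch below is only an illustration with made-up sample points: it assumes the default period of 2*pi for sawtooth and square, and the frequency arguments passed to chirp and gausspulse are arbitrary.

```python
import numpy
from scipy.signal import chirp, gausspulse, sawtooth, square

# Time points expressed as fractions of one 2*pi period.
saw = sawtooth(2 * numpy.pi * numpy.array([0.0, 0.25, 0.5, 0.75]))
sq = square(2 * numpy.pi * numpy.array([0.1, 0.4, 0.6, 0.9]))

# The sawtooth rises linearly from -1 to 1 over a period; the square
# wave is +1 during the first half of the period and -1 during the second.
print(saw)
print(sq)

# Both chirp (a swept cosine) and gausspulse (a cosine under a Gaussian
# envelope centered at zero) start from the value 1 at t = 0.
t0 = numpy.array([0.0])
start_values = numpy.concatenate([chirp(t0, f0=100, t1=0.5, f1=200),
                                  gausspulse(t0, fc=10, bw=0.5)])
print(start_values)
```

Checks like these make it easier to pick sensible axis limits and sampling densities when the same signals are plotted.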
Let's take a look at the following code snippet, which illustrates the use of the one-dimensional waveforms that we just discussed:

    >>> import numpy
    >>> from scipy.signal import chirp, sawtooth, square, gausspulse
    >>> import matplotlib.pyplot as plt
    >>> t=numpy.linspace(-1,1,1000)
    >>> plt.subplot(221); plt.ylim([-2,2])
    >>> plt.plot(t,chirp(t,f0=100,t1=0.5,f1=200))   # plot a chirp
    >>> plt.subplot(222); plt.ylim([-2,2])
    >>> plt.plot(t,gausspulse(t,fc=10,bw=0.5))      # Gauss pulse
    >>> plt.subplot(223); plt.ylim([-2,2])
    >>> t*=3*numpy.pi
    >>> plt.plot(t,sawtooth(t))                     # sawtooth
    >>> plt.subplot(224); plt.ylim([-2,2])
    >>> plt.plot(t,square(t))                       # Square wave
    >>> plt.show()

The diagram generated by this code shows waveforms for chirp (upper-left), gausspulse (upper-right), sawtooth (lower-left), and square (lower-right).

The usual method of creating signals is to import them from a file. This is possible by using purely NumPy routines, for example fromfile:

    fromfile(file, dtype=float, count=-1, sep='')

The file argument may point to either a file or a string, the count argument is used to determine the number of items to read, and sep indicates what constitutes a separator in the original file/string. For images, we have the versatile routine, imread in either the scipy.ndimage or scipy.misc module:

    imread(fname, flatten=False)

The fname argument is a string containing the location of an image. The routine infers the type of file, and reads the data into an array, accordingly. In case the flatten argument is set to True, the image is converted to gray scale. Note that, in order to work, the Python Imaging Library (PIL) needs to be installed. It is also possible to load .wav files for analysis, with the read and write routines from the wavfile submodule in the scipy.io module.
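A self-contained way to try the wavfile routines without an existing recording is to write a synthetic tone to a temporary file and read it back. This is only a sketch: the 440 Hz tone, the 8000 samples-per-second rate, and the file name tone.wav are arbitrary choices made for illustration.

```python
import os
import tempfile

import numpy
from scipy.io import wavfile

rate = 8000                                   # samples per second
t = numpy.arange(rate) / rate                 # one second of time points
tone = numpy.sin(2 * numpy.pi * 440 * t)      # 440 Hz sine wave in [-1, 1]

# Scale to 16-bit integers, a common sample format for .wav files.
samples = (tone * 32767).astype(numpy.int16)

# Write the samples to a temporary file, then read them back.
path = os.path.join(tempfile.mkdtemp(), "tone.wav")
wavfile.write(path, rate, samples)
rate_back, data_back = wavfile.read(path)

print(rate_back, data_back.dtype, data_back.shape)
```

The array returned by read holds the raw sample values, one entry per sample (and one column per channel for multi-channel audio), so it can be passed directly to the filters and transforms discussed in this article.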
For instance, given any audio file with this format, say audio.wav, the command rate,data = scipy.io.wavfile.read("audio.wav") assigns an integer value to the rate variable, indicating the sample rate of the file (in samples per second), and a NumPy ndarray to the data variable, containing the audio samples. If we wish to write some one-dimensional ndarray data into an audio file of this kind, with the sample rate given by the rate variable, we may do so by issuing the following command:

    >>> scipy.io.wavfile.write("filename.wav",rate,data)

Filters

A filter is an operation on signals that either removes features or extracts some component. SciPy has a very complete set of known filters, as well as the tools to allow construction of new ones. The complete list of filters in SciPy is long, and we encourage the reader to explore the help documents of the scipy.signal and scipy.ndimage modules for the complete picture. We will introduce in these pages, as an exposition, some of the most used filters in the treatment of audio or image processing.

We start by creating a signal worth filtering:

    >>> from numpy import sin, cos, pi, linspace
    >>> f=lambda t: cos(pi*t) + 0.2*sin(5*pi*t+0.1) + 0.2*sin(30*pi*t) + 0.1*sin(32*pi*t+0.1) + 0.1*sin(47*pi*t+0.8)
    >>> t=linspace(0,4,400); signal=f(t)

We first test the classical smoothing filter of Wiener and Kolmogorov, wiener. We present in a plot the original signal (in black) and the corresponding filtered data, with a choice of a Wiener window of size 55 samples (in blue).
Next, we compare the result of applying the median filter, medfilt, with a kernel of the same size as before (in red):

    >>> from scipy.signal import wiener, medfilt
    >>> import matplotlib.pylab as plt
    >>> plt.plot(t,signal,'k')
    >>> plt.plot(t,wiener(signal,mysize=55),'b',linewidth=3)
    >>> plt.plot(t,medfilt(signal,kernel_size=55),'r',linewidth=3)
    >>> plt.show()

This gives us a graph showing the comparison of smoothing filters (wiener is the one that has its starting point just below 0.5 and medfilt has its starting point just above 0.5).

Most of the filters in the scipy.signal module can be adapted to work on arrays of any dimension. But in the particular case of images, we prefer to use the implementations in the scipy.ndimage module, since they are coded with these objects in mind. For instance, to perform a median filter on an image for smoothing, we use scipy.ndimage.median_filter. Let's see an example. We will start by loading Lena into an array and corrupting the image with Gaussian noise (zero mean and standard deviation of 16):

    >>> from scipy.stats import norm     # Gaussian distribution
    >>> import matplotlib.pyplot as plt
    >>> import scipy.misc
    >>> import scipy.ndimage
    >>> plt.gray()
    >>> lena=scipy.misc.lena().astype(float)
    >>> plt.subplot(221);
    >>> plt.imshow(lena)
    >>> lena+=norm(loc=0,scale=16).rvs(lena.shape)
    >>> plt.subplot(222);
    >>> plt.imshow(lena)
    >>> denoised_lena = scipy.ndimage.median_filter(lena,3)
    >>> plt.subplot(224);
    >>> plt.imshow(denoised_lena)

The set of filters for images come in two flavors: statistical and morphological. For example, among the filters of statistical nature, we have the Sobel algorithm oriented to detection of edges (singularities along curves). Its syntax is as follows:

    sobel(image, axis=-1, output=None, mode='reflect', cval=0.0)

The optional parameter, axis, indicates the dimension in which the computations are performed. By default, this is always the last axis (-1).
The mode parameter, which is one of the strings 'reflect', 'constant', 'nearest', 'mirror', or 'wrap', indicates how to handle the border of the image, in case there is insufficient data to perform the computations there. In case the mode is 'constant', we may indicate the value to use in the border, with the cval parameter. Let's look into the following code snippet, which illustrates the use of the sobel filter:

    >>> # continues the session above (scipy.misc and plt are already imported)
    >>> from scipy.ndimage.filters import sobel
    >>> import numpy
    >>> lena=scipy.misc.lena()
    >>> sblX=sobel(lena,axis=0); sblY=sobel(lena,axis=1)
    >>> sbl=numpy.hypot(sblX,sblY)
    >>> plt.subplot(223);
    >>> plt.imshow(sbl)
    >>> plt.show()

The resulting figure illustrates Lena (upper-left) and noisy Lena (upper-right) with the preceding two filters in action: edge map with sobel (lower-left) and median filter (lower-right).

Morphology

We also have the possibility of creating and applying filters to images based on mathematical morphology, both to binary and gray-scale images. The four basic morphological operations are opening (binary_opening), closing (binary_closing), dilation (binary_dilation), and erosion (binary_erosion). Note that the syntax for each of these filters is very simple, since we only need two ingredients: the signal to filter and the structuring element to perform the morphological operation. Let's take a look into the general syntax for these morphological operations:

    binary_operation(signal, structuring_element)

We may use combinations of these four basic morphological operations to create more complex filters for removal of holes, hit-or-miss transforms (to find the location of specific patterns in binary images), denoising, edge detection, and many more. The SciPy module also allows for creating some common filters using the preceding syntax.
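The four basic binary operations can be tried on a tiny synthetic image, where the effect of each one is easy to verify by counting pixels. This is a minimal sketch under made-up inputs: a 5 x 5 square in a 9 x 9 image and a flat 3 x 3 structuring element, both chosen only for illustration.

```python
import numpy
from scipy.ndimage import (binary_erosion, binary_dilation,
                           binary_opening, binary_closing)

# A solid 5 x 5 square inside a 9 x 9 binary image.
square = numpy.zeros((9, 9), dtype=bool)
square[2:7, 2:7] = True

# Flat 3 x 3 structuring element.
structure = numpy.ones((3, 3), dtype=bool)

eroded = binary_erosion(square, structure)     # shrinks the square to 3 x 3
dilated = binary_dilation(square, structure)   # grows the square to 7 x 7

# Opening (erosion followed by dilation) removes an isolated speck.
speckled = square.copy()
speckled[0, 8] = True
opened = binary_opening(speckled, structure)

# Closing (dilation followed by erosion) fills a one-pixel hole.
holed = square.copy()
holed[4, 4] = False
closed = binary_closing(holed, structure)

print(eroded.sum(), dilated.sum())
print(numpy.array_equal(opened, square), numpy.array_equal(closed, square))
```

In both the opening and the closing case the clean square is recovered, which is the behavior that makes these two operations useful for denoising and hole removal.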
For instance, for the location of the letter e in a text, we could use the following command instead:

    >>> binary_hit_or_miss(text, letterE)

For comparative purposes, let's use this command in the following code snippet:

    >>> import numpy
    >>> import scipy.ndimage
    >>> import matplotlib.pylab as plt
    >>> from scipy.ndimage.morphology import binary_hit_or_miss
    >>> text = scipy.ndimage.imread('CHAP_05_input_textImage.png')
    >>> letterE = text[37:53,275:291]
    >>> HitorMiss = binary_hit_or_miss(text, structure1=letterE, origin1=1)
    >>> eLocation = numpy.where(HitorMiss==True)
    >>> x=eLocation[1]; y=eLocation[0]
    >>> plt.imshow(text, cmap=plt.cm.gray, interpolation='nearest')
    >>> plt.autoscale(False)
    >>> plt.plot(x,y,'wo',markersize=10)
    >>> plt.axis('off')
    >>> plt.show()

The output of the preceding lines of code is the text image with a white marker at each detected location of the letter e.

For gray-scale images, we may use a structuring element (structuring_element) or a footprint. The syntax is, therefore, a little different:

    grey_operation(signal, [structuring_element, footprint, size, ...])

If we desire to use a completely flat and rectangular structuring element (all ones), then it is enough to indicate the size as a tuple. For instance, to perform gray-scale dilation with a flat element of size (15,15) on our classical image of Lena, we issue the following command:

    >>> grey_dilation(lena, size=(15,15))

The last kind of morphological operations coded in the scipy.ndimage module perform distance and feature transforms. In scipy.ndimage, distance transforms create a map that assigns to each foreground (non-zero) pixel the distance to the nearest background (zero) pixel. Feature transforms provide the index of the closest background element instead. These operations are used to decompose images into different labels. We may even choose different metrics such as Euclidean distance, chessboard distance, and taxicab distance.
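To see how the choice of metric changes a distance transform, consider a tiny image with a single background pixel at its center. The sketch below is illustrative only (the 5 x 5 image is made up); it uses distance_transform_bf for the three metrics and distance_transform_edt for the feature transform.

```python
import numpy
from scipy.ndimage import distance_transform_bf, distance_transform_edt

# Foreground (ones) everywhere except one background pixel at the center.
image = numpy.ones((5, 5), dtype=int)
image[2, 2] = 0

euclid = distance_transform_bf(image, metric='euclidean')
taxicab = distance_transform_bf(image, metric='taxicab')
chess = distance_transform_bf(image, metric='chessboard')

# The corner (0, 0) is two rows and two columns away from (2, 2):
# Euclidean sqrt(2**2 + 2**2), taxicab 2 + 2, chessboard max(2, 2).
print(euclid[0, 0], taxicab[0, 0], chess[0, 0])

# Feature transform: for every pixel, the index of the closest
# background element (here always row 2, column 2).
indices = distance_transform_edt(image, return_distances=False,
                                 return_indices=True)
print(indices.shape)
```

The indices array has one leading entry per image dimension, so indices[0] holds row coordinates and indices[1] holds column coordinates of the nearest background pixel.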
The syntax for the distance transform using a brute-force algorithm is as follows:

distance_transform_bf(signal, metric='euclidean', sampling=None, return_distances=True, return_indices=False, distances=None, indices=None)

We indicate the metric with one of the strings 'euclidean', 'taxicab', or 'chessboard'. If we want the feature transform instead, we switch return_distances to False and return_indices to True. Similar routines are available with more sophisticated algorithms: distance_transform_cdt uses chamfering for taxicab and chessboard distances, and distance_transform_edt handles the Euclidean distance. All of these use the same syntax.

Summary

In this article, we explored signal processing in any number of dimensions, including the treatment of signals in frequency space by means of their Discrete Fourier Transforms. These topics correspond to the fftpack, signal, and ndimage modules.

Packt
03 Mar 2015
18 min read

PostgreSQL as an Extensible RDBMS

This article by Usama Dar, the author of the book PostgreSQL Server Programming - Second Edition, explains the process of creating a new operator, overloading it, optimizing it, creating index access methods, and much more.

PostgreSQL is an extensible database, and it is extensible by virtue of its design. PostgreSQL uses a catalog-driven design. In fact, PostgreSQL is more catalog-driven than most traditional relational databases. The key benefit here is that the catalogs can be changed or added to in order to modify or extend the database functionality. PostgreSQL also supports dynamic loading, that is, user-written code can be provided as a shared library, and PostgreSQL will load it as required.

Extensibility is critical for many businesses, which have needs that are specific to that business or industry. Sometimes, the tools provided by traditional database systems do not fulfill those needs. People in those businesses know best how to solve their particular problems, but they are not experts in database internals. It is often not possible for them to write their own database kernel, or to modify the core and customize it according to their needs. A truly extensible database will then allow you to do the following:

  • Solve domain-specific problems in a seamless way, like a native solution
  • Build complete features without modifying the core database engine
  • Extend the database without interrupting availability

PostgreSQL allows you to do all of the preceding things, and more, with relative ease.
In terms of extensibility, you can do the following things in a PostgreSQL database:

  • Create your own data types
  • Create your own functions
  • Create your own aggregates
  • Create your own operators
  • Create your own index access methods (operator classes)
  • Create your own server programming language
  • Create foreign data wrappers (SQL/MED) and foreign tables

What can't be extended?

Although PostgreSQL is an extensible platform, at the time of writing there are certain things that you can't do or change without explicitly creating a fork:

  • You can't change or plug in a new storage engine. If you are coming from the MySQL world, this might annoy you a little. However, PostgreSQL's storage engine is tightly coupled with its executor and the rest of the system, which has its own benefits.
  • You can't plug in your own planner/parser. One can argue for and against the ability to do that, but at the moment, the planner, parser, optimizer, and so on are baked into the system and there is no possibility of replacing them. There has been some talk on this topic, and if you are of the curious kind, you can read some of the discussion at http://bit.ly/1yRMkK7.

We will now briefly discuss some more of the extensibility capabilities of PostgreSQL. We will not dive deep into the topics, but we will point you to the appropriate link where more information can be found.

Creating a new operator

Now, let's take a look at how we can add a new operator in PostgreSQL. Adding new operators is not too different from adding new functions. In fact, an operator is syntactically just a different way to use an existing function. For example, for numeric values the + operator calls a built-in function called numeric_add and passes it the two arguments. When you define a new operator, you must define the data types that the operator expects as arguments and define which function is to be called. Let's take a look at how to define a simple operator. You have to use the CREATE OPERATOR command to create an operator.
Let's use that function, fib, to create a new Fibonacci operator, ##, which will have an integer on its left-hand side:

CREATE OPERATOR ## (PROCEDURE=fib, LEFTARG=integer);

Now, you can use this operator in your SQL to calculate a Fibonacci number:

testdb=# SELECT 12##;
 ?column?
----------
      144
(1 row)

Note that we defined the operator to have an integer on the left-hand side. If you try to put a value on the right-hand side of the operator, you will get an error:

postgres=# SELECT ##12;
ERROR: operator does not exist: ## integer at character 8
HINT: No operator matches the given name and argument type(s). You might need to add explicit type casts.
STATEMENT: select ##12;
ERROR: operator does not exist: ## integer
LINE 1: select ##12;
               ^
HINT: No operator matches the given name and argument type(s). You might need to add explicit type casts.

Note that postfix operators such as 12## were removed in PostgreSQL 14, so the left-argument-only definition works only on earlier releases.

Overloading an operator

Operators can be overloaded in the same way as functions. This means that an operator can have the same name as an existing operator but with a different set of argument types. More than one operator can have the same name, but two operators can't share the same name if they accept the same types and positions of the arguments. As long as there is a function that accepts the same kind and number of arguments that an operator defines, it can be overloaded.

Let's overload the ## operator we defined in the last example, and add the ability to provide an integer on the right-hand side of the operator:

CREATE OPERATOR ## (PROCEDURE=fib, RIGHTARG=integer);

Now, running the same SQL that resulted in an error last time should succeed, as shown here:

testdb=# SELECT ##12;
 ?column?
----------
      144
(1 row)

You can drop the operator using the DROP OPERATOR command. You can read more about creating and overloading new operators in the PostgreSQL documentation at http://www.postgresql.org/docs/current/static/sql-createoperator.html and http://www.postgresql.org/docs/current/static/xoper.html.
There are several optional clauses in the operator definition that can optimize the execution time of the operators by providing information about operator behavior. For example, you can specify the commutator and the negator of an operator, which help the planner use the operators in index scans. You can read more about these optional clauses at http://www.postgresql.org/docs/current/static/xoper-optimization.html. Since this article is just an introduction to the additional extensibility capabilities of PostgreSQL, we will only introduce a couple of optimization options; any serious production-quality operator definition should include these optimization clauses, if applicable.

Optimizing operators

The optional clauses tell the PostgreSQL server how the operators behave. These options can result in considerable speedups in the execution of queries that use the operator. However, if you provide these options incorrectly, the result can be slower queries or subtly wrong output. Let's take a look at two optimization clauses called commutator and negator.

COMMUTATOR

This clause defines the commutator of the operator. An operator A is a commutator of operator B if it fulfills the following condition: x A y = y B x. It is important to provide this information for the operators that will be used in indexes and joins. As an example, the commutator for > is <, and the commutator of = is = itself. This helps the optimizer to flip the operator in order to use an index. For example, consider the following query:

SELECT * FROM employee WHERE new_salary > salary;

If the index is defined on the salary column, then PostgreSQL can rewrite the preceding query as shown:

SELECT * FROM employee WHERE salary < new_salary;

This allows PostgreSQL to use a range scan on the index column salary.
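The condition x A y = y B x can be checked on a handful of sample values. The following Python sketch does this with the standard operator module; the helper name is_commutator_pair and the sample salaries are assumptions made for this illustration, and passing the check on a sample only demonstrates the property, it does not prove it.

```python
import operator
from itertools import product


def is_commutator_pair(op_a, op_b, values):
    """Return True if x A y == y B x for every pair drawn from values."""
    return all(op_a(x, y) == op_b(y, x) for x, y in product(values, repeat=2))


# Hypothetical sample data, including a duplicate so that x == y is exercised.
salaries = [1000, 2500, 2500, 4000]

gt_and_lt = is_commutator_pair(operator.gt, operator.lt, salaries)
eq_and_eq = is_commutator_pair(operator.eq, operator.eq, salaries)
gt_and_ge = is_commutator_pair(operator.gt, operator.ge, salaries)
```

Here gt_and_lt and eq_and_eq are True, while gt_and_ge is False because the two sides disagree whenever x equals y. Declaring >= as the commutator of > would therefore be exactly the kind of incorrect hint that misleads the planner.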
For a user-defined operator, the optimizer can only do this flip if the commutator of the user-defined operator is defined:

CREATE OPERATOR > (LEFTARG=integer, RIGHTARG=integer, PROCEDURE=comp, COMMUTATOR = <);

NEGATOR

The negator clause defines the negator of the operator. For example, <> is a negator of =. Consider the following query:

SELECT * FROM employee WHERE NOT (dept = 10);

Since <> is defined as a negator of =, the optimizer can simplify the preceding query as follows:

SELECT * FROM employee WHERE dept <> 10;

You can even verify that using the EXPLAIN command:

postgres=# EXPLAIN SELECT * FROM employee WHERE NOT dept = 'WATER MGMNT';
                       QUERY PLAN
---------------------------------------------------------
 Foreign Scan on employee (cost=0.00..1.10 rows=1 width=160)
   Filter: ((dept)::text <> 'WATER MGMNT'::text)
   Foreign File: /Users/usamadar/testdata.csv
   Foreign File Size: 197
(4 rows)

Creating index access methods

Let's discuss how to index new data types or user-defined types and operators. In PostgreSQL, an index is more of a framework that can be extended or customized to use different strategies. In order to create new index access methods, we have to create an operator class. Let's take a look at a simple example.

Let's consider a scenario where you have to store some special data such as an ID or a social security number in the database. The number may contain non-numeric characters, so it is defined as a text type:

CREATE TABLE test_ssn (ssn text);
INSERT INTO test_ssn VALUES ('222-11-020878');
INSERT INTO test_ssn VALUES ('111-11-020978');

Let's assume that the correct order for this data is such that it should be sorted on the last six digits and not the ASCII value of the string. The fact that these numbers need a custom sort order presents a challenge when it comes to indexing the data. This is where PostgreSQL operator classes are useful. An operator class allows a user to create a custom indexing strategy.
Creating an indexing strategy is about creating your own operators and using them alongside a normal B-tree. Let's start by writing a function that changes the order of digits in the value and also gets rid of the non-numeric characters in the string, so that the values can be compared more easily:

CREATE OR REPLACE FUNCTION fix_ssn(text)
RETURNS text AS $$
BEGIN
  RETURN substring($1,8) || replace(substring($1,1,7),'-','');
END;
$$
LANGUAGE 'plpgsql' IMMUTABLE;

Let's run the function and verify that it works:

testdb=# SELECT fix_ssn(ssn) FROM test_ssn;
   fix_ssn
-------------
 02087822211
 02097811111
(2 rows)

Before an index can be used with a new strategy, we may have to define some more functions depending on the type of index. In our case, we are planning to use a simple B-tree, so we need a comparison function:

CREATE OR REPLACE FUNCTION ssn_compareTo(text, text)
RETURNS int AS
$$
BEGIN
  IF fix_ssn($1) < fix_ssn($2)
  THEN
    RETURN -1;
  ELSIF fix_ssn($1) > fix_ssn($2)
  THEN
    RETURN +1;
  ELSE
    RETURN 0;
  END IF;
END;
$$ LANGUAGE 'plpgsql' IMMUTABLE;

It's now time to create our operator class:

CREATE OPERATOR CLASS ssn_ops
FOR TYPE text USING btree
AS
  OPERATOR 1 < ,
  OPERATOR 2 <= ,
  OPERATOR 3 = ,
  OPERATOR 4 >= ,
  OPERATOR 5 > ,
  FUNCTION 1 ssn_compareTo(text, text);

If you need to compare the values in a special way, you can also overload the comparison operators themselves, base them on the same logic as the compareTo function, and provide them in the CREATE OPERATOR CLASS command.
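To see the ordering that fix_ssn and ssn_compareTo are meant to produce, the same logic can be mirrored outside the database. The following Python sketch is only an approximation for checking the expected order: the names fix_ssn and ssn_compare_to are Python stand-ins written for this illustration, and plain Python string comparison is assumed to agree with the database's text comparison for these digit-only keys.

```python
from functools import cmp_to_key


def fix_ssn(ssn):
    """Mirror of the SQL function: move the last six digits to the front.

    substring($1, 8) is 1-based in SQL, so it corresponds to ssn[7:] here,
    and substring($1, 1, 7) corresponds to ssn[:7].
    """
    return ssn[7:] + ssn[:7].replace("-", "")


def ssn_compare_to(left, right):
    """Three-way comparison on the rearranged values, like ssn_compareTo."""
    left_key, right_key = fix_ssn(left), fix_ssn(right)
    if left_key < right_key:
        return -1
    if left_key > right_key:
        return 1
    return 0


rows = ["222-11-020878", "111-11-020978"]

fixed = [fix_ssn(value) for value in rows]
plain_order = sorted(rows)                                  # ordinary string order
custom_order = sorted(rows, key=cmp_to_key(ssn_compare_to))  # order by last six digits
```

With the two sample rows, the plain string order puts '111-11-020978' first, while the custom order puts '222-11-020878' first because 020878 sorts before 020978. This is the order a B-tree index built on the comparison function is intended to follow.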
We will now create our first index using our brand new operator class:

CREATE INDEX idx_ssn ON test_ssn (ssn ssn_ops);

We can check whether the optimizer is willing to use our special index, as follows:

testdb=# SET enable_seqscan=off;
testdb=# EXPLAIN SELECT * FROM test_ssn WHERE ssn = '02087822211';
                            QUERY PLAN
------------------------------------------------------------------
 Index Only Scan using idx_ssn on test_ssn (cost=0.13..8.14 rows=1 width=32)
   Index Cond: (ssn = '02087822211'::text)
(2 rows)

Therefore, we can confirm that the optimizer is able to use our new index. You can read about index access methods in the PostgreSQL documentation at http://www.postgresql.org/docs/current/static/xindex.html.

Creating user-defined aggregates

User-defined aggregate functions are one of PostgreSQL's more distinctive features, yet they are quite obscure and perhaps not many people know how to create them. However, once you are able to create one, you will wonder how you have lived for so long without this feature. This functionality can be incredibly useful, because it allows you to perform custom aggregates inside the database, instead of fetching all the data to the client and doing a custom aggregate in your application code, for example, the number of hits on your website per minute from a specific country.

PostgreSQL has a very simple process for defining aggregates. Aggregates can be defined using any functions, written in any language that is installed in the database. Here are the basic steps to building an aggregate function in PostgreSQL:

1. Define a start function (the state transition function, SFUNC) that will take in the values of a result set; this function can be defined in any PL language you want.
2. Define an end function (the final function, FINALFUNC) that will do something with the final output of the start function. This can be in any PL language you want.
3. Define the aggregate using the CREATE AGGREGATE command, providing the start and end functions you just created.
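The three steps above can be sketched outside the database as a fold followed by a finishing step. The following Python sketch is only an analogy for how an aggregate consumes rows: the names average_state, average_final, and run_aggregate are invented for this illustration, and an average is used as the example because it needs both a running state and a final calculation.

```python
from functools import reduce


def average_state(state, value):
    """Start (state transition) function: fold one row into the running state."""
    total, count = state
    if value is None:
        # Skip NULL-like values in this sketch.
        return state
    return (total + value, count + 1)


def average_final(state):
    """End (final) function: turn the accumulated state into the result."""
    total, count = state
    if count == 0:
        return None
    return total / count


def run_aggregate(rows, start_function, end_function, initial_state):
    """Apply the start function to every row, then the end function once."""
    return end_function(reduce(start_function, rows, initial_state))


hits_per_minute = [10, 20, None, 60]
average_hits = run_aggregate(hits_per_minute, average_state, average_final, (0, 0))
```

The initial state (0, 0) plays the role that INITCOND plays in CREATE AGGREGATE. For the sample rows, the state ends up as (90, 3) and the final step returns 30.0.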
Let's borrow an example from the PostgreSQL wiki at http://wiki.postgresql.org/wiki/Aggregate_Median. In this example, we will calculate the statistical median of a set of data. For this purpose, we will define start and end aggregate functions. Let's define the end function first, which takes an array as a parameter and calculates the median. We are assuming here that our start function will pass an array to the following end function:

CREATE FUNCTION _final_median(anyarray) RETURNS float8 AS $$
  WITH q AS
  (
    SELECT val
    FROM unnest($1) val
    WHERE VAL IS NOT NULL
    ORDER BY 1
  ),
  cnt AS
  (
    SELECT COUNT(*) AS c FROM q
  )
  SELECT AVG(val)::float8
  FROM
  (
    SELECT val FROM q
    LIMIT 2 - MOD((SELECT c FROM cnt), 2)
    OFFSET GREATEST(CEIL((SELECT c FROM cnt) / 2.0) - 1,0)
  ) q2;
$$ LANGUAGE sql IMMUTABLE;

Now, we create the aggregate as shown in the following code:

CREATE AGGREGATE median(anyelement) (
  SFUNC=array_append,
  STYPE=anyarray,
  FINALFUNC=_final_median,
  INITCOND='{}'
);

The array_append start function is already defined in PostgreSQL. This function appends an element to the end of an array. In our example, the start function takes all the column values and creates an intermediate array. This array is passed on to the end function, which calculates the median.

Now, let's create a table and some test data to run our function:

testdb=# CREATE TABLE median_test(t integer);
CREATE TABLE
testdb=# INSERT INTO median_test SELECT generate_series(1,10);
INSERT 0 10

The generate_series function is a set-returning function that generates a series of values from start to stop, with a step size of one. Now, we are all set to test the function:

testdb=# SELECT median(t) FROM median_test;
 median
--------
    5.5
(1 row)

The mechanics of the preceding example are quite easy to understand. When you run the aggregate, the start function is used to append all the table data from column t into an array using the array_append PostgreSQL built-in.
This array is passed on to the final function, _final_median, which calculates the median of the array and returns the result as a float8 value. This process is transparent to the user of the function, who simply has a convenient aggregate function available.

You can read more about user-defined aggregates in the PostgreSQL documentation at http://www.postgresql.org/docs/current/static/xaggr.html.

Using foreign data wrappers

PostgreSQL foreign data wrappers (FDW) are an implementation of SQL Management of External Data (SQL/MED), which is a standard added to SQL in 2003. FDWs are drivers that allow PostgreSQL database users to read and write data in external data sources, such as other relational databases, NoSQL data sources, files, JSON, LDAP, and even Twitter. You can query the foreign data sources using SQL and create joins across different systems or even across different data sources.

There are several different types of data wrappers developed by different developers, and not all of them are production quality. You can see a select list of wrappers on the PostgreSQL wiki at http://wiki.postgresql.org/wiki/Foreign_data_wrappers. Another list of FDWs can be found on PGXN at http://pgxn.org/tag/fdw/.

Let's take a look at a small example of using file_fdw to access data in a CSV file. First, you need to install the file_fdw extension. If you compiled PostgreSQL from the source, you will need to install the file_fdw contrib module that is distributed with the source. You can do this by going into the contrib/file_fdw folder and running make and make install. If you used an installer or a package for your platform, this module might have been installed automatically.
Once the file_fdw module is installed, you will need to create the extension in the database:

postgres=# CREATE EXTENSION file_fdw;
CREATE EXTENSION

Let's now create a sample CSV file that uses the pipe, |, as a separator and contains some employee data:

$ cat testdata.csv
AARON, ELVIA J|WATER RATE TAKER|WATER MGMNT|81000.00|73862.00
AARON, JEFFERY M|POLICE OFFICER|POLICE|74628.00|74628.00
AARON, KIMBERLEI R|CHIEF CONTRACT EXPEDITER|FLEET MANAGEMNT|77280.00|70174.00

Now, we should create a foreign server, which is pretty much a formality here because the file is on the same server. A foreign server normally contains the connection information that a foreign data wrapper uses to access an external data resource. The server name needs to be unique within the database:

CREATE SERVER file_server FOREIGN DATA WRAPPER file_fdw;

The next step is to create a foreign table that encapsulates our CSV file:

CREATE FOREIGN TABLE employee (
  emp_name VARCHAR,
  job_title VARCHAR,
  dept VARCHAR,
  salary NUMERIC,
  sal_after_tax NUMERIC
) SERVER file_server
OPTIONS (format 'csv', header 'false', filename '/home/pgbook/14/testdata.csv', delimiter '|', null '');

The CREATE FOREIGN TABLE command creates a foreign table, and the specifications of the file are provided in the OPTIONS section of the preceding code. You provide the format and state whether the first line of the file is a header; in our case there is no file header, so we use header 'false'. We then provide the name and path of the file and the delimiter used in the file, which in our case is the pipe symbol |. In this example, we also specify that null values are represented as an empty string.
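To see what these options describe, it can help to parse the same data with an ordinary CSV reader. The following Python sketch is not how file_fdw works internally; it only mirrors the three choices made in OPTIONS (pipe delimiter, no header row, empty string treated as null). The function name read_employees and the constant names are assumptions made for this illustration, and the sample rows are copied from the article's data.

```python
import csv
import io
from decimal import Decimal

# Sample rows from the article, held in memory instead of a file on disk.
SAMPLE = """AARON, ELVIA J|WATER RATE TAKER|WATER MGMNT|81000.00|73862.00
AARON, JEFFERY M|POLICE OFFICER|POLICE|74628.00|74628.00
AARON, KIMBERLEI R|CHIEF CONTRACT EXPEDITER|FLEET MANAGEMNT|77280.00|70174.00
"""

# Column names in the same order as the foreign table definition.
COLUMNS = ("emp_name", "job_title", "dept", "salary", "sal_after_tax")
NUMERIC_COLUMNS = {"salary", "sal_after_tax"}


def read_employees(stream):
    """Parse pipe-delimited rows with no header; empty fields become None."""
    rows = []
    for fields in csv.reader(stream, delimiter="|"):
        row = {}
        for name, raw in zip(COLUMNS, fields):
            if raw == "":
                row[name] = None            # mirrors null ''
            elif name in NUMERIC_COLUMNS:
                row[name] = Decimal(raw)    # mirrors the NUMERIC columns
            else:
                row[name] = raw
        rows.append(row)
    return rows


employees = read_employees(io.StringIO(SAMPLE))
```

Because there is no header row, the first line is treated as data, which is what header 'false' states. Note that the commas inside the names cause no trouble, since the delimiter is the pipe symbol.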
Let's run a SQL command on our foreign table:

postgres=# select * from employee;
-[ RECORD 1 ]-+-------------------------
emp_name      | AARON, ELVIA J
job_title     | WATER RATE TAKER
dept          | WATER MGMNT
salary        | 81000.00
sal_after_tax | 73862.00
-[ RECORD 2 ]-+-------------------------
emp_name      | AARON, JEFFERY M
job_title     | POLICE OFFICER
dept          | POLICE
salary        | 74628.00
sal_after_tax | 74628.00
-[ RECORD 3 ]-+-------------------------
emp_name      | AARON, KIMBERLEI R
job_title     | CHIEF CONTRACT EXPEDITER
dept          | FLEET MANAGEMNT
salary        | 77280.00
sal_after_tax | 70174.00

Great, it looks like our data is successfully loaded from the file. You can also use the \d meta command to see the structure of the employee table:

postgres=# \d employee;
            Foreign table "public.employee"
    Column     |       Type        | Modifiers | FDW Options
---------------+-------------------+-----------+-------------
 emp_name      | character varying |           |
 job_title     | character varying |           |
 dept          | character varying |           |
 salary        | numeric           |           |
 sal_after_tax | numeric           |           |
Server: file_server
FDW Options: (format 'csv', header 'false', filename '/home/pgbook/14/testdata.csv', delimiter '|', "null" '')

You can run EXPLAIN on the query to understand what is going on when you run a query on the foreign table:

postgres=# EXPLAIN SELECT * FROM employee WHERE salary > 5000;
                       QUERY PLAN
---------------------------------------------------------
 Foreign Scan on employee (cost=0.00..1.10 rows=1 width=160)
   Filter: (salary > 5000::numeric)
   Foreign File: /home/pgbook/14/testdata.csv
   Foreign File Size: 197
(4 rows)

The ALTER FOREIGN TABLE command can be used to modify the options. More information about file_fdw is available at http://www.postgresql.org/docs/current/static/file-fdw.html. You can take a look at the CREATE SERVER and CREATE FOREIGN TABLE commands in the PostgreSQL documentation for more information on the many options available. Each of the foreign data wrappers comes with its own documentation about how to use the wrapper.
Make sure that an extension is stable enough before it is used in production. The PostgreSQL core development group does not support most of the FDW extensions. If you want to create your own data wrappers, the documentation at http://www.postgresql.org/docs/current/static/fdwhandler.html is an excellent starting point. The best way to learn, however, is to read the code of other available extensions.

Summary

PostgreSQL's extensibility includes the ability to add new operators and new index access methods, and to create your own aggregates. You can access foreign data sources, such as other databases, files, and web services, using PostgreSQL foreign data wrappers. These wrappers are provided as extensions and should be used with caution, as most of them are not officially supported. Even though PostgreSQL is very extensible, you can't plug in a new storage engine or change the parser/planner and executor interfaces. These components are very tightly coupled with each other and are, therefore, highly optimized and mature.

Packt
03 Mar 2015
15 min read

Central Air and Heating Thermostat

In this article by Andrew K. Dennis, author of the book Raspberry Pi Home Automation with Arduino Second Edition, you will learn how to build a thermostat device using an Arduino. You will also learn how to use the temperature data to switch relays on and off. Relays are the main components that you can use for interaction between your Arduino and high-voltage electronic devices. The thermostat will also provide a web interface so that you can connect to it and check the temperature.

Introducing the thermostat

A thermostat is a control device that is used to manipulate other devices based on a temperature setting. This temperature setting is known as the setpoint. When the temperature changes in relation to the setpoint, a device can be switched on or off.

For example, let's imagine a system where a simple thermostat is set to switch an electric heater on when the temperature drops below 25 degrees Celsius. Within our thermostat, we have a temperature-sensing device such as a thermistor that returns a temperature reading every few seconds. When the thermistor reads a temperature below the setpoint (25 degrees Celsius), the thermostat will switch a relay on, completing the circuit between the wall plug and our electric heater and providing it with power. Thus, we can see that a simple electronic thermostat can be used to switch on a variety of devices.

Warren S. Johnson, a college professor in Wisconsin, is credited with inventing the electric room thermostat in the 1880s. Johnson was known throughout his lifetime as a prolific inventor who worked in a variety of fields, including electricity. These electric room thermostats became a common feature in homes over the course of the twentieth century as larger parts of the world were hooked up to the electricity grid.

Now, with open hardware electronic tools such as the Arduino available, we can build custom thermostats for a variety of home projects.
They can be used to control baseboard heaters, heat lamps, and air conditioner units. They can also be used for the following: Fish tank heaters Indoor gardens Electric heaters Fans Now that we have explored the uses of thermostats, let's take a look at our project. Setting up our hardware In the following examples, we will list the pins to which you need to connect your hardware. However, we recommend that when you purchase any device such as the Ethernet shield, you check whether certain pins are available or not. Due to the sheer range of hardware available, it is not possible to list every potential hardware combination. Therefore, if the pin in the example is not free, you can update the circuit and source code to use a different pin. When building the example, we also recommend using a breadboard. This will allow you to experiment with building your circuit without having to solder any components. Our first task will be to set up our thermostat device so that it has Ethernet access. Adding the Ethernet shield The Arduino Uno does not contain an Ethernet port. Therefore, you will need a way for your thermostat to be accessible on your home network. One simple solution is to purchase an Ethernet shield and connect it to your microcontroller. There are several shields in the market, including the Arduino Ethernet shield (http://arduino.cc/en/Main/ArduinoEthernetShield) and Seeed Ethernet shield (http://www.seeedstudio.com/wiki/Ethernet_Shield_V1.0). These shields are plugged into the GPIO pins on the Arduino. If you purchase one of these shields, then we would also recommend buying some extra GPIO headers. These are plugged into the existing headers attached to the Ethernet shield. Their purpose is to provide some extra clearance above the Ethernet port on the board so that you can connect other shields in future if you decide to purchase them. Take a board of your choice and attach it to the Arduino Uno. 
When you plug the USB cable into your microcontroller and into your computer, the lights on both the Uno and Ethernet shield should light up. Now our device has a medium to send and receive data over a LAN. Let's take a look at setting up our thermostat relays.

Relays

A relay is a type of switch controlled by an electromagnet. It allows us to use a small amount of power to control a much larger amount, for example, using a 9V power supply to switch 220V wall power. Relays are rated to work with different voltages and currents.

A relay has three contact points: Normally Open, Common Connection, and Normally Closed. Two of these points will be wired up to our fan. In the context of an Arduino project, the relay will also have a pin for ground, a pin for 5V power, and a data pin that is used to switch the relay on and off.

A popular choice for a relay is the Pololu Basic SPDT Relay Carrier. This can be purchased from http://www.pololu.com/category/135/relay-modules. This relay has featured in some other Packt Publishing books on the Arduino, so it is a good investment.

Once you have the relay, you need to wire it up to the microcontroller. Connect a wire from the relay to digital pin 5 on the Arduino, another wire to the GND pin, and the final wire to the 5V pin. This completes the relay setup. In order to control relays though, we need some data to trigger switching them between on and off. Our thermistor device handles the task of collecting this data.

Connecting the thermistor

A thermistor is an electronic component that, when included in a circuit, can be used to measure temperature. The device is a type of resistor whose resistance varies as the temperature changes. It can be found in a variety of devices, including thermostats and electronic thermometers. There are two categories of thermistors available: Negative Temperature Coefficient (NTC) and Positive Temperature Coefficient (PTC).
The difference between them is that as the temperature increases, the resistance decreases in the case of an NTC and increases in the case of a PTC.

We are going to use a prebuilt digital device with the model number AM2302. This can be purchased at https://www.adafruit.com/products/393. This device reads both temperature and humidity. It also comes with a software library that you can use in your Arduino sketches. One of the benefits of this library is that many functions that precompute values, such as temperature in Celsius, are available and thus don't require you to write a lot of code.

Take your AM2302 and connect it to the GND pin, 5V pin, and digital pin 4. The following diagram shows how it should be set up:

You are now ready to move on to creating the software to test for temperature readings.

Setting up our software

We now need to write an application in the Arduino IDE to control our new thermostat device. Our software will contain the following:

  • The code responsible for collecting the temperature data
  • Methods to switch relays on and off based on this data
  • Code to handle accepting incoming HTTP requests so that we can view our thermostat's current temperature reading and change the setpoint
  • A method to send our temperature readings to the Raspberry Pi

The next step is to hook up our Arduino thermostat to the USB port of the device we installed the IDE on. You may need to temporarily disconnect your relay from the Arduino. This will prevent your thermostat device from drawing too much power from your computer's USB port, which may result in the port being disabled.

We now need to download the DHT library that interacts with our AM2302. This can be found on GitHub, at https://github.com/adafruit/DHT-sensor-library. Click on the Download ZIP link and unzip the file to a location on your hard drive. Next, we need to install the library to make it accessible from our sketch:

1. Open the Arduino IDE.
2. Navigate to Sketch | Import Library.
3. Next, click on Add library.
4. Choose the folder on your hard drive.
5. You can now use the library.

With the library installed, we can include it in our sketch and access a number of useful functions. Let's now start creating our software.

Thermostat software

We can start adding some code to the Arduino to control our thermostat. Open a new sketch in the Arduino IDE and perform the following steps.

Inside the sketch, we are going to start by adding the code to include the libraries we need to use. At the top of the sketch, add the following code:

#include "DHT.h" // Include this if using the AM2302
#include <SPI.h>
#include <Ethernet.h>

Next, we will declare some variables to be used by our application. These will be responsible for defining:

  • The pin the AM2302 sensor is located on
  • The relay pin
  • The IP address we want our Arduino to use, which should be unique
  • The MAC address of the Arduino, which should also be unique
  • The name of the room the thermostat is located in
  • The variables responsible for Ethernet communication

The IP address will depend on your own home network. Check your router to see what range of IP addresses is available. Select an address that isn't in use and update the ip variable as follows:

#define DHTPIN 4 // The digital pin to read from
#define DHTTYPE DHT22 // DHT 22 (AM2302)

unsigned char relay = 5; // The relay pin
String room = "library";
byte mac[] = { 0xDE, 0xAD, 0xBE, 0xEF, 0xFE, 0xED };
IPAddress ip(192,168,3,5);
DHT dht(DHTPIN, DHTTYPE);
EthernetServer server(80);
EthernetClient client;

We can now include the setup() function. This is responsible for starting the serial port, the Ethernet connection, the web server, and the sensor, and for setting the pin to which our relay is connected to output mode:

void setup() {
  Serial.begin(9600);
  Ethernet.begin(mac, ip);
  server.begin();
  dht.begin();
  pinMode(relay, OUTPUT);
}

The next block of code we will add is the loop() function.
This contains the main body of our program to be executed. Here, we will assign a value to the setpoint and grab our temperature readings:

void loop() {
  int setpoint = 25;
  float h = dht.readHumidity();
  float t = dht.readTemperature();

Following this, we check whether the temperature is above or below the setpoint and switch the relay on or off as needed. Paste this code below the variables you just added:

  if (t < setpoint) {
    digitalWrite(relay, HIGH);
  } else {
    digitalWrite(relay, LOW);
  }

Next, we need to handle the HTTP requests to the thermostat. We start by collecting all of the incoming data. The following code also goes inside the loop() function:

  client = server.available();
  if (client) {
    // an http request ends with a blank line
    boolean currentLineIsBlank = true;
    String result;
    while (client.connected()) {
      if (client.available()) {
        char c = client.read();
        result = result + c;
      }

With the incoming request stored in the result variable, we can examine the HTTP header to know whether we are requesting an HTML page or a JSON object. You'll learn more about JavaScript Object Notation (JSON) shortly. If we request an HTML page, this is displayed in the browser. Next, add the following code to your sketch:

      if (result.indexOf("text/html") > -1) {
        client.println("HTTP/1.1 200 OK");
        client.println("Content-Type: text/html");
        client.println();
        if (isnan(h) || isnan(t)) {
          client.println("Failed to read from DHT sensor!");
          return;
        }
        client.print("<b>Thermostat</b> set to: ");
        client.print(setpoint);
        client.print("degrees C <br />Humidity: ");
        client.print(h);
        client.print(" %\t");
        client.print("<br />Temperature: ");
        client.print(t);
        client.println(" degrees C ");
        break;
      }

The following code handles a request for the data to be returned in JSON format. Our Raspberry Pi will make HTTP requests to the Arduino, and then process the data returned to it.
At the bottom of this last block of code is a statement adding a short delay to allow the Arduino to process the request and close the client connection. Paste this final section of code in your sketch:

      if (result.indexOf("application/json") > -1) {
        client.println("HTTP/1.1 200 OK");
        client.println("Content-Type: application/json;charset=utf-8");
        client.println("Server: Arduino");
        client.println("Connection: close");
        client.println();
        // Quotes inside the string literals are escaped with a backslash
        client.print("{\"thermostat\":[{\"location\":\"");
        client.print(room);
        client.print("\"},");
        client.print("{\"temperature\":\"");
        client.print(t);
        client.print("\"},");
        client.print("{\"humidity\":\"");
        client.print(h);
        client.print("\"},");
        client.print("{\"setpoint\":\"");
        client.print(setpoint);
        client.print("\"}");
        client.print("]}");
        client.println();
        break;
      }
    }
    delay(1);
    client.stop();
  }
}

This completes our program. We can now save it and run the Verify process. Click on the small check mark in a circle located in the top-left corner of the sketch. If you have added all of the code correctly, you should see Binary sketch size: 16,962 bytes (of a 32,256 byte maximum).

Now that our code is verified and saved, we can look at uploading it to the Arduino, attaching the fan, and testing our thermostat.

Testing our thermostat and fan

We have our hardware set up and the code ready. Now we can test the thermostat and see it in action with a device connected to the mains electricity. We will first attach a fan and then run the sketch to switch it on and off.

Attaching the fan

Ensure that your Arduino is powered down and that the fan is not plugged into the wall. Using a wire stripper and cutters, cut one side of the cable that connects the plug to the fan body. Take the end of the cable attached to the plug, and attach it to the NO point on the relay. Use a screwdriver to ensure that it is fastened correctly. Now, take the other portion of the cut cable that is attached to the fan body, and attach this to the COM point.
Once again, use a screwdriver to ensure that it is fastened securely to the relay. Your connection should look as follows:

You can now reattach your Arduino to the computer via its USB cable. However, do not plug the fan into the wall yet.

Starting your thermostat application

With the fan connected to our relay, we can upload our sketch and test it:

From the Arduino IDE, select the upload icon.
Once the code has been uploaded, disconnect your Arduino board.
Next, connect an Ethernet cable to your Arduino.
Following this, plug the Arduino into the wall to get mains power.
Finally, connect the fan to the wall outlet.

You should hear the clicking sound of the relay as it switches on or off depending on the room temperature. When the relay switches on (or off), the fan will follow suit.

Using a separate laptop if you have one, or your Raspberry Pi, access the IP address you specified in the application via a web browser, for example, http://192.168.3.5/. You should see something similar to this:

Thermostat set to: 25degrees C
Humidity: 35.70 %
Temperature: 14.90 degrees C

You can now stimulate the sensor using an ice cube and a hair dryer to switch the relay on and off, and the fan will follow suit. If you refresh your connection to the IP address, you should see the change in the temperature output on the screen. You can use the F5 key to do this. Let's now test the JSON response.

Testing the JSON response

A format useful in transferring data between applications is JavaScript Object Notation (JSON). You can read more about this on the official JSON website, at http://www.json.org/.

The purpose of generating data in JSON format is to allow the Raspberry Pi control device we are building to query the thermostat periodically and collect the data being generated. We can verify that we are getting JSON data back from the sketch by making an HTTP request using the application/json header. Load a web browser such as Google Chrome or Firefox.
We are going to make an XML HTTP request directly from the browser to our thermostat. This type of request is commonly known as an Asynchronous JavaScript and XML (AJAX) request. It can be used to refresh data on a page without having to actually reload it.

In your web browser, locate and open the developer tools. The following link lists the location and shortcut keys in major browsers: http://webmasters.stackexchange.com/questions/8525/how-to-open-the-javascript-console-in-different-browsers

In the JavaScript console portion of the developer tools, type the following JavaScript code:

var xmlhttp;
xmlhttp = new XMLHttpRequest();
xmlhttp.open("POST", "http://192.168.3.5/", true);
xmlhttp.setRequestHeader("Content-type", "application/json");
xmlhttp.onreadystatechange = function() { // Call a function when the state changes.
   if (xmlhttp.readyState == 4 && xmlhttp.status == 200) {
      console.log(xmlhttp);
   }
};
xmlhttp.send()

Press the Return key or use the run option to execute the code. This will fire an HTTP request, and you should see a JSON object returned:

"{"thermostat":
    [
      {"location":"library"},
      {"temperature":"14.90"},
      {"humidity":"29.90"},
      {"setpoint":"25"}
    ]
}"

This confirms that our application can return data to the Raspberry Pi. We have tested our software and hardware and seen that they are working.

Summary

In this article, we built a thermostat device. We looked at thermistors, and we learned how to set up an Ethernet connection. To control our thermostat, we wrote an Arduino sketch, uploaded it to the microcontroller, and then tested it with a fan plugged into the mains electricity.
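As a recap of the JSON handling in this article, the payload that the sketch prints piece by piece can be pictured as one string. The following is a minimal sketch in standard C++ that runs on a desktop machine rather than on the Arduino. The names formatReading and buildThermostatJson are hypothetical helpers introduced only for illustration, and the two-decimal formatting is an assumption chosen to mirror sample readings such as 14.90.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper (not part of the original sketch): formats a reading
// with two decimals, mirroring sample values such as "14.90".
std::string formatReading(float value) {
    char buffer[32];
    std::snprintf(buffer, sizeof(buffer), "%.2f", value);
    return std::string(buffer);
}

// Assembles the same JSON structure that the Arduino sketch prints piece by
// piece. Every quote inside a string literal is escaped with a backslash.
// Assumption: the room name contains no characters that need JSON escaping.
std::string buildThermostatJson(const std::string& room, float temperature,
                                float humidity, int setpoint) {
    std::string json = "{\"thermostat\":[";
    json += "{\"location\":\"" + room + "\"},";
    json += "{\"temperature\":\"" + formatReading(temperature) + "\"},";
    json += "{\"humidity\":\"" + formatReading(humidity) + "\"},";
    json += "{\"setpoint\":\"" + std::to_string(setpoint) + "\"}";
    json += "]}";
    return json;
}
```

Building the string in one place makes it easier to spot a missing escaped quote or comma than when the pieces are spread over many print calls, although on the Arduino itself printing in pieces keeps memory use low.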

Packt
03 Mar 2015
32 min read

Creating a Brick Breaking Game

Have you ever thought about procedurally generated levels? Have you thought about how this could be done, how their logic works, and how their resources are managed? With our example bricks game, you will get to the core point of generating colors procedurally for each block, every time the level gets loaded.

Physics has always been a huge and massively important topic in the process of developing a game. A brick breaking game can be made in many ways, using the many techniques that the engine provides, but I chose to make it a physics-based game to cover the usage of the new, unique, and amazing component that Epic has recently added to its engine. The Projectile component is a physics-based component for which you can tweak many attributes to get a huge variation of behaviors that you can use with any game genre.

By the end of this article by Muhammad A. Moniem, the author of Learning Unreal Engine iOS Game Development, you will be able to:

Build your first multicomponent blueprints
Understand more about the game modes
Script a touch input
Understand the Projectile component in depth
Build a simple emissive material
Use the dynamic material instances
Start using the construction scripts
Detect collisions
Start adding sound effects to the game
Restart a level
Have a fully functional gameplay

The project structure

For this game sample, I made a blank project template and chose to use the starter content so that I could get some cubes, spheres, and all the other basic 3D meshes that will be used in the game. So, you will find the project still has the same basic structure, and the most important folder, where you will find all the content, is called Blueprints.

Building the blueprints

The game, as you might see in the project files, contains only four blueprints.
As I said earlier, a blueprint can be an object in your world or even a piece of logic without any physical representation inside the game view. The four blueprints responsible for the game are explained here:

ball: This is the blueprint that is responsible for the ball rendering and movement. You can consider it as an entity in the game world, as it has its own representation, which is a 3D ball.
platform: This one also has its visual representation in the game world. This is the platform that will receive the player input.
levelLayout: This one represents the level itself and its layout, walls, blocks, and game camera.
bricksBreakingMode: Every game or level made with Unreal Engine should have a game mode blueprint type. This defines the main player, the controller used to control the gameplay, the pawn that works in the same way as the main player but has no input, the HUD for the main UI controller, and the game state that is useful in multiplayer games. Even if you are using the default settings, it is better to make a placeholder one!

Gameplay mechanics

I've always been a big fan of planning the code before writing or scripting it. So, I'll try to keep the same habit here as well; before making each game, I'll explain how the gameplay workflow should be. With such a habit, you can figure out the weak points of your logic before you even build it. It helps you develop quickly and more efficiently.

As I mentioned earlier, the game has only three working blueprints, and the fourth one is used to organize the level (which is not gameplay logic and has no logic at all). Here are the steps that the game should follow one by one:

At the start of the game, the levelLayout blueprint will start instantiating the bricks and set a different color for each one.
The levelLayout blueprint sets the rendering camera to the one we want.
The ball blueprint starts moving the ball with a proper velocity and sets a dynamic material for the ball mesh.
The platform blueprint starts accepting the input events on a frame-by-frame basis from mouse or touch inputs, and sets a dynamic material for the platform mesh. If the ball blueprint hits any other object, it should never speed up or slow down; it should keep the same speed. If the ball blueprint crossed the bottom line, it should restart the level. If the player pressed the screen or clicked on the mouse, the platform blueprint should move only on the y axis to follow the finger or the mouse cursor. If the ball blueprint hits any brick from the levelLayout blueprint, it should destroy it. The ball plays some sound effects. Depending on the surface it hits, it plays a different sound. Starting a new level As the game will be based on one level only and the engine already gives us this new pretty level with a sky dome and light effects with some basic assets, all of this will not be necessary for our game. So, you need to go to the File menu, select New Level, add it somewhere inside your project files, and give it a special name. In my case, I made a new folder named gameScene to hold my level (or any other levels if my game is a multilevel game) and named it mainLevel. Now, this level will never get loaded into the game without forcing the engine to do that. The Unreal Editor gives you a great set of options to define which is the default map/level to be loaded when the game starts or when the editor runs. Even when you ship the game, the Unreal Editor tells us which levels should be shipped and which levels shouldn't be shipped to save some space. Open the Edit menu and then open Project Settings. When the window pops up, select the Maps & Modes section and set Game Default Map to the newly created level. Editor Startup Map should also have the same level: Building the game mode Although a game mode is a blueprint, I prefer to always separate its creation from the creation of the game blueprints, as it contains zero work for logic or even graphs. 
A game mode is essential for each level, not only for each game. Right-click in an empty space inside your project directory and select Blueprint under the Basic assets section. When the Pick Parent Class window pops up, select the last type of blueprint, which is called Game Mode, and give your newly created blueprint a name, which, in my case, is bricksBreakingMode.

Now, we have a game mode for the game level; this mode will not work at all without being connected to the current level (the empty level I made in the previous section) somehow. Go to World Settings by clicking on the icon in the top shelf of the editor (you need to get used to accessing World Settings, as it has so many options that you will need to tweak to fit your games):

The World Settings panel will be on the right-hand side of your screen. Scroll down to the Game Mode part and select the one you made from the Game Mode Override drop-down menu. If you cannot find the one you've made, just type its name, and the smart menu will search the project to find it.

Building the game's main material

As the game is an iOS game, we should be cautious when adding elements and code, to protect the game from performance overhead, glitches, or crashes. Although the engine can run a game with lights on an iOS device, I always prefer to stay as far away as possible from using lights/directional lights in an iOS game, as a directional light source in real time would mean recalculating all the vertices. So, if the level has 10k vertices with two directional lights, it will be calculated as 30k vertices.

The best way to avoid using a light source for a simple game like the brick breaking game is to build a special material that can emulate light emission; this material is called an emissive material. In your project panel, right-click in an empty space (perhaps inside the materials folder) and choose a material from the Basic Assets section.
Give this material a name (which, in my case, is gameEmissiveMaterial) and then double-click to open the material editor. As you can see, the material editor for a default new material is almost empty, apart from one big node that contains the material outputs with a black colored material. To start adding new nodes, you will need to right-click in an empty space of your editor grid and then either select a node or search for nodes by name; both ways work fine.   The emissive material is just a material with Color and Emissive Color; you can see these names in your output list, which means you will need to connect some sort of nodes or graphs to these two sockets of the material output. Now, add the following three new nodes: VectorParameter: This represents the color; you can pick a color by clicking on the color area on the left-hand panel of the screen or on the Default Value parameter. ScalarParameter: This represents a factor to scale the color of the material; you can set its Default Value to 2, which works fine for the game. Multiply: This will multiply two values (the color and the scalar) to give a value to be used for the emission. With these three nodes in your graph, you might figure out how it works. The basic color has to be added to the base color output, and then the Multiply result of the base color and scalar will be added to the emissive color output of the material: You can rename the nodes and give them special names, which will be useful later on. I named the VectorParameter node BaseColor and the Scalar node EmissiveScalar. You can check out the difference between the emissive material you made and another default material by applying both to two meshes in a level without any light. The default material will light the mesh in black as it expects a light source, but the emissive one will make it colored and shiny. 
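The arithmetic behind the material graph can be sketched outside the editor. The following is a minimal illustration in plain C++, not engine code: the Color struct and the emissiveColor function are hypothetical stand-ins for the VectorParameter, ScalarParameter, and Multiply nodes described above, assuming the multiplication is applied to each color channel.

```cpp
// Minimal stand-in for a linear RGB color; this is not an Unreal Engine type.
struct Color {
    float r;
    float g;
    float b;
};

// Mirrors the Multiply node: each channel of the base color is scaled by the
// scalar parameter, and the result feeds the emissive color output.
Color emissiveColor(const Color& baseColor, float emissiveScalar) {
    return Color{baseColor.r * emissiveScalar,
                 baseColor.g * emissiveScalar,
                 baseColor.b * emissiveScalar};
}
```

With a scalar of 2, as suggested for EmissiveScalar, a yellow base color of (1, 1, 0) yields an emissive value of (2, 2, 0), which is why the mesh stays visible and bright even when the level has no light source.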
Building the blueprints and components I prefer to call all the blueprints for this game actors as all of them will be based on a class in the engine core. This class usually represents any object with or without logic in the level. Although blueprints based on the actor class are not accepting input, you will learn a way to force any actor blueprint to get input events. In this section, you will build the different blueprints for the game and add components for each one of them. Later on, in another section, you will build the logic and graphs. As I always say, building and setting all the components and the default values should be the first thing you do in any game, and then adding the logic should follow. Do not work on both simultaneously! Building the layout blueprint The layout blueprint should include the bricks that the players are going to break, the camera that renders the level, and the walls that the ball is going to collide with. Start making it by adding an Actor blueprint in your project directory. Name it levelLayout and double-click on it to open the blueprint editor. The blueprint editor, by default, contains the following three subeditors inside it; you can navigate between them via the buttons in the top-right corner: Defaults: This is used to set the default values of the blueprint class type Components: This is used to add different components to build and structure the blueprint Graph: This is where we will add scripting logic The majority of the time, you will be working with the components and graph editors only, as the default editor's default values always work the best:   Open the component graph and start adding these components: Camera: This will be the component that renders the game. As you can see in the preceding screenshot, I added one component and left its name as Camera1. It was set as ROOT of the blueprint; it holds all the other components as children underneath its hierarchy. 
Changed Values: The only value you need to change in the camera component is Projection Mode. You need to set it to Orthographic, as it will be rendered as a 2D game, and keep Ortho Width as 512, as it will make the screen show all the content in a good size. Feel free to use different values based on the content of your level design. Orthographic cameras work without depth, and they are recommended more in 2D games. On the other hand, the perspective camera has more depth, and it is better to be used with any games with 3D content. Static Mesh: To be able to add meshes as boundaries or triggering areas to collide with the ball, you will need to add cubes to work as collision walls, perhaps hidden walls. The best way to add this is by adding four static meshes and aligning and moving them to build them as a scene stage. Renaming all of them is also a good way to go. To be able to distinguish between them, you can name them as I named them: StaticMeshLeftMargin, StaticMeshRightMargin, StaticMeshTopMargin, and StaticMeshBottomMargin. The first three are the left, right, and top margins; they will be working as collision walls to force the ball to bounce in different directions. However, the bottom one will work as a trigger area to restart the level when the ball passes through it. Changed Values: You need to set Static Mesh for them as the cube and then start to scale and move it to build the scene. For the walls, you need to add the Wall tag for the first three meshes in the Component Tags options area, and for the bottom trigger, you need to add another tag; something like deathTrigger works fine. These tags will be used by the gameplay logic to detect whether the ball hits a wall and you need to play a sound or whether it hits a death area and you need to restart the level. In the Collision section for each static mesh, you need to set both SimulationGeneratesHitEvents and GenerateOverlapEvents to True. 
Also, for CollisionPreset, you can select BlockAll, as this will create solid walls to block any other object from passing:

Finally, from the Rendering options section, you need to select the emissive material we have made to be able to see those static meshes, and you need to mark Hidden in Game as True to hide those objects. Keep in mind that you can leave those objects visible in the game for debugging, and once you are sure that they are in the correct place, you can return to this option and mark it as True.

Billboard: For now, you can think about the billboard component as a point in space with a representation icon, and this is how it is mostly used inside UE4, as the engine does not support an independent transform component yet. However, billboards have always been used to show content that always faces the camera, such as particles, text, or anything else you need to render from the same angle. As the game will be generating the blocks/bricks during the gameplay, you will need to have some points to define where to build or to start building those bricks. You can add five billboard points, rename them, and rearrange them to look like a column. You don't have to change any values for them, as you will be using only their positions in space! I named those five points firstRowPoint, SecondRowPoint, thirdRowPoint, fourthRowPoint, and fifthRowPoint.

Building the ball blueprint

Start making the ball blueprint by adding an Actor blueprint in your project directory. Name it Ball and double-click on it to open the blueprint editor. Then, navigate to the Components subeditor if you are not there already. Start adding the following components to the blueprint:

The sphere will work as the collision surface for the Ball blueprint. For this reason, you will need to set both SimulationGeneratesHitEvents and GenerateOverlapEvents to True in its Collision options.
Also, set the CollisionPreset option to BlockAll to act in a manner similar to the walls from the layout blueprint. You need to set the SphereRadius option from the Shape section to 26.0 so that it is of a good size that fits the screen's overall size. The process for adding static meshes is the same as you did earlier, but this time, you will need to select a sphere mesh from the standard assets that came with the project. You will also need to set its material to the project default material you made earlier in this article. Also, after selecting it, you might need to adjust its Scale to 0.5 in all three axes to fit the collision sphere size. Feel free to move the static mesh component on the x, y, and z axes till it fits the collision surface. The projectile movement component is the most important one for the Ball blueprint, or perhaps it is the most important one throughout this article, as it is the one responsible for the ball movement and velocity and for its physics behaviors. After adding the components, you will need to make some tweaks to it to allow it to give the behavior that matches the game. Keep in mind that any small amount of change in values or variables will lead you to have a completely different behavior, so feel free to play through the values and test them to get some crazy ideas about what you can achieve and what you can get. For changed values, you need to set Projectile Gravity Scale to 0.0 from within the Projectile options; this will allow the ball to fly in the air without a gravity force to bring it down (or any other direction for a custom gravity). For Projectile Bounces, you will need to mark Should Bounce as True. In this case, the projectile physics will be forced to keep bouncing with the amount of bounciness you set. 
As you want the ball to keep bouncing over the walls, you need to set the value to 1.0 to give it full bounciness power: From the Velocity section, you will need to enter a velocity for the ball to start using when the game runs; otherwise, the ball will never move. As you want the first bounce of the ball to be towards the blocks, you need to set the Z value to a high number, such as 300, and give it more level design sense. It shouldn't bounce in a vertical line, so it is better to give some force on the horizontal axis Y as well as move the ball in a diagonal direction. So, let's add 300 into Y as well. Building the platform blueprint Start making the platform blueprint by adding an Actor blueprint in your project directory. Name it platform and double-click on it to open the blueprint editor. Then, navigate to the Components subeditor if you are not there already. You will add only one component, and it will work for everything. You want to add a Static Mesh component, but this time, you will be selecting the Pipe mesh; you can select whatever you want, but the pipe works the best. Don't forget to set its material to be the same emissive material as we used earlier to be able to see it in the game view, and set its Collision option to SimulationGeneratesHitEvents and GenerateOverlapEvents to True. Also, CollisionPreset should be set to BlockAll to act in the same manner as the walls from the layout blueprint. Building the graphs and logic Now, as all the blueprints have been set up with their components, it's time to start adding the gameplay logic/scripting. However, to be able to see the result of what you are going to build, you first need to drag and drop the three blueprints inside your scene and organize them to look like an actual level. As the engine is a 3D engine and there is no support yet for 2D physics, you might notice that I added two extra objects to the scene (giant cubes), which I named depthPreservingCube and depthPreservingCube2. 
These objects are here basically to prevent the ball from moving in the depth axis, which is X in Unreal Editor. This is how both the new preserving cubes look from a top view: One general step that you will perform for all blueprints is to set the dynamic material for them. As you know, you made only one material and applied it to the platform and to the ball. However, you also want both to look different during the gameplay. Changing the material color right now will change both objects' visibility. However, changing it during the gameplay via the construction script and the dynamic material instances feature will allow you to have many colors for many different objects, but they will still share the same material. So, in this step, you will make the platform blueprint and the ball blueprint. I'll explain how to make it for the ball, and you will perform the same steps to make it for the platform. Select the ball blueprint first and double-click to open the editor; then, this time navigate to the subeditor graphs to start working with the nodes. You will see that there are two major tabs inside the graph; one of them is named Construction Script. This unique tab is responsible for the construction of the blueprint itself. Open the Construction Script tab that always has a Construction Script node by default; then, drag and drop the StaticMesh component of the ball from the panel on the left-hand side. This will cause you to have a small context menu that has only two options: Get and Set. Select Get, and this will add a reference to the static mesh. Now, drag a line from Construction Script, leave it in an empty space, add a Create Dynamic Material Instance node from the context menu, and set its Source Material option to the material we want to instance (which is the emissive material). 
However, keep in mind that if you are using a later version, Epic has introduced an easier way to access the Create Dynamic Material Instance node: just drag a line from Static Mesh-ball inside Graph, and not Construction Script. Now, connect the static mesh to be the target and drag a line out of Return Value of the Create Dynamic Material Instance node. From the context menu, select the first option, which is Promote to a Variable; this will add a variable to the left-panel list. Feel free to give it a name you can recognize, which, in my case, is thisColor. Now, the whole thing should look like this:

Now that you've created the dynamic material instance, you need to set the new color for it. To do this, you need to go back to the event graph and start adding the logic for it. I'll add it to the ball, and you need to apply it again in Event Graph of the platform blueprint. Add an Event Begin Play node, which is responsible for running some logic when the game starts. Drag a wire out of it and select the Set Vector Parameter Value node that is responsible for setting the value for the material. Now, add a reference for the thisColor variable and connect it to Target of the Set Vector Parameter Value node. Last but not least, enter the Parameter name that you used to build the material, which, in my case, is BaseColor. Finally, set Value to a color you like; I picked yellow for the ball. Which color would you like to pick?

The layout blueprint graph

Before you start working with this section, you need to make several copies of the material we made earlier and give each one its own color. I made six different ones to give a variation of six colors to the blocks.

The scripts here will be responsible for creating the blocks, changing their colors, and finally, setting the game view to the current camera. To serve this goal, you need to add several variables of several types.
Here are some variables:

numberOfColumns: This is an integer variable that has a default value of six, which is the total number of columns per row.
currentProgressBlockPosition: This is a vector type variable to hold the position of the last created block. It is very important because you are going to add blocks one after the other, so you want to define the position of the last block and then add spacing to it.
aBlockMaterial: This is the material that will be applied to a specific block.
materialRandomIndex: This is a random integer value used to select a color procedurally for each block.

To make things more organized, I made several custom events. You can think about them as a set of functions; each one has a block of logic to execute:

Initialize The Blocks: This Custom Event node has a set of for loops that work one by one on initializing the target blocks when the game starts. Each loop cycles six times from Index 0 to the number of columns index. When it is finished, it runs the next loop. Each loop body is a custom function itself, and they all run the same logic, except that they use a different row.
chooseRandomMaterial: This custom event handles the process of picking a random material to be applied to a block as it is created. It works by setting a random value between 1 and 6 to the materialRandomIndex variable, and depending on the selected value, the aBlockMaterial variable will be set to a different material. This aBlockMaterial variable is the one that will be used to set the material of each created block in each iteration of the loop for each row.
addRowX: I named this X here, but in fact, there are five functions to add the rows; they are addRow1, addRow2, addRow3, addRow4, and addRow5. All of them are responsible for adding rows; the main difference is the start point of the row. Each one of them uses a different billboard transform, starting from firstRowPoint and ending with fifthRowPoint.
You need to connect your first node as Add Static Mesh and set its properties as any other static mesh. You need to set its material to the emissive one. Set Static Mesh to Shape_Pipe_180, give it a brickPiece tag, and set its Collision options to Simulation Generates Hit Events and Generate Overlap Events to True. Also, Collision Preset has to be set to Block All to act in the same manner as the walls from the layout blueprint and receive the hit events, which will be the core of the ball detection. This created mesh will need a transform point to be instantiated in its cords. This is where you will need to pick the row point transform reference (depending on your row, you will select the point number), add it to a Make Transform node, and finally, set the new transform Y Rotation to -90 and its XYZ scale to 0.7, 0.7, 0.5 to fit the correct size and flip the block to have a better convex look. This second part of the addRow event should use the ChooseRandomMaterial custom event that you already made to select a material from among six random ones. Then, you can execute SetMaterial, make its Target the same mesh that was created via Add Static Mesh, and set its Material to aBlockMaterial; the material changes every time the chooseRandomMaterial event gets called. Finally, you can use SetRelativeLocation of the billboard point that is responsible for that row to another position on the y axis, using the Make Vector and Add Int(+) nodes to add 75 units every time as a spacing between every two created blocks: Now, if you check the project files, you will find that the only difference is that there are five functions called addRow, and each of them uses a different billboard as a starting point to add the blocks. Now, if you run the version you made or the one within the project files, you will be able to see the generated blocks, and each time you stop and run the game, you will get a completely different color variation of the blocks. 
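The row-building logic described above can be sketched outside the Blueprint editor. The following Python snippet is only an illustration, not the article's Blueprint graph: the names `Block`, `add_row`, and `initialize_the_blocks` are hypothetical, and the row start points stand in for the billboard transforms (firstRowPoint to fifthRowPoint). It assumes six columns, a 75-unit step on the y axis, and a random material index from 1 to 6, as the text describes.

```python
import random
from dataclasses import dataclass

NUMBER_OF_COLUMNS = 6   # numberOfColumns in the article
BLOCK_SPACING = 75.0    # units added on the y axis after each created block
MATERIAL_COUNT = 6      # chooseRandomMaterial picks an index from 1 to 6


@dataclass
class Block:
    """Hypothetical stand-in for one spawned static mesh."""
    row: int
    position: tuple      # (x, y, z)
    material_index: int


def add_row(row, start_point, rng):
    """Create one row of blocks, stepping along y from the row's start point."""
    x, y, z = start_point
    blocks = []
    for _ in range(NUMBER_OF_COLUMNS):
        # Equivalent of calling chooseRandomMaterial before SetMaterial.
        material_index = rng.randint(1, MATERIAL_COUNT)
        blocks.append(Block(row, (x, y, z), material_index))
        # Equivalent of moving the billboard point by the spacing.
        y += BLOCK_SPACING
    return blocks


def initialize_the_blocks(row_start_points, seed=None):
    """Run add_row once per start point, like the chained loops in the event."""
    rng = random.Random(seed)
    blocks = []
    for row, start in enumerate(row_start_points, start=1):
        blocks.extend(add_row(row, start, rng))
    return blocks
```

Calling `initialize_the_blocks` with five start points yields 30 blocks; leaving `seed` unset gives a different color layout on each run, which mirrors the behavior described for the game.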
There is one last thing to completely finish this blueprint. As you might have noticed, this blueprint contains the camera in its components. This means it should be the one that holds the functionality of setting this camera to be the rendering camera. So, in Event Begin Play, this functionality will be fired when the level starts. You need to connect the Set View Target With Blend node, which will set the camera as the target camera, and you need to connect Get Player Controller (player 0 is the player number 1) to the Target socket. The New View Target socket refers to this blueprint. Finally, you need to call the initializeTheBlocks custom event, which will call all the other functions.

Congratulations! Now you have built your first functional and complex blueprint that contains the main and important functionalities everyone must use in any game. Also, you got the trick of how you can randomly generate or change things such as the color of the blocks to make the levels feel different every time.

The Ball blueprint graph

The main event node that will be used in the ball graph is Event Hit, which will be fired automatically every time the ball collider hits another collider. If you still remember, while creating the platform, walls, and blocks, we added tags to every static mesh to define them. Those names are used now. Using a node called Component Has Tag, we can compare the object component that the ball has hit with the value of the Component Has Tag node, and then, we either get a positive or negative result. So, this is how it should work:

Whenever the ball hits another collider, check whether it is a brickPiece tagged component. If this is true, then disable the collision of the brick piece via the Set Collision Enabled node and set it to No Collision to stop responding to any other collisions. Then, hide the brick mesh using the Set Visibility node and keep the New Visibility option unmarked, which means that it will be hidden.
Then, play a sound effect of the hit to make the gameplay more dynamic. You can play sound in many different ways, but let's use the Play Sound at Location node now, use the location of the ball itself, and use the hitBrick sound effect from the Audio folder by assigning it to the Sound slot of the Play Sound at Location node. Finally, reset the velocity of the ball using the Set Velocity node referenced by the Projectile Movement component and set it to XYZ 300, 0, 300:

If it wasn't a brickPiece tag, then let's check whether Component Has Tag matches Wall. If this is the case, then let's use Play Sound at Location, use the location of the ball itself, and use the hitBlockingWall sound effect from the Audio folder by assigning it to the Sound slot of the Play Sound at Location node:

If it wasn't tagged with Wall, then check whether it is finally tagged with deathTrigger. If this is the case, then the player has missed it, and the ball is now below the platform. So, you can use the Open Level node to load the level again and assign the level name as mainLevel (or any other level you want to load) to the Level Name slot:

The platform blueprint graph

The platform blueprint will be the one that receives the input from the player. You just need to define the player input to make the blueprint able to receive those events from the mouse, touch, or any other available input device. To do this, there are two ways, and I always like to use both these ways:

Enable input node: I assume that you've already added the scripting nodes inside Event graph to set the dynamic material color via Set Vector Parameter Value. This means you already have an Event Begin Play node, so you need to connect its network to another node called Enable Input; this node is responsible for forcing the current blueprint to accept input events.
Finally, you can set its Player Controller value to a Get Player Controller node and leave Player Index as 0 for the player number 1: Autoreceive input option: By selecting the platform blueprint instance that you've dropped inside the scene from the Scene Outliner, you will see that it has many options in the Details panel on the right-hand side. By changing the Auto Receive Input option to Player 0 under the Input option, this will have the same effect as the previous solution: Now, we can build the logic for the platform movement, and anything that is built can be tested directly in the editor or on the device. I prefer to break the logic into two pieces, and this will make it easier than it looks like for you: Get the touch state: In this phase, you will use the Input Touch event that can be executed when a touch gets pressed or released. So based on the touch state, you will check via a Branch node whether the state is True or False. Your condition for this node should be Touch 1 index, as the game will not need more than one touch. Based on the state, I would like to set a custom Boolean variable named Touched and set its value to match the touch state. Then, you can add a Gate node to control the execution of the following procedurals based on the touch state (Pressed or Released) by connecting the two cases with the Open gate and the Close gate execution sockets. Finally, you can set the actor location and set it to use the Self actor as its target (which is the platform actor/blueprint) to change the platform location based on touches. Defining the New Location value is the next chunk of the logic: Actor location: Using a Make Vector node, you can construct a new point position in the world made of X, Y, and Z coordinates. As the y axis will be the horizontal position, which will be based on the player's touch, only this needs to be changed over time. 
However, the X and Z positions will stay the same all the time, as the platform will never move vertically or in depth. The new vector position will be based on the touch phase. If the player is pressing, then the position should match the touch input position. However, if the player is not pressing, then the position should be the same as the last point the player had pressed. I made a float variable named horizontalAxis; this variable will hold the correct Y position to be added to the Make Vector node. If the player is pressing the screen, then you need to get the finger press position by returning Impact Point by Break Hit Result via a Get Hit Result Under Finger By Channel node from the current active player. However, if the player is not touching the screen, then the horizontalAxis variable should stay the same as the last-known location for the Self actor. Then, it is passed as is into the Make Vector Y position value:

Now, you can save and build all the blueprints. Don't hesitate now or any time during the process of building the game logic to build or launch the game onto a real device to check where you are. The best way to learn more about the nodes and those minor changes is by building onto the device all the time and changing some values every time.

Summary

In this article, you went through the process of building your first Unreal iOS game. Also, you got used to making blueprints by adding nodes in different ways, connecting nodes, and adding several component types into the blueprint and changing their values. Also, you learned how to enable input in an actor blueprint and get the touch and mouse input and fit them to your custom use. You also got your hands on one of the most famous and powerful rendering techniques in the editor, which is called dynamic material instancing. You learned how to make a custom material and change its parameters whenever you want.
Procedurally, changing the look of the level is something interesting nowadays, and we barely scratched its surface by setting different materials every time we load the level. Resources for Article: Further resources on this subject: UnrealScript Game Programming Cookbook [article] Unreal Development Toolkit: Level Design HQ [article] The Unreal Engine [article]
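As a recap of the platform movement logic described earlier in this article, here is a rough Python sketch. It is not the Blueprint graph itself: the `Platform` class and its method names are hypothetical, `touched` and `horizontal_axis` stand in for the Touched and horizontalAxis variables, and `finger_impact_y` stands in for the Y component of the Impact Point returned by the hit result under the finger.

```python
from dataclasses import dataclass


@dataclass
class Platform:
    """Hypothetical model of the platform actor's position logic."""
    x: float                       # depth position, never changes
    z: float                       # vertical position, never changes
    horizontal_axis: float = 0.0   # last known Y position (horizontalAxis)
    touched: bool = False          # mirrors the Touched Boolean variable

    def on_touch(self, pressed):
        """Record the touch state, like the Input Touch event feeding the Gate."""
        self.touched = pressed

    def update(self, finger_impact_y=None):
        """Return the new actor location; only Y follows the finger while touched."""
        if self.touched and finger_impact_y is not None:
            self.horizontal_axis = finger_impact_y
        # Equivalent of Make Vector with fixed X and Z.
        return (self.x, self.horizontal_axis, self.z)
```

While the player is not touching the screen, `update` keeps returning the last stored Y value, which is the behavior the article asks of horizontalAxis.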
Packt
02 Mar 2015
15 min read

A Quick Start Guide to Flume

In this article by Steve Hoffman, the author of the book, Apache Flume: Distributed Log Collection for Hadoop - Second Edition, we will learn about the basics that are required to be known before we start working with Apache Flume. This article will help you get started with Flume. So, let's start with the first step: downloading and configuring Flume. (For more resources related to this topic, see here.) Downloading Flume Let's download Flume from http://flume.apache.org/. Look for the download link in the side navigation. You'll see two compressed .tar archives available along with the checksum and GPG signature files used to verify the archives. Instructions to verify the download are on the website, so I won't cover them here. Checking the checksum file contents against the actual checksum verifies that the download was not corrupted. Checking the signature file validates that all the files you are downloading (including the checksum and signature) came from Apache and not some nefarious location. Do you really need to verify your downloads? In general, it is a good idea and it is recommended by Apache that you do so. If you choose not to, I won't tell. The binary distribution archive has bin in the name, and the source archive is marked with src. The source archive contains just the Flume source code. The binary distribution is much larger because it contains not only the Flume source and the compiled Flume components (jars, javadocs, and so on), but also all the dependent Java libraries. The binary package contains the same Maven POM file as the source archive, so you can always recompile the code even if you start with the binary distribution. Go ahead, download and verify the binary distribution to save us some time in getting started. Flume in Hadoop distributions Flume is available with some Hadoop distributions. 
The distributions supposedly provide bundles of Hadoop's core components and satellite projects (such as Flume) in a way that ensures things such as version compatibility and additional bug fixes are taken into account. These distributions aren't better or worse; they're just different.

There are benefits to using a distribution. Someone else has already done the work of pulling together all the version-compatible components. Today, this is less of an issue since the Apache BigTop project started (http://bigtop.apache.org/). Nevertheless, having prebuilt standard OS packages, such as RPMs and DEBs, eases installation and provides startup/shutdown scripts. Each distribution has different levels of free and paid options, including paid professional services if you really get into a situation you just can't handle.

There are downsides, of course. The version of Flume bundled in a distribution will often lag quite a bit behind the Apache releases. If there is a new or bleeding-edge feature you are interested in using, you'll either be waiting for your distribution's provider to backport it for you, or you'll be stuck patching it yourself. Furthermore, while the distribution providers do a fair amount of testing, as with any general-purpose platform, you will most likely encounter something that their testing didn't cover, in which case, you are still on the hook to come up with a workaround or dive into the code, fix it, and hopefully, submit that patch back to the open source community (where, at a future point, it'll make it into an update of your distribution or the next version).

So, things move slower in a Hadoop distribution world. You can see that as good or bad. Usually, large companies don't like the instability of bleeding-edge technology or making changes often, as change can be the most common cause of unplanned outages.
You'd be hard pressed to find such a company using the bleeding-edge Linux kernel rather than something like Red Hat Enterprise Linux (RHEL), CentOS, Ubuntu LTS, or any of the other distributions whose target is stability and compatibility. If you are a startup building the next Internet fad, you might need that bleeding-edge feature to get a leg up on the established competition. If you are considering a distribution, do the research and see what you are getting (or not getting) with each. Remember that each of these offerings is hoping that you'll eventually want and/or need their Enterprise offering, which usually doesn't come cheap. Do your homework. Here's a short, nondefinitive list of some of the more established players. For more information, refer to the following links: Cloudera: http://cloudera.com/ Hortonworks: http://hortonworks.com/ MapR: http://mapr.com/ An overview of the Flume configuration file Now that we've downloaded Flume, let's spend some time going over how to configure an agent. A Flume agent's default configuration provider uses a simple Java property file of key/value pairs that you pass as an argument to the agent upon startup. As you can configure more than one agent in a single file, you will need to additionally pass an agent identifier (called a name) so that it knows which configurations to use. In my examples where I'm only specifying one agent, I'm going to use the name agent. By default, the configuration property file is monitored for changes every 30 seconds. If a change is detected, Flume will attempt to reconfigure itself. In practice, many of the configuration settings cannot be changed after the agent has started. Save yourself some trouble and pass the undocumented --no-reload-conf argument when starting the agent (except in development situations perhaps). If you use the Cloudera distribution, the passing of this flag is currently not possible. 
I've opened a ticket to fix that at https://issues.cloudera.org/browse/DISTRO-648. If this is important to you, please vote it up.

Each agent is configured, starting with three parameters:

agent.sources=<list of sources>
agent.channels=<list of channels>
agent.sinks=<list of sinks>

Each source, channel, and sink also has a unique name within the context of that agent. For example, if I'm going to transport my Apache access logs, I might define a channel named access. The configurations for this channel would all start with the agent.channels.access prefix. Each configuration item has a type property that tells Flume what kind of source, channel, or sink it is. In this case, we are going to use an in-memory channel whose type is memory. The complete configuration for the channel named access in the agent named agent would be:

agent.channels.access.type=memory

Any arguments to a source, channel, or sink are added as additional properties using the same prefix. The memory channel has a capacity parameter to indicate the maximum number of Flume events it can hold. Let's say we didn't want to use the default value of 100; our configuration would now look like this:

agent.channels.access.type=memory
agent.channels.access.capacity=200

Finally, we need to add the access channel name to the agent.channels property so that the agent knows to load it:

agent.channels=access

Let's look at a complete example using the canonical "Hello, World!" example.

Starting up with "Hello, World!"

No technical article would be complete without a "Hello, World!" example. Here is the configuration file we'll be using:

agent.sources=s1
agent.channels=c1
agent.sinks=k1

agent.sources.s1.type=netcat
agent.sources.s1.channels=c1
agent.sources.s1.bind=0.0.0.0
agent.sources.s1.port=12345

agent.channels.c1.type=memory

agent.sinks.k1.type=logger
agent.sinks.k1.channel=c1

Here, I've defined one agent (called agent) that has a source named s1, a channel named c1, and a sink named k1.
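Because every component has to be both declared in the agent's list and configured under its own prefix, a small consistency check can catch naming slips before the agent starts. The following Python sketch is not part of Flume; `parse_properties` and `check_agent` are hypothetical helpers, and the parser only handles simple key=value lines. It assumes that component lists are separated by whitespace.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_agent(props, agent):
    """Return (declared without a type, configured but never declared)."""
    missing_type, not_declared = [], []
    for kind in ("sources", "channels", "sinks"):
        declared = props.get(f"{agent}.{kind}", "").split()
        prefix = f"{agent}.{kind}."
        # Component names that appear in keys such as agent.sinks.k1.type
        configured = {k[len(prefix):].split(".")[0]
                      for k in props if k.startswith(prefix)}
        missing_type += [f"{kind}.{name}" for name in declared
                         if f"{prefix}{name}.type" not in props]
        not_declared += sorted(f"{kind}.{name}" for name in configured
                               if name not in declared)
    return missing_type, not_declared


HELLO_WORLD = """
agent.sources=s1
agent.channels=c1
agent.sinks=k1
agent.sources.s1.type=netcat
agent.sources.s1.channels=c1
agent.sources.s1.bind=0.0.0.0
agent.sources.s1.port=12345
agent.channels.c1.type=memory
agent.sinks.k1.type=logger
agent.sinks.k1.channel=c1
"""
```

For the configuration above, `check_agent(parse_properties(HELLO_WORLD), "agent")` returns two empty lists. Declaring a sink under one name while configuring it under another would show up in both lists.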
The s1 source's type is netcat, which simply opens a socket listening for events (one line of text per event). It requires two parameters: a bind IP and a port number. In this example, we are using 0.0.0.0 for a bind address (the Java convention to specify listen on any address) and port 12345. The source configuration also has a parameter called channels (plural), which is the name of the channel(s) the source will append events to, in this case, c1. It is plural, because you can configure a source to write to more than one channel; we just aren't doing that in this simple example.

The channel named c1 is a memory channel with a default configuration.

The sink named k1 is of the logger type. This is a sink that is mostly used for debugging and testing. It will log all the events it receives from the configured channel, in this case, c1, at the INFO level using Log4j. Here, the channel keyword is singular because a sink can only be fed data from one channel.

Using this configuration, let's run the agent and connect to it using the Linux netcat utility to send an event. First, unpack the .tar archive of the binary distribution we downloaded earlier:

$ tar -zxf apache-flume-1.5.2-bin.tar.gz
$ cd apache-flume-1.5.2-bin

Next, let's briefly look at the help. Run the flume-ng command with the help command:

$ ./bin/flume-ng help
Usage: ./bin/flume-ng <command> [options]...

commands:
  help                  display this help text
  agent                 run a Flume agent
  avro-client           run an avro Flume client
  version               show Flume version info

global options:
  --conf,-c <conf>      use configs in <conf> directory
  --classpath,-C <cp>   append to the classpath
  --dryrun,-d           do not actually start Flume, just print the command
  --plugins-path <dirs> colon-separated list of plugins.d directories. See the
                        plugins.d section in the user guide for more details.
                        Default: $FLUME_HOME/plugins.d
  -Dproperty=value      sets a Java system property value
  -Xproperty=value      sets a Java -X option

agent options:
  --conf-file,-f <file> specify a config file (required)
  --name,-n <name>      the name of this agent (required)
  --help,-h             display help text

avro-client options:
  --rpcProps,-P <file>   RPC client properties file with server connection params
  --host,-H <host>       hostname to which events will be sent
  --port,-p <port>       port of the avro source
  --dirname <dir>        directory to stream to avro source
  --filename,-F <file>   text file to stream to avro source (default: std input)
  --headerFile,-R <file> File containing event headers as key/value pairs on each new line
  --help,-h              display help text

  Either --rpcProps or both --host and --port must be specified.

Note that if <conf> directory is specified, then it is always included first
in the classpath.

As you can see, there are two ways with which you can invoke the command (other than the simple help and version commands). We will be using the agent command. The use of avro-client will be covered later.

The agent command has two required parameters: a configuration file to use and the agent name (in case your configuration contains multiple agents). Let's take our sample configuration and open an editor (vi in my case, but use whatever you like):

$ vi conf/hw.conf

Next, place the contents of the preceding configuration into the editor, save, and exit back to the shell. Now you can start the agent:

$ ./bin/flume-ng agent -n agent -c conf -f conf/hw.conf -Dflume.root.logger=INFO,console

The -Dflume.root.logger property overrides the root logger in conf/log4j.properties to use the console appender. If we didn't override the root logger, everything would still work, but the output would go to the log/flume.log file instead, as specified by the default configuration file.
Of course, you can edit the conf/log4j.properties file and change the flume.root.logger property (or anything else you like). To change just the path or filename, you can set the flume.log.dir and flume.log.file properties in the configuration file or pass additional flags on the command line as follows:

$ ./bin/flume-ng agent -n agent -c conf -f conf/hw.conf -Dflume.root.logger=INFO,console -Dflume.log.dir=/tmp -Dflume.log.file=flume-agent.log

You might ask why you need to specify the -c parameter, as the -f parameter contains the complete relative path to the configuration. The reason for this is that the Log4j configuration file should be included on the class path. If you left the -c parameter off the command, you'll see this error:

Warning: No configuration directory set! Use --conf <dir> to override.
log4j:WARN No appenders could be found for logger (org.apache.flume.lifecycle.LifecycleSupervisor).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info

But you didn't do that so you should see these key log lines:

2014-10-05 15:39:06,109 (conf-file-poller-0) [INFO - org.apache.flume.conf.FlumeConfiguration.validateConfiguration(FlumeConfiguration.java:140)] Post-validation flume configuration contains configuration for agents: [agent]

This line tells you that your agent starts with the name agent. Usually you'd look for this line only to be sure you started the right configuration when you have multiple configurations defined in your configuration file.

2014-10-05 15:39:06,076 (conf-file-poller-0) [INFO - org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:133)] Reloading configuration file:conf/hw.conf

This is another sanity check to make sure you are loading the correct file, in this case our hw.conf file.
2014-10-05 15:39:06,221 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.startAllComponents(Application.java:138)] Starting new configuration:{ sourceRunners:{s1=EventDrivenSourceRunner: { source:org.apache.flume.source.NetcatSource{name:s1,state:IDLE} }} sinkRunners:{k1=SinkRunner: { policy:org.apache.flume.sink.DefaultSinkProcessor@442fbe47 counterGroup:{ name:null counters:{} } }} channels:{c1=org.apache.flume.channel.MemoryChannel{name: c1}} }

Once all the configurations have been parsed, you will see this message, which shows you everything that was configured. You can see s1, c1, and k1, and which Java classes are actually doing the work. As you probably guessed, netcat is a convenience for org.apache.flume.source.NetcatSource. We could have used the class name if we wanted. In fact, if I had my own custom source written, I would use its class name for the source's type parameter. You cannot define your own short names without patching the Flume distribution.

2014-10-05 15:39:06,427 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.source.NetcatSource.start(NetcatSource.java:164)] Created serverSocket:sun.nio.ch.ServerSocketChannelImpl[/0.0.0.0:12345]

Here, we see that our source is now listening on port 12345 for the input. So, let's send some data to it.

Finally, open a second terminal. We'll use the nc command (you can use Telnet or anything else similar) to send the Hello World string and press the Return (Enter) key to mark the end of the event:

% nc localhost 12345
Hello World
OK

The OK message came from the agent after we pressed the Return key, signifying that it accepted the line of text as a single Flume event.
If you look at the agent log, you will see the following:

2014-10-05 15:44:11,215 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:70)] Event: { headers:{} body: 48 65 6C 6C 6F 20 57 6F 72 6C 64 Hello World }

This log message shows you that the Flume event contains no headers (NetcatSource doesn't add any itself). The body is shown in hexadecimal along with a string representation (for us humans to read, in this case, our Hello World message).

If I send the following line and then press the Enter key, you'll get an OK message:

The quick brown fox jumped over the lazy dog.

You'll see this in the agent's log:

2014-10-05 15:44:57,232 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:70)] Event: { headers:{} body: 54 68 65 20 71 75 69 63 6B 20 62 72 6F 77 6E 20 The quick brown }

The event appears to have been truncated. The logger sink, by design, limits the body content to 16 bytes to keep your screen from being filled with more than what you'd need in a debugging context. If you need to see the full contents for debugging, you should use a different sink, perhaps the file_roll sink, which would write to the local filesystem.

Summary

In this article, we covered how to download the Flume binary distribution. We created a simple configuration file that included one source writing to one channel, feeding one sink. The source listened on a socket for network clients to connect to and to send it event data. These events were written to an in-memory channel and then fed to a Log4j sink to become the output. We then connected to our listening agent using the Linux netcat utility and sent some string events to our Flume agent's source. Finally, we verified that our Log4j-based sink wrote the events out.

Resources for Article: Further resources on this subject: About Cassandra [article] Introducing Kafka [article] Transformation [article]
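As a footnote to the logger sink output discussed in this article, the hexadecimal body and its 16-byte cutoff are easy to reproduce by hand. The Python sketch below only mimics the format seen in the log lines; it is not Flume's own code, and `logger_style_body` is a hypothetical helper name.

```python
def logger_style_body(body: bytes, limit: int = 16) -> str:
    """Render at most `limit` bytes as uppercase hex followed by readable text."""
    shown = body[:limit]
    hex_part = " ".join(f"{b:02X}" for b in shown)
    # Replace non-printable bytes with a dot in the text column.
    text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in shown)
    return f"{hex_part} {text_part}"


print(logger_style_body(b"Hello World"))
print(logger_style_body(b"The quick brown fox jumped over the lazy dog."))
```

The second call shows only the bytes for "The quick brown " (including the trailing space), which matches the truncated event in the agent's log.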

Packt
02 Mar 2015
19 min read

Entity Framework DB First – Inheritance Relationships between Entities

This article is written by Rahul Rajat Singh, the author of Mastering Entity Framework. So far, we have seen how we can use various approaches of Entity Framework, how we can manage database table relationships, and how to perform model validations using Entity Framework. In this article, we will see how we can implement the inheritance relationship between the entities. We will see how we can change the generated conceptual model to implement the inheritance relationship, and how it will benefit us in using the entities in an object-oriented manner and the database tables in a relational manner. (For more resources related to this topic, see here.) Domain modeling using inheritance in Entity Framework One of the major challenges while using a relational database is to manage the domain logic in an object-oriented manner when the database itself is implemented in a relational manner. ORMs like Entity Framework provide the strongly typed objects, that is, entities for the relational tables. However, it might be possible that the entities generated for the database tables are logically related to each other, and they can be better modeled using inheritance relationships rather than having independent entities. Entity Framework lets us create inheritance relationships between the entities, so that we can work with the entities in an object-oriented manner, and internally, the data will get persisted in the respective tables. Entity Framework provides us three ways of object relational domain modeling using the inheritance relationship: The Table per Type (TPT) inheritance The Table per Class Hierarchy (TPH) inheritance The Table per Concrete Class (TPC) inheritance Let's now take a look at the scenarios where the generated entities are not logically related, and how we can use these inheritance relationships to create a better domain model by implementing inheritance relationships between entities using the Entity Framework Database First approach. 
The Table per Type inheritance

The Table per Type (TPT) inheritance is useful when our database has tables that are related to each other using a one-to-one relationship. This relation is maintained in the database by a shared primary key. To illustrate this, let's take a look at an example scenario.

Let's assume a scenario where an organization maintains a database of all the people who work in a department. Some of them are employees getting a fixed salary, and some of them are vendors who are hired at an hourly rate. This is modeled in the database by having all the common data in a table called Person, and there are separate tables for the data that is specific to the employees and vendors. Let's visualize this scenario by looking at the database schema:

The database schema for the TPT inheritance scenario

The ID column for the People table can be an auto-increment identity column, but it should not be an auto-increment identity column for the Employee and Vendors tables.

In the preceding figure, the People table contains all the data common to both types of worker. The Employee table contains the data specific to the employees and the Vendors table contains the data specific to the vendors. These tables have a shared primary key and thus, there is a one-to-one relationship between the tables.

To implement the TPT inheritance, we need to perform the following steps in our application:

Generate the default Entity Data Model.
Delete the default relationships.
Add the inheritance relationship between the entities.
Use the entities via the DbContext object.

Generating the default Entity Data Model

Let's add a new ADO.NET Entity Data Model to our application, and generate the conceptual Entity Model for these tables.
The default generated Entity Model will look like this: The generated Entity Data Model where the TPT inheritance could be used Looking at the preceding conceptual model, we can see that Entity Framework is able to figure out the one-to-one relationship between the tables and creates the entities with the same relationship. However, if we take a look at the generated entities from our application domain perspective, it is fairly evident that these entities can be better managed if they have an inheritance relationship between them. So, let's see how we can modify the generated conceptual model to implement the inheritance relationship, and Entity Framework will take care of updating the data in the respective tables. Deleting default relationships The first thing we need to do to create the inheritance relationship is to delete the existing relationship from the Entity Model. This can be done by right-clicking on the relationship and selecting Delete from Model as follows: Deleting an existing relationship from the Entity Model Adding inheritance relationships between entities Once the relationships are deleted, we can add the new inheritance relationships in our Entity Model as follows: Adding inheritance relationships in the Entity Model When we add an inheritance relationship, the Visual Entity Designer will ask for the base class and derived class as follows: Selecting the base class and derived class participating in the inheritance relationship Once the inheritance relationship is created, the Entity Model will look like this: Inheritance relationship in the Entity Model After creating the inheritance relationship, we will get a compile error that the ID property is defined in all the entities. To resolve this problem, we need to delete the ID column from the derived classes. This will still keep the ID column that maps the derived classes as it is. 
So, from the application perspective, the ID property is defined in the base class, but from the mapping perspective, it is mapped in both the base class and the derived classes, so that the data gets inserted into the tables mapped to both the base and the derived entities. With this inheritance relationship in place, the entities can be used in an object-oriented manner, and Entity Framework will take care of updating the respective tables for each entity.

Using the entities via the DbContext object

As we know, DbContext is the primary class that should be used to perform various operations on entities. Let's use our SampleDbEntities context class to create an Employee and a Vendor using this Entity Model and see how the data gets updated in the database:

    using (SampleDbEntities db = new SampleDbEntities())
    {
        Employee employee = new Employee();
        employee.FirstName = "Employee 1";
        employee.LastName = "Employee 1";
        employee.PhoneNumber = "1234567";
        employee.Salary = 50000;
        employee.EmailID = "employee1@test.com";

        Vendor vendor = new Vendor();
        vendor.FirstName = "vendor 1";
        vendor.LastName = "vendor 1";
        vendor.PhoneNumber = "1234567";
        vendor.HourlyRate = 100;
        vendor.EmailID = "vendor1@test.com";

        db.Workers.Add(employee);
        db.Workers.Add(vendor);
        db.SaveChanges();
    }

In the preceding code, we create objects of the Employee and Vendor types, and then add them to the base entity set through the DbContext object. Internally, Entity Framework looks at the mappings of the base entity and the derived entities, and then pushes the respective data into the respective tables. So, if we look at the data inserted in the database, it will look like the following:

[Figure: A database snapshot of the inserted data]

It is clearly visible from the preceding database snapshot that Entity Framework looks at our inheritance relationship and pushes the data into the Person, Employee, and Vendor tables.
The Table per Class Hierarchy inheritance

The Table per Class Hierarchy (TPH) inheritance is modeled by having a single database table for all the entity classes in the inheritance hierarchy. The TPH inheritance is useful in cases where all the information about the related entities is stored in a single table. For example, using the earlier scenario, let's model the database in such a way that it contains only a single table called Workers to store the Employee and Vendor details. Let's visualize this table:

[Figure: A database schema showing the TPH inheritance]

In this case, the common fields will be populated whenever we create any type of worker. If the worker is of type Employee, Salary will contain a value and the HourlyRate field will be null. If the worker is of type Vendor, then the HourlyRate field will have a value, and Salary will be null.

This pattern is not very elegant from a database perspective. Since we are trying to keep unrelated data in a single table, our table is not normalized. There will always be some redundant columns that contain null values if we use this approach. We should try not to use this pattern unless it is absolutely needed.

To implement the TPH inheritance relationship using the preceding table structure, we need to perform the following activities:

1. Generate the default Entity Data Model.
2. Add concrete classes to the Entity Data Model.
3. Map the concrete class properties to their respective tables and columns.
4. Make the base class entity abstract.
5. Use the entities via the DbContext object.

Let's discuss this in detail.

Generating the default Entity Data Model

Let's now generate the Entity Data Model for this table.
Entity Framework will create a single entity, Worker, for this table:

[Figure: The generated model for the table created for implementing the TPH inheritance]

Adding concrete classes to the Entity Data Model

From the application perspective, it would be a much better solution to have classes such as Employee and Vendor that are derived from the Worker entity. The Worker class will contain all the common properties, and Employee and Vendor will contain their respective properties. So, let's add new entities for Employee and Vendor. While creating the entity, we can specify the base class entity as Worker, as follows:

[Figure: Adding a new entity in the Entity Data Model using a base class type]

Similarly, we will add the Vendor entity to our Entity Data Model, and specify the Worker entity as its base class entity. Once the entities are generated, our conceptual model will look like this:

[Figure: The Entity Data Model after adding the derived entities]

Next, we have to remove the Salary and HourlyRate properties from the Worker entity, and put them in the Employee and the Vendor entities respectively. Once the properties are put into the respective entities, our final Entity Data Model will look like this:

[Figure: The Entity Data Model after moving the respective properties into the derived entities]

Mapping the concrete class properties to the respective tables and columns

After this, we have to define the column mappings in the derived classes to let the derived classes know which table and column should be used to store the data. We also need to specify the mapping condition. The Employee entity should save the Salary property's value in the Salary column of the Workers table when the Salary property is Not Null and HourlyRate is Null:

[Figure: Table mapping and conditions to map the Employee entity to the respective tables]

Once this mapping is done, we have to mark the Salary property as Nullable=false in the entity property window.
This will let Entity Framework know that if someone is creating an object of the Employee type, then the Salary field is mandatory:

[Figure: Setting the Employee entity properties as Nullable]

Similarly, the Vendor entity should save the HourlyRate property's value in the HourlyRate column of the Workers table when Salary is Null and HourlyRate is Not Null:

[Figure: Table mapping and conditions to map the Vendor entity to the respective tables]

And similar to the Employee class, we also have to mark the HourlyRate property as Nullable=false in the entity property window. This will let Entity Framework know that if someone is creating an object of the Vendor type, then the HourlyRate field is mandatory:

[Figure: Setting the Vendor entity properties to Nullable]

Making the base class entity abstract

There is one last change needed to be able to use these models. We need to mark the base class as abstract, so that Entity Framework is able to resolve the Employee and Vendor objects to the Workers table.

[Figure: Making the base class Workers as abstract]

This will also be a better model from the application perspective because the Worker entity itself has no meaning from the application domain perspective.

Using the entities via the DbContext object

Now we have our Entity Data Model configured to use the TPH inheritance.
Let's create an Employee object and a Vendor object, and add them to the database using the TPH inheritance hierarchy:

    using (SampleDbEntities db = new SampleDbEntities())
    {
        Employee employee = new Employee();
        employee.FirstName = "Employee 1";
        employee.LastName = "Employee 1";
        employee.PhoneNumber = "1234567";
        employee.Salary = 50000;
        employee.EmailID = "employee1@test.com";

        Vendor vendor = new Vendor();
        vendor.FirstName = "vendor 1";
        vendor.LastName = "vendor 1";
        vendor.PhoneNumber = "1234567";
        vendor.HourlyRate = 100;
        vendor.EmailID = "vendor1@test.com";

        db.Workers.Add(employee);
        db.Workers.Add(vendor);
        db.SaveChanges();
    }

In the preceding code, we created objects of the Employee and Vendor types, and then added them to the Workers collection using the DbContext object. Entity Framework will look at the mappings of the base entity and the derived entities, check the mapping conditions and the actual values of the properties, and then push the data to the respective tables. So, let's take a look at the data inserted in the Workers table:

[Figure: A database snapshot after inserting the data using the Employee and Vendor entities]

So, we can see that for our Employee and Vendor models, the actual data is kept in the same table using Entity Framework's TPH inheritance.

The Table per Concrete Class inheritance

The Table per Concrete Class (TPC) inheritance can be used when the database contains separate tables for all the logical entities, and these tables have some common fields. In our existing example, if there are two separate tables, Employee and Vendor, then the database schema would look like the following:

[Figure: The database schema showing the TPC inheritance]

One of the major problems in such a database design is the duplication of columns in the tables, which is not recommended from the database normalization perspective.

To implement the TPC inheritance, we need to perform the following tasks:

1. Generate the default Entity Data Model.
2. Create the abstract class.
3. Modify the CSDL to cater to the change.
4. Specify the mapping to implement the TPC inheritance.
5. Use the entities via the DbContext object.

Generating the default Entity Data Model

Let's now take a look at the generated entities for this database schema:

[Figure: The default generated entities for the TPC inheritance database schema]

Entity Framework has given us separate entities for these two tables. From our application domain perspective, we can use these entities in a better way if all the common properties are moved to a common abstract class. The Employee and Vendor entities will contain the properties specific to them and inherit from this abstract class to use all the common properties.

Creating the abstract class

Let's add a new entity called Worker to our conceptual model and move the common properties into this entity:

[Figure: Adding a base class for all the common properties]

Next, we have to mark this class as abstract from the properties window:

[Figure: Marking the base class as abstract class]

Modifying the CSDL to cater to the change

Next, we have to specify the mapping for these tables. Unfortunately, the Visual Entity Designer has no support for this type of mapping, so we need to perform this mapping ourselves in the EDMX XML file.

The conceptual schema definition language (CSDL) part of the EDMX file is all set since we have already moved the common properties into the abstract class. So, now we should be able to use these properties with an abstract class handle. The problem will come in the storage schema definition language (SSDL) and mapping specification language (MSL). The first thing that we need to do is to change the SSDL to let Entity Framework know that the abstract class Worker is capable of saving the data in two tables.
This can be done by setting the EntitySet name in the EntityContainer tags as follows:

    <EntityContainer Name="todoDbModelStoreContainer">
      <EntitySet Name="Employee" EntityType="Self.Employee" Schema="dbo" store:Type="Tables" />
      <EntitySet Name="Vendor" EntityType="Self.Vendor" Schema="dbo" store:Type="Tables" />
    </EntityContainer>

Specifying the mapping to implement the TPC inheritance

Next, we need to change the MSL to properly map the properties to the respective tables based on the actual type of object. For this, we have to specify EntitySetMapping. The EntitySetMapping should look like the following:

    <EntityContainerMapping StorageEntityContainer="todoDbModelStoreContainer" CdmEntityContainer="SampleDbEntities">
      <EntitySetMapping Name="Workers">
        <EntityTypeMapping TypeName="IsTypeOf(SampleDbModel.Vendor)">
          <MappingFragment StoreEntitySet="Vendor">
            <ScalarProperty Name="HourlyRate" ColumnName="HourlyRate" />
            <ScalarProperty Name="EMailId" ColumnName="EMailId" />
            <ScalarProperty Name="PhoneNumber" ColumnName="PhoneNumber" />
            <ScalarProperty Name="LastName" ColumnName="LastName" />
            <ScalarProperty Name="FirstName" ColumnName="FirstName" />
            <ScalarProperty Name="ID" ColumnName="ID" />
          </MappingFragment>
        </EntityTypeMapping>
        <EntityTypeMapping TypeName="IsTypeOf(SampleDbModel.Employee)">
          <MappingFragment StoreEntitySet="Employee">
            <ScalarProperty Name="ID" ColumnName="ID" />
            <ScalarProperty Name="Salary" ColumnName="Salary" />
            <ScalarProperty Name="EMailId" ColumnName="EMailId" />
            <ScalarProperty Name="PhoneNumber" ColumnName="PhoneNumber" />
            <ScalarProperty Name="LastName" ColumnName="LastName" />
            <ScalarProperty Name="FirstName" ColumnName="FirstName" />
          </MappingFragment>
        </EntityTypeMapping>
      </EntitySetMapping>
    </EntityContainerMapping>

In the preceding code, we specified that if the actual type of object is Vendor, then the properties should map to the columns in the Vendor table, and if the actual type of entity is Employee, the properties should map to the Employee table, as shown in the following screenshot:

[Figure: After EDMX modifications, the mappings are visible in Visual Entity Designer]

If we now open the EDMX file again, we can see the properties being mapped to the respective tables in the respective entities. Doing this mapping from Visual Entity Designer is not possible, unfortunately.

Using the entities via the DbContext object

Let's use these entities from our code:

    using (SampleDbEntities db = new SampleDbEntities())
    {
        Employee employee = new Employee();
        employee.FirstName = "Employee 1";
        employee.LastName = "Employee 1";
        employee.PhoneNumber = "1234567";
        employee.Salary = 50000;
        employee.EMailId = "employee1@test.com";

        Vendor vendor = new Vendor();
        vendor.FirstName = "vendor 1";
        vendor.LastName = "vendor 1";
        vendor.PhoneNumber = "1234567";
        vendor.HourlyRate = 100;
        vendor.EMailId = "vendor1@test.com";

        db.Workers.Add(employee);
        db.Workers.Add(vendor);
        db.SaveChanges();
    }

In the preceding code, we created objects of the Employee and Vendor types and saved them using the Workers entity set, whose element type is the abstract Worker class. If we take a look at the database after the insert, we will see the following:

[Figure: Database snapshot of the inserted data using TPC inheritance]

From the preceding screenshot, it is clear that the data is being pushed to the respective tables. The insert operation in the previous code succeeds, but there will be an exception in the application. This exception occurs because when Entity Framework tries to access the values that are in the abstract class, it finds two records with the same ID, and since the ID column is specified as a primary key, two records with the same value are a problem in this scenario. This exception clearly shows that store/database generated identity columns will not work with the TPC inheritance.
If we want to use the TPC inheritance, then we need to use GUID-based IDs, pass the ID from the application, or use some database mechanism that can maintain the uniqueness of auto-generated columns across multiple tables.

Choosing the inheritance strategy

Now that we know about all the inheritance strategies supported by Entity Framework, let's analyze these approaches. The most important thing is that there is no single strategy that will work for all scenarios, especially if we have a legacy database. The best option is to analyze the application requirements and then look at the existing table structure to see which approach is best suited.

The Table per Class Hierarchy inheritance tends to give us denormalized tables with redundant columns. We should only use it when the number of properties in the derived classes is very small, so that the number of redundant columns is also small, and this denormalized structure will not create problems over time.

Contrary to TPH, if we have a lot of properties specific to derived classes and only a few common properties, we can use the Table per Concrete Class inheritance. However, in this approach, we will end up with some properties being repeated in all the tables. Also, this approach imposes some limitations; for example, we cannot use auto-increment identity columns in the database.

If we have a lot of common properties that could go into a base class and a lot of properties specific to derived classes, then Table per Type is perhaps the best option.

In any case, complex inheritance relationships that become unmanageable in the long run should be avoided. One alternative is to have separate domain models to implement the application logic in an object-oriented manner, and then use mappers to map these domain models to Entity Framework's generated entity models.
Summary

In this article, we looked at the various types of inheritance relationship using Entity Framework. We saw how these inheritance relationships can be implemented, and some guidelines on which should be used in which scenario.

Packt
02 Mar 2015
18 min read

Building a Color Picker with Hex RGB Conversion

In this article by Vijay Joshi, author of the book Mastering jQuery UI, we are going to create a color selector, or color picker, that will allow users to change the text and background color of a page using the slider widget. We will also use the spinner widget to represent individual colors. Any change in colors using the slider will update the spinner and vice versa. The hex value of both the text and background colors will also be displayed dynamically on the page.

This is how our page will look after we have finished building it:

[Screenshot of the finished page]

Setting up the folder structure

To set up the folder structure, follow this simple procedure:

1. Create a folder named Article inside the MasteringjQueryUI folder.
2. Directly inside this folder, create an HTML file and name it index.html.
3. Copy the js and css folders inside the Article folder as well.
4. Now go inside the js folder and create a JavaScript file named colorpicker.js.

With the folder setup complete, let's start to build the project.

Writing markup for the page

The index.html page will consist of two sections. The first section will be a text block with some text written inside it, and the second section will have our color picker controls. We will create separate controls for text color and background color. Inside the index.html file, write the following HTML code to build the page skeleton:

    <html>
      <head>
        <link rel="stylesheet" href="css/ui-lightness/jquery-ui-1.10.4.custom.min.css">
      </head>
      <body>
        <div class="container">
          <div class="ui-state-highlight" id="textBlock">
            <p>
              Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
            </p>
            <p>
              Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
            </p>
            <p>
              Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
            </p>
          </div>
          <div class="clear">&nbsp;</div>
          <ul class="controlsContainer">
            <li class="left">
              <div id="txtRed" class="red slider" data-spinner="sptxtRed" data-type="text"></div><input type="text" value="0" id="sptxtRed" data-slider="txtRed" readonly="readonly" />
              <div id="txtGreen" class="green slider" data-spinner="sptxtGreen" data-type="text"></div><input type="text" value="0" id="sptxtGreen" data-slider="txtGreen" readonly="readonly" />
              <div id="txtBlue" class="blue slider" data-spinner="sptxtBlue" data-type="text"></div><input type="text" value="0" id="sptxtBlue" data-slider="txtBlue" readonly="readonly" />
              <div class="clear">&nbsp;</div>
              Text Color : <span>#000000</span>
            </li>
            <li class="right">
              <div id="bgRed" class="red slider" data-spinner="spBgRed" data-type="bg"></div><input type="text" value="255" id="spBgRed" data-slider="bgRed" readonly="readonly" />
              <div id="bgGreen" class="green slider" data-spinner="spBgGreen" data-type="bg"></div><input type="text" value="255" id="spBgGreen" data-slider="bgGreen" readonly="readonly" />
              <div id="bgBlue" class="blue slider" data-spinner="spBgBlue" data-type="bg"></div><input type="text" value="255" id="spBgBlue" data-slider="bgBlue" readonly="readonly" />
              <div class="clear">&nbsp;</div>
              Background Color : <span>#ffffff</span>
            </li>
          </ul>
        </div>
        <script src="js/jquery-1.10.2.js"></script>
        <script src="js/jquery-ui-1.10.4.custom.min.js"></script>
        <script src="js/colorpicker.js"></script>
      </body>
    </html>

We started by including the jQuery UI CSS file inside the head section. Proceeding to the body section, we created a div with the container class, which will act as the parent div for all the page elements. Inside this div, we created another div with the id value textBlock and a ui-state-highlight class. We then put some text content inside this div. For this example, we have made three paragraph elements, each having some random text inside it.
After div#textBlock, there is an unordered list with the controlsContainer class. This ul element has two list items inside it. The first list item has the CSS class left applied to it and the second has the CSS class right applied to it.

Inside li.left, we created three div elements. Each of these three div elements will be converted to a jQuery UI slider and will represent the red (R), green (G), and blue (B) color code, respectively. Next to each of these divs is an input element where the current color code will be displayed. This input will be converted to a spinner as well.

Let's look at the first slider div and the input element next to it. The div has the id txtRed and two CSS classes, red and slider, applied to it. The red class will be used to style the slider and the slider class will be used in our colorpicker.js file. Note that this div also has two data attributes attached to it. The first is data-spinner, whose value, sptxtRed, is the id of the input element next to the slider div. The second is data-type, whose value is text. The purpose of the data-type attribute is to let us know whether this slider will be used for changing the text color or the background color.

Moving on to the input element next to the slider, we have set its id as sptxtRed, which should match the value of the data-spinner attribute on the slider div. It has another attribute named data-slider, which contains the id of the slider that it is related to. Hence, its value is txtRed.

Similarly, all the slider elements have been created inside li.left and each slider has an input next to it. The data-type attribute will have the text value for all sliders inside li.left. All input elements have also been assigned a value of 0 as the initial text color will be black.

The same pattern that has been followed for elements inside li.left is also followed for elements inside li.right. The only difference is that the data-type value will be bg for slider divs.
For all input elements, a value of 255 is set as the background color is white in the beginning. In this manner, all the six sliders and the six input elements have been defined. Note that each element has a unique ID.

Finally, there is a span element inside both li.left and li.right. The hex color code will be displayed inside it. We have placed #000000 as the default value inside the span for the text color and #ffffff as the default value inside the span for the background color.

Lastly, we have included the jQuery source file, the jQuery UI source file, and the colorpicker.js file. With the markup ready, we can now write the properties for the CSS classes that we used here.

Styling the content

To make the page presentable and structured, we need to add CSS properties for different elements. We will do this inside the head section. Go to the head section in the index.html file and write these CSS properties for different elements:

    <style type="text/css">
      body{
        color:#025c7f;
        font-family:Georgia,arial,verdana;
        width:700px;
        margin:0 auto;
      }
      .container{
        margin:0 auto;
        font-size:14px;
        position:relative;
        width:700px;
        text-align:justify;
      }
      #textBlock{
        color:#000000;
        background-color: #ffffff;
      }
      .ui-state-highlight{
        padding: 10px;
        background: none;
      }
      .controlsContainer{
        border: 1px solid;
        margin: 0;
        padding: 0;
        width: 100%;
        float: left;
      }
      .controlsContainer li{
        display: inline-block;
        float: left;
        padding: 0 0 0 50px;
        width: 299px;
      }
      .controlsContainer div.ui-slider{
        margin: 15px 0 0;
        width: 200px;
        float:left;
      }
      .left{
        border-right: 1px solid;
      }
      .clear{
        clear: both;
      }
      .red .ui-slider-range{ background: #ff0000; }
      .green .ui-slider-range{ background: #00ff00; }
      .blue .ui-slider-range{ background: #0000ff; }
      .ui-spinner{
        height: 20px;
        line-height: 1px;
        margin: 11px 0 0 15px;
      }
      input[type=text]{
        margin-top: 0;
        width: 30px;
      }
    </style>

First, we defined some general rules for the page body and div.container. Then, we defined the initial text color and background color for the div with the id textBlock. Next, we defined the CSS properties for the unordered list ul.controlsContainer and its list items. We have provided some padding and width to each list item. We have also specified the width and other properties for the slider. Since the class ui-slider is added by jQuery UI to a slider element after it is initialized, we have added our properties in the .controlsContainer div.ui-slider rule. To make the sliders attractive, we then defined the background colors for each of the slider bars by defining color codes for the red, green, and blue classes. Lastly, CSS rules have been defined for the spinner and the input box.

We can now check our progress by opening the index.html page in our browser. Loading it will display a page that resembles the following screenshot:

[Screenshot of the page before the widgets are initialized]

As expected, the sliders and spinners are not displayed yet. This is because we have not written the JavaScript code required to initialize those widgets. Our next section will take care of them.

Implementing the color picker

In order to implement the required functionality, we first need to initialize the sliders and spinners. Whenever a slider is changed, we need to update its corresponding spinner as well, and conversely, if someone changes the value of the spinner, we need to update the slider to the correct value. Whenever any of the values change, we will recalculate the current color and update the text or background color depending on the context.

Defining the object structure

We will organize our code using an object literal. We will define an init method, which will be the entry point. All event handlers will also be applied inside this method.
To begin with, go to the js folder and open the colorpicker.js file for editing. In this file, write the code that will define the object structure and a call to it:

    var colorPicker = {
      init : function ()
      {
      },
      setColor : function(slider, value)
      {
      },
      getHexColor : function(sliderType)
      {
      },
      convertToHex : function (val)
      {
      }
    };

    $(function() {
      colorPicker.init();
    });

An object named colorPicker has been defined with four methods. Let's see what each of these methods will do:

init: This method will be the entry point where we will initialize all components and add any event handlers that are required.

setColor: This method will be the main method that will take care of updating the text and background colors. It will also update the value of the spinner whenever the slider moves. This method has two parameters: the slider that was moved and its current value.

getHexColor: This method will be called from within setColor and it will return the hex code based on the RGB values in the spinners. It takes a sliderType parameter, based on which we will decide which color has to be changed; that is, the text color or the background color. The actual hex code will be calculated by the next method.

convertToHex: This method will convert an RGB value for a color into its corresponding hex value and return it to the getHexColor method.

This was an overview of the methods we are going to use. Now we will implement these methods one by one, and you will understand them in detail. After the object definition, there is jQuery's $(document).ready() event handler, written in its shorthand form, which will call the init method of our object.

The init method

In the init method, we will initialize the sliders and the spinners and set the default values for them as well.
Write the following code for the init method in the colorpicker.js file:

    init : function ()
    {
      var t = this;
      $( ".slider" ).slider(
      {
        range: "min",
        max: 255,
        slide : function (event, ui)
        {
          t.setColor($(this), ui.value);
        },
        change : function (event, ui)
        {
          t.setColor($(this), ui.value);
        }
      });

      $('input').spinner(
      {
        min : 0,
        max : 255,
        spin : function (event, ui)
        {
          var sliderRef = $(this).data('slider');
          $('#' + sliderRef).slider("value", ui.value);
        }
      });

      $( "#txtRed, #txtGreen, #txtBlue" ).slider('value', 0);
      $( "#bgRed, #bgGreen, #bgBlue" ).slider('value', 255);
    }

In the first line, we stored the current scope value, this, in a local variable named t. Next, we initialize the sliders. Since we have used the CSS class slider on each slider, we can simply use the .slider selector to select all of them. During initialization, we provide four options for the sliders: range, max, slide, and change. Note the value for max, which has been set to 255. Since the value for R, G, or B can only be between 0 and 255, we have set max as 255. We do not need to specify min as it is 0 by default.

The slide callback has also been defined, which is invoked every time the slider handle moves. The callback for slide calls the setColor method with an instance of the current slider and the value of the current slider. The setColor method will be explained in the next section.

Besides slide, the change callback is also defined, which also calls the setColor method with an instance of the current slider and its value. We use both slide and change because change is called once the user has stopped sliding the slider handle and the slider value has changed. In contrast, slide is called each time the user drags the slider handle. Since we want to change colors while sliding as well, we have defined both the slide and change callbacks.
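The `var t = this;` line at the top of init is easy to overlook, so here is a small jQuery-free sketch of why it is needed. It assumes a hypothetical `fakeWidget` helper (not part of jQuery UI) that, like the slide and change handlers described above, invokes its callback with `this` bound to the element rather than to our object:

```javascript
// Hypothetical stand-in for a widget: it calls the callback with `this`
// bound to the element, similar to how widget event handlers behave.
function fakeWidget(element, options) {
  return {
    // Simulate the user moving the handle to a new value.
    move: function (value) {
      options.slide.call(element, {}, { value: value });
    }
  };
}

var picker = {
  log: [],
  init: function () {
    var t = this; // capture the picker object before entering the callback
    var element = { id: "txtRed" };
    return fakeWidget(element, {
      slide: function (event, ui) {
        // Here `this` is the element, not the picker, so `t` is required.
        t.setColor(this.id, ui.value);
      }
    });
  },
  setColor: function (id, value) {
    this.log.push(id + "=" + value);
  }
};

var widget = picker.init();
widget.move(128);
console.log(picker.log);
```

Inside the callback, `this.id` resolves against the element, while `t.setColor` reaches the object that owns init. Without the captured variable, the callback would have no direct handle on the object's own methods.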
It is time to initialize the spinners now. The spinner widget is initialized with three options: min, max, and spin. The min and max options have been set to 0 and 255, respectively. Every time the up/down button on the spinner is clicked or the up/down arrow key is used, the spin callback will be called. Inside this callback, $(this) refers to the current spinner. We find the slider related to this spinner by reading the data-slider attribute of the spinner. Once we get the exact slider, we set its value using the value method of the slider widget. Note that calling the value method will invoke the change callback of the slider as well. This is the primary reason we have defined a callback for the change event while initializing the sliders.

Lastly, we set the default values for the sliders. For sliders inside li.left, we have set the value to 0, and for sliders inside li.right, the value is set to 255.

You can now check the page in your browser. You will find that the slider and the spinner elements are initialized now, with the values we specified:

[Screenshot of the page with the initialized sliders and spinners]

You can also see that changing the spinner value using either the mouse or the keyboard will update the value of the slider as well. However, changing the slider value will not update the spinner. We will handle this in the next section, where we will change colors as well.

Changing colors and updating the spinner

The setColor method is called each time the slider or the spinner value changes. We will now define this method to change the color based on whether the slider's or spinner's value was changed.
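Before filling in setColor, it may help to see the arithmetic it ultimately relies on in isolation. The following is a minimal sketch in plain JavaScript, not the code from this article: the `channelToHex` and `rgbToHex` names are assumptions made for illustration, and `padStart` is used for the zero padding, which assumes a reasonably modern JavaScript engine.

```javascript
// Convert one color channel (0-255) to a two-digit lowercase hex string.
// Accepts numbers or numeric strings, since spinner inputs hold text.
function channelToHex(value) {
  var n = parseInt(value, 10);
  if (isNaN(n)) {
    n = 0; // treat unparsable input as 0 in this sketch
  }
  n = Math.min(255, Math.max(0, n)); // clamp to the valid channel range
  return n.toString(16).padStart(2, "0");
}

// Concatenate the three channels into a "#rrggbb" color code.
function rgbToHex(r, g, b) {
  return "#" + [r, g, b].map(function (channel) {
    return channelToHex(channel);
  }).join("");
}

console.log(rgbToHex(0, 0, 0));          // #000000
console.log(rgbToHex(255, 255, 255));    // #ffffff
console.log(rgbToHex("2", "92", "127")); // #025c7f
```

The zero padding matters: a channel value of 2 is `2` in hex, and dropping the leading zero would produce a five-character code that browsers do not interpret as the intended color.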
Go to the setColor method declaration and write the following code: setColor : function(slider, value) {   var t = this;   var spinnerRef = slider.data('spinner');   $('#' + spinnerRef).spinner("value", value);     var sliderType = slider.data('type')     var hexColor = t.getHexColor(sliderType);   if(sliderType == 'text')   {       $('#textBlock').css({'color' : hexColor});       $('.left span:last').text(hexColor);                  }   else   {       $('#textBlock').css({'background-color' : hexColor});       $('.right span:last').text(hexColor);                  } } In the preceding code, we receive the current slider and its value as a parameter. First we get the related spinner to this slider using the data attribute spinner. Then we set the value of the spinner to the current value of the slider. Now we find out the type of slider for which setColor is being called and store it in the sliderType variable. The value for sliderType will either be text, in case of sliders inside div.left, or bg, in case of sliders inside div.right. In the next line, we will call the getHexColor method and pass the sliderType variable as its argument. The getHexColor method will return the hex color code for the selected color. Next, based on the sliderType value, we set the color of div#textBlock. If the sliderType is text, we set the color CSS property of div#textBlock and display the selected hex code in the span inside div.left. If the sliderType value is bg, we set the background color for div#textBlock and display the hex code for the background color in the span inside div.right. The getHexColor method In the preceding section, we called the getHexColor method with the sliderType argument. Let's define it first, and then we will go through it in detail. 
Write the following code to define the getHexColor method:

    getHexColor : function(sliderType)
    {
      var t = this;
      var allInputs;
      var hexCode = '#';
      if(sliderType == 'text')
      {
        //text color
        allInputs = $('.left').find('input[type=text]');
      }
      else
      {
        //background color
        allInputs = $('.right').find('input[type=text]');
      }
      allInputs.each(function (index, element) {
        hexCode+= t.convertToHex($(element).val());
      });
      return hexCode;
    }

The local variable t stores this so that it still points to the current scope inside the callback. Another variable, allInputs, is declared, and lastly a variable to store the hex code is declared, with its value initially set to #. Next comes the if condition, which checks the value of the sliderType parameter. If the value of sliderType is text, we need all the spinner values that make up the text color. Hence, we use jQuery's find method to retrieve all input boxes inside div.left. If the value of sliderType is bg, we need to change the background color. Therefore, the else block is executed and all input boxes inside div.right are retrieved.

To convert the color to hex, the individual values for red, green, and blue have to be converted to hex and then concatenated to get the full color code. Therefore, we iterate over the inputs using the .each method. Another method, convertToHex, is called, which converts the value of a single input to hex. Inside the each callback, we keep concatenating the hex values of the R, G, and B components to the hexCode variable. Once all iterations are done, we return hexCode to the calling function, where it is used.

Converting to hex

convertToHex is a small method that accepts a value and converts it to its hex equivalent. Here is the definition of the convertToHex method:

    convertToHex : function (val)
    {
      var x  = parseInt(val, 10).toString(16);
      return x.length == 1 ? "0" + x : x;
    }

Inside the method, we first convert the received value to an integer using parseInt and then use JavaScript's toString method with a radix of 16 to convert it to hex. In the next line, we check the length of the converted hex value. Since we want the 6-character hash notation for color (such as #ff00ff), we need two characters each for red, green, and blue. If the hex value is only one character long, we prepend a 0 to make it two characters. The hex value is then returned to the calling function.

With this, our implementation is complete and we can check it in a browser. Load the page in your browser and play with the sliders and spinners. You will see the text or background color change based on their values. You will also see the hex code displayed below the sliders. Also note that changing a slider changes the value of the corresponding spinner and vice versa.

Improving the Colorpicker

This was a very basic tool that we built. You can add many more features to it and enhance its functionality. Here are some ideas to get you started:

  • Convert it into a widget where all the required DOM for sliders and spinners is created dynamically
  • Instead of two sliders, incorporate the text and background changing ability into a single slider with two handles, but keep two spinners as usual

Summary

In this article, we created a basic color picker/changer using sliders and spinners. You can use it to view and change the colors of your pages dynamically.
Packt
02 Mar 2015
24 min read

Model-View-ViewModel

In this article by Einar Ingebrigtsen, author of the book SignalR Blueprints, we will focus on a different programming model for client development: Model-View-ViewModel (MVVM). It will reiterate what you have already learned about SignalR, but you will also start to see a recurring theme in how you should architect decoupled software that adheres to the SOLID principles. It will also show the benefit of thinking in terms of a Single Page Application (SPA), and how well SignalR fits with this idea.

The goal – an imagined dashboard

A common counterpart to any application is something that monitors its health: is it running, and are there any failures? Getting this information in real time when a failure occurs is important, and getting some statistics out of it is interesting as well. From a SignalR perspective, we will still use the hub abstraction to do pretty much what we have been doing, but the goal is to give ideas of how and for what we can use SignalR. Another goal is to dive into the architectural patterns, making the code ready for larger applications. MVVM allows better separation and is very applicable to client development in general.

A question that you might ask yourself is why KnockoutJS instead of something like AngularJS? To a certain degree, it boils down to personal preference. AngularJS is described as MVW, where W stands for Whatever. I find AngularJS less focused on the same things I focus on, and I also find it very verbose to get up and running. I'm not in any way an expert in AngularJS, but I have used it on a project and I found myself writing a lot to make it work the way I wanted it to in terms of MVVM. However, I don't think it's fair to compare the two. KnockoutJS is very focused on what it's trying to solve, which is just a little piece of the puzzle, while AngularJS is a full end-to-end client framework.
On this note, let's just jump straight to it.

Decoupling it all

MVVM is a pattern for client development that became very popular in the XAML stack, enabled by Microsoft and based on Martin Fowler's Presentation Model. Its principle is that you have a ViewModel that holds the state and exposes behavior that can be utilized from a view. The view observes any changes to the state the ViewModel exposes, leaving the ViewModel totally unaware that there is a view. The ViewModel is decoupled, can be exercised in isolation, and is perfect for automated testing. Part of the state that the ViewModel typically holds is the model, which is something it usually gets from the server, and a SignalR hub is the perfect transport to get this. It boils down to recognizing the different concerns that make up the frontend and separating them all. This gives us the following diagram:

Back to basics

This time we will go back in time, going down what might be considered a more purist path: use the browser technologies (HTML, JavaScript, and CSS) and don't rely on any server-side rendering. Clients today are powerful and very capable, and offloading the composition of what the user sees onto the client frees up server resources. You can also rely on the infrastructure of the Web for caching, with static HTML files not rendered by the server. In fact, you could put these resources on a content delivery network, making the files available as close as possible to the end user. This would result in better load times for the user.

You might have other reasons to perform server-side rendering rather than serving plain HTML. Leveraging existing infrastructure or third-party tools could be among those reasons. It boils down to what's right for you. But this particular sample will focus on things that the client can do. Anyway, let's get started.

Open Visual Studio and create a new project by navigating to FILE | New | Project.
The following dialog box will show up: From the left-hand side menu, select Web and then ASP.NET Web Application. Enter Chapter4 in the Name textbox and select your location. Select the Empty template from the template selector and make sure you deselect the Host in the cloud option. Then, click on OK, as shown in the following screenshot:

Setting up the packages

First, we want Twitter Bootstrap. To get this, follow these steps:

1. Add a NuGet package reference. Right-click on References in Solution Explorer, select Manage NuGet Packages, and type Bootstrap in the search dialog box. Select it and then click on Install.
2. We want a slightly different look, so we'll download one of the many Bootstrap themes out there. Add a NuGet package reference called metro-bootstrap.
3. As jQuery is still a part of this, let's add a NuGet package reference to it as well.
4. For the MVVM part, we will use something called KnockoutJS; add it through NuGet as well.
5. Add a NuGet package reference, as in the previous steps, but this time, type SignalR in the search dialog box. Find the package called Microsoft ASP.NET SignalR.

Making any SignalR hubs available for the client

Add a file called Startup.cs to the root of the project. Add a Configuration method that will expose any SignalR hubs, as follows (IAppBuilder lives in the Owin namespace, so the file needs a using Owin; statement):

    public void Configuration(IAppBuilder app)
    {
        app.MapSignalR();
    }

At the top of the Startup.cs file, above the namespace declaration but right below the using statements, add the following code (OwinStartupAttribute lives in the Microsoft.Owin namespace):

    [assembly: OwinStartupAttribute(typeof(Chapter4.Startup))]

Knocking it out of the park

KnockoutJS is a framework that implements a lot of the principles found in MVVM and makes them easier to apply.
We're going to use the following two features of KnockoutJS, and it's therefore important to understand what they are and what significance they have:

  • Observables: In order for a view to know when a state change occurs in a ViewModel, KnockoutJS has something called an observable for single objects or values and an observable array for arrays.
  • BindingHandlers: In the view, the counterparts that are able to recognize the observables and know how to deal with their content are known as BindingHandlers. We create binding expressions in the view that instruct the view to get its content from the properties found in the binding context. The default binding context will be the ViewModel, but there are more advanced scenarios where this changes. In fact, there is a BindingHandler called with that enables you to specify the context at any given time.

Our single page

Whether one should strive towards having an SPA is widely discussed on the Web these days. My opinion on the subject is that, in the interest of the user, we should really try to push things in this direction. Not having to post back and cause a full reload of the page and all its resources, and then get back into the correct state, gives the user a better experience. Some of the arguments for performing post-backs every now and then go in the direction of fixing potential memory leaks happening in the browser. Although the technique works and achieves the result, it really just camouflages a problem one has in the system. However, as with everything, it really depends on the situation.

At the core of an SPA is a single page (pun intended), which is usually the index.html file sitting at the root of the project. Add a new HTML file (index.html) at the root of the project by right-clicking on the Chapter4 project in Solution Explorer. Navigate to Add | New Item | Web from the left-hand side menu, and then select HTML Page and name it index.html.
Finally, click on Add.

Let's put in the things we've added dependencies to, starting with the style sheets. In the index.html file, you'll find the <head> tag; add the following code snippet under the <title></title> tag:

    <link href="Content/bootstrap.min.css" rel="stylesheet" />
    <link href="Content/metro-bootstrap.min.css" rel="stylesheet" />

Next, add the following code snippet right beneath the preceding code:

    <script type="text/javascript" src="Scripts/jquery-1.9.0.min.js"></script>
    <script type="text/javascript" src="Scripts/jquery.signalR-2.1.1.js"></script>
    <script type="text/javascript" src="signalr/hubs"></script>
    <script type="text/javascript" src="Scripts/knockout-3.2.0.js"></script>

Another thing we will need is something that helps us visualize things; Google has a free charting library that we will use. We will take a dependency on the JavaScript APIs from Google. To do this, add the following script tag after the others:

    <script type="text/javascript" src="https://www.google.com/jsapi"></script>

Now, we can start filling in the view part. Inside the <body> tag, we start by putting in a header, as shown here:

    <div class="navbar navbar-default navbar-static-top bsnavbar">
        <div class="container">
            <div class="navbar-header">
                <h1>My Dashboard</h1>
            </div>
        </div>
    </div>

The server side of things

In this little dashboard, we will look at web requests, both successful and failed. We will do this in a very naive way, without having to flesh out a full mechanism to deal with error situations. Let's start by making all requests, even those for static resources such as HTML files, run through all HTTP modules. A word of warning: there are performance implications of putting all requests through the managed pipeline, so normally you wouldn't want to do this on a production system, but for this sample, it will be fine to show the concepts.
Open Web.config in the project and add the following code snippet within the <configuration> tag: <system.webServer>   <modules runAllManagedModulesForAllRequests="true" /> </system.webServer> The hub In this sample, we will only have one hub, the one that will be responsible for dealing with reporting requests and failed requests. Let's add a new class called RequestStatisticsHub. Right-click on the project in Solution Explorer, select Class from Add, name it RequestStatisticsHub.cs, and then click on Add. The new class should inherit from the hub. Add the following using statement at the top: using Microsoft.AspNet.SignalR; We're going to keep a track of the count of requests and failed requests per time with a resolution of not more than every 30 seconds in the memory on the server. Obviously, if one wants to scale across multiple servers, this is way too naive and one should choose an out-of-process shared key-value store that goes across servers. However, for our purpose, this will be fine. Let's add a using statement at the top, as shown here: using System.Collections.Generic; At the top of the class, add the two dictionaries that we will use to hold this information: static Dictionary<string, int> _requestsLog = new Dictionary<string, int>(); static Dictionary<string, int> _failedRequestsLog = new Dictionary<string, int>(); In our client, we want to access these logs at startup. So let's add two methods to do so: public Dictionary<string, int> GetRequests() {     return _requestsLog; }   public Dictionary<string, int> GetFailedRequests() {     return _failedRequestsLog; } Remember the resolution of only keeping track of number of requests per 30 seconds at a time. There is no default mechanism in the .NET Framework to do this so we need to add a few helper methods to deal with rounding of time. Let's add a class called DateTimeRounding at the root of the project. 
Mark the class as a public static class and put the following extension methods in the class:

    public static DateTime RoundUp(this DateTime dt, TimeSpan d)
    {
        var delta = (d.Ticks - (dt.Ticks % d.Ticks)) % d.Ticks;
        return new DateTime(dt.Ticks + delta);
    }

    public static DateTime RoundDown(this DateTime dt, TimeSpan d)
    {
        var delta = dt.Ticks % d.Ticks;
        return new DateTime(dt.Ticks - delta);
    }

    public static DateTime RoundToNearest(this DateTime dt, TimeSpan d)
    {
        var delta = dt.Ticks % d.Ticks;
        bool roundUp = delta > d.Ticks / 2;

        return roundUp ? dt.RoundUp(d) : dt.RoundDown(d);
    }

Let's go back to the RequestStatisticsHub class and add some more functionality now that we can deal with rounding of time:

    static void Register(Dictionary<string, int> log, Action<dynamic, string, int> hubCallback)
    {
        var now = DateTime.Now.RoundToNearest(TimeSpan.FromSeconds(30));
        var key = now.ToString("HH:mm");

        if (log.ContainsKey(key))
            log[key] = log[key] + 1;
        else
            log[key] = 1;

        var hub = GlobalHost.ConnectionManager.GetHubContext<RequestStatisticsHub>();
        hubCallback(hub.Clients.All, key, log[key]);
    }

    public static void Request()
    {
        Register(_requestsLog, (hub, key, value) => hub.requestCountChanged(key, value));
    }

    public static void FailedRequest()
    {
        // Failed requests go into their own log, not into _requestsLog
        Register(_failedRequestsLog, (hub, key, value) => hub.failedRequestCountChanged(key, value));
    }

This gives us a place to call in order to report requests, and these get published back to any clients connected to this particular hub. Note that the HH:mm key format drops the seconds, so the two 30-second slots within a minute end up under the same key; use HH:mm:ss if you want a separate entry per slot.

Note the usage of GlobalHost and its ConnectionManager property. When we want to get to a hub from outside the context of a hub method being called by a client, we use ConnectionManager to get it. It gives us a proxy for the hub and enables us to call methods on any connected client.
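The rounding-and-counting idea behind Register can also be sketched in plain JavaScript, independent of the C# project. This is only an illustration under stated assumptions: the names roundToNearest, register, and SLOT_MS are hypothetical, timestamps are plain millisecond numbers, and the slot's rounded timestamp is used as the key instead of a formatted time string.

```javascript
// Hypothetical JavaScript rendering of the bucketing idea; not part of the C# sample.
var SLOT_MS = 30 * 1000; // 30-second resolution

// Round a timestamp (in milliseconds) to the nearest multiple of slotMs.
// A timestamp exactly halfway between two slots rounds down, as in the C# helper.
function roundToNearest(timestampMs, slotMs) {
  var remainder = timestampMs % slotMs;
  return remainder > slotMs / 2
    ? timestampMs + (slotMs - remainder)
    : timestampMs - remainder;
}

// Increment the counter for the slot that timestampMs falls into
// and return the [key, count] pair that would be pushed to clients.
function register(log, timestampMs) {
  var key = String(roundToNearest(timestampMs, SLOT_MS));
  log[key] = (log[key] || 0) + 1;
  return [key, log[key]];
}

var requestsLog = {};
register(requestsLog, 10000); // rounds down to slot 0
register(requestsLog, 14000); // also slot 0
register(requestsLog, 20000); // rounds up to slot 30000
console.log(requestsLog);     // slot 0 counted twice, slot 30000 once
```

Each call touches exactly one counter, which is why the server only needs to push a single key and value to the clients per request.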
Naively dealing with requests

With all this in place, we will be able to easily and naively deal with what we consider correct and failed requests. Let's add a Global.asax file by right-clicking on the project in Solution Explorer and selecting New Item from the Add menu. Navigate to Web and find Global Application Class, then click on Add. In the new file, we want to replace the Application_AuthenticateRequest method with the following code snippet (File lives in the System.IO namespace, so add using System.IO; at the top of the file):

    protected void Application_AuthenticateRequest(object sender, EventArgs e)
    {
        var path = HttpContext.Current.Request.Path;
        if (path == "/") path = "index.html";

        if (path.ToLowerInvariant().IndexOf(".html") < 0) return;

        var physicalPath = HttpContext.Current.Request.MapPath(path);
        if (File.Exists(physicalPath))
        {
            RequestStatisticsHub.Request();
        }
        else
        {
            RequestStatisticsHub.FailedRequest();
        }
    }

Basically, with this, we are only measuring requests with .html in their path, and if the path is only "/", we assume it's "index.html". Any request for a file that does not exist is considered an error, typically a 404 error, and we register it as a failed request.

Bringing it all back to the client

With the server taken care of, we can start consuming all this in the client. We will now head down the path of creating a ViewModel and hooking everything up.

ViewModel

Let's start by adding a JavaScript file next to our index.html file at the root level of the project and call it index.js. This file will represent our ViewModel. This script will also be responsible for setting up KnockoutJS, so that the ViewModel is in fact activated and applied to the page. As we only have this one page for this sample, this will be fine.
Let's start by hooking up jQuery's document ready handler:

    $(function() {
    });

Inside the function created here, we will enter our viewModel definition, which will start off being an empty one:

    var viewModel = function() {
    };

KnockoutJS has a function to apply a viewModel to the document, meaning that the document or body will be associated with the viewModel instance given. Right under the definition of viewModel, add the following line:

    ko.applyBindings(new viewModel());

Compiling this and running it should at the very least not give you any errors, but nothing more than a header saying My Dashboard. So, we need to liven this up a bit. Inside the viewModel function definition, add the following code snippet:

    var self = this;
    this.requests = ko.observableArray();
    this.failedRequests = ko.observableArray();

We keep a reference to this in a variable called self. This will help us with scoping issues later on. The arrays we added are KnockoutJS observable arrays, which allow the view or any BindingHandler to observe the changes that are coming in.

Both ko.observableArray() and ko.observable() return a new function. So, if you want to access the value held inside, you must unwrap it by calling that function, which might seem counterintuitive at first because you might think of your variable as just another property. However, for an observableArray, KnockoutJS adds most of the functions found on the array type in JavaScript, and they can be used directly on the function without unwrapping. If you look at a variable that is an observableArray in the console of the browser, it looks as if it actually is just an array. This is not really true, though; to get to the values, you have to unwrap it by adding () after the variable. However, all the functions you're used to having on an array are there.

Let's add a function that will know how to handle an entry into the viewModel function.
An entry coming in is either an existing one or a new one; the key of the entry is the giveaway to decide:

    function handleEntry(log, key, value) {
        // some() stops at the first match and reports whether one was found;
        // forEach() always returns undefined, so it cannot be used for this
        var found = log().some(function (entry) {
            if (entry[0] == key) {
                entry[1](value);
                return true;
            }
            return false;
        });

        if (!found) {
            log.push([key, ko.observable(value)]);
        }
    }

Let's set up the hub and add the following code to the viewModel function:

    var hub = $.connection.requestStatisticsHub;
    var initializedCount = 0;

    hub.client.requestCountChanged = function (key, value) {
        if (initializedCount < 2) return;
        handleEntry(self.requests, key, value);
    };

    hub.client.failedRequestCountChanged = function (key, value) {
        if (initializedCount < 2) return;
        handleEntry(self.failedRequests, key, value);
    };

You might notice the initializedCount variable. Its purpose is to ignore incoming updates until both logs have been loaded, which comes next. Add the following code snippet to the viewModel function:

    $.connection.hub.start().done(function () {
        hub.server.getRequests().done(function (requests) {
            for (var property in requests) {
                handleEntry(self.requests, property, requests[property]);
            }

            initializedCount++;
        });
        hub.server.getFailedRequests().done(function (requests) {
            for (var property in requests) {
                handleEntry(self.failedRequests, property, requests[property]);
            }

            initializedCount++;
        });
    });

We should now have enough logic in our viewModel function to get any requests already sitting there and also respond to new ones coming in.

BindingHandler

The key element of KnockoutJS is its BindingHandler mechanism. In KnockoutJS, everything starts with a data-bind="" attribute on an element in the HTML view. Inside the attribute, one puts binding expressions, and the BindingHandlers are key to this. Every expression starts with the name of the handler.
For instance, if you have an <input> tag and you want to get the value from the input into a property on the ViewModel, you would use the value BindingHandler. There are a few BindingHandlers out of the box to deal with the common scenarios (text, value, foreach, and more). All of the BindingHandlers are very well documented on the KnockoutJS site. For this sample, we will actually create our own BindingHandler. KnockoutJS is highly extensible and allows you to do just this, amongst other extensibility points.

Let's add a JavaScript file called googleCharts.js at the root of the project. Inside it, add the following code:

    google.load('visualization', '1.0', { 'packages': ['corechart'] });

This will tell the Google API to enable the charting package. The next thing we want to do is to define the BindingHandler. Any handler has the option of setting up an init function and an update function. The init function should only be called once, when the binding is first initialized. Actually, it's when the binding context is set. If the parent binding context of the element changes, it will be called again. The update function will be called whenever there is a change in one or more of the observables that the binding expression refers to. For our sample, we will use the init function only and respond to changes manually, because we have a more involved scenario than what the default mechanism would provide us with. Let's add the following code underneath the load call:

    ko.bindingHandlers.lineChart = {
        init: function (element, valueAccessor, allValueAccessors, viewModel, bindingContext) {
        }
    };

This is the core structure of a BindingHandler. As you can see, we've named the BindingHandler lineChart. This is the name we will use in our view later on. The init and update functions share the same signature.
The first parameter represents the element that holds the binding expression, whereas the second parameter, valueAccessor, holds a function that enables us to access the value that results from the expression. KnockoutJS deals with the expression internally; it parses any expression and figures out how to expand any values, and so on. Add the following code into the init function:

    // Declared with var so that each chart gets its own options object
    var optionsInput = valueAccessor();

    var options = {
        title: optionsInput.title,
        width: optionsInput.width || 300,
        height: optionsInput.height || 300,
        backgroundColor: 'transparent',
        animation: {
            duration: 1000,
            easing: 'out'
        }
    };

    var dataHash = {};

    var chart = new google.visualization.LineChart(element);
    var data = new google.visualization.DataTable();
    data.addColumn('string', 'x');
    data.addColumn('number', 'y');

    function addRow(row, rowIndex) {
        var value = row[1];
        if (ko.isObservable(value)) {
            value.subscribe(function (newValue) {
                data.setValue(rowIndex, 1, newValue);
                chart.draw(data, options);
            });
        }

        var actualValue = ko.unwrap(value);
        data.addRow([row[0], actualValue]);

        dataHash[row[0]] = actualValue;
    }

    optionsInput.data().forEach(addRow);

    optionsInput.data.subscribe(function (newValue) {
        newValue.forEach(function (row, rowIndex) {
            if (!dataHash.hasOwnProperty(row[0])) {
                addRow(row, rowIndex);
            }
        });

        chart.draw(data, options);
    });

    chart.draw(data, options);

As you can see, observables have a function called subscribe(), which is the same for both an observable array and a regular observable. The code adds a subscription to the array itself; if there is any change to the array, we find the change and add any new row to the chart. In addition, when we create a new row, we subscribe to any change in its value so that we can update the chart.
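To make the subscription mechanism more concrete, here is a minimal sketch of the idea behind an observable: a function that returns its value when called without arguments, stores a new value when called with one, and notifies its subscribers on every write. This is a simplified stand-in written for illustration and is not KnockoutJS's actual implementation; the name observable is reused here only for readability.

```javascript
// Minimal stand-in for an observable; NOT KnockoutJS's implementation.
function observable(initialValue) {
  var current = initialValue;
  var subscribers = [];

  function accessor(newValue) {
    if (arguments.length === 0) {
      return current;                      // called with no arguments: read
    }
    current = newValue;                    // called with an argument: write
    subscribers.forEach(function (callback) {
      callback(current);                   // notify everyone who subscribed
    });
  }

  accessor.subscribe = function (callback) {
    subscribers.push(callback);
  };

  return accessor;
}

// Usage: record every value a "chart" would have to redraw with.
var redraws = [];
var count = observable(1);
count.subscribe(function (newValue) { redraws.push(newValue); });

count(2);
count(3);
console.log(count(), redraws);
```

The subscriber is called once per write, which is the hook the lineChart handler relies on to update a single cell and redraw the chart.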
In the ViewModel, the values were converted into observable values to accommodate this.

View

Go back to the index.html file; we need the UI for the two charts we're going to have. Plus, we need to load both the new BindingHandler and the ViewModel. Add the following script references after the last script reference already present, as shown here:

    <script type="text/javascript" src="googleCharts.js"></script>
    <script type="text/javascript" src="index.js"></script>

Inside the <body> tag, below the header, we want to add a Bootstrap container and a row to hold two metro-styled tiles and utilize our new BindingHandler. Also, we want a footer sitting at the bottom, as shown in the following code:

    <div class="container">
        <div class="row">
            <div class="col-sm-6 col-md-4">
                <div class="thumbnail tile tile-green-sea tile-large">
                    <div data-bind="lineChart: { title: 'Web Requests', width: 300, height: 300, data: requests }"></div>
                </div>
            </div>

            <div class="col-sm-6 col-md-4">
                <div class="thumbnail tile tile-pomegranate tile-large">
                    <div data-bind="lineChart: { title: 'Failed Web Requests', width: 300, height: 300, data: failedRequests }"></div>
                </div>
            </div>
        </div>

        <hr />
        <footer class="bs-footer" role="contentinfo">
            <div class="container">
                The Dashboard
            </div>
        </footer>
    </div>

Note that data: requests and data: failedRequests are part of the binding expressions. These will be handled and resolved by KnockoutJS internally and pointed to the observable arrays on the ViewModel. The other properties are options that go into the BindingHandler, which forwards them to the Google Charting APIs.

Trying it all out

Running the preceding code (Ctrl + F5) should yield the following result: If you open a second browser and go to the same URL, you will see the change in the chart in real time.
Waiting approximately 30 seconds and refreshing the browser should add a second point automatically and also animate the chart accordingly. Typing a URL with a file that does not exist should have the same effect on the failed requests chart.

Summary

In this article, we had a brief encounter with MVVM as a pattern with the sole purpose of establishing good practices for your client code. We applied this in a single page application setting, sprinkling SignalR on top to communicate from the server to any connected client.

Packt
02 Mar 2015
19 min read

Dealing with Interrupts

This article is written by Francis Perea, the author of the book Arduino Essentials. In all our previous projects, we have been constantly looking for events to occur. We have been polling, but looking for events to occur supposes a relatively big effort and a waste of CPU cycles to only notice that nothing happened. In this article, we will learn about interrupts as a totally new way to deal with events, being notified about them instead of looking for them constantly. Interrupts may be really helpful when developing projects in which fast or unknown events may occur, and thus we will see a very interesting project which will lead us to develop a digital tachograph for a computer-controlled motor. Are you ready? Here we go! (For more resources related to this topic, see here.) The concept of an interruption As you may have intuited, an interrupt is a special mechanism the CPU incorporates to have a direct channel to be noticed when some event occurs. Most Arduino microcontrollers have two of these: Interrupt 0 on digital pin 2 Interrupt 1 on digital pin 3 But some models, such as the Mega2560, come with up to five interrupt pins. Once an interrupt has been notified, the CPU completely stops what it was doing and goes on to look at it, by running a special dedicated function in our code called Interrupt Service Routine (ISR). When I say that the CPU completely stops, I mean that even functions such as delay() or millis() won't be updated while the ISR is being executed. 
Interrupts can be programmed to respond to different changes of the signal connected to the corresponding pin, and thus the Arduino language has four predefined constants to represent each of these four modes:
  • LOW: The interrupt is triggered whenever the pin has a LOW value
  • CHANGE: The interrupt is triggered when the pin changes its value from HIGH to LOW or vice versa
  • RISING: The interrupt is triggered when the signal goes from LOW to HIGH
  • FALLING: Just the opposite of RISING; the interrupt is triggered when the signal goes from HIGH to LOW

The ISR

The function that the CPU calls whenever an interrupt occurs is so important to the microcontroller that it has to obey a few rules:
  • It can't have any parameters
  • It can't return anything
  • Only one ISR can be executed at a time

The first two points mean that we can neither pass data to the ISR nor receive data from it directly, but we have other means to achieve this communication with the function: we will use global variables. We can set and read a global variable inside an ISR, but these variables have to be declared in a special way. We have to declare them as volatile, as we will see later on in the code.

The third point, which specifies that only one ISR can be attended at a time, is what prevents millis() from being updated. The millis() function relies on an interrupt to be updated, and this doesn't happen while another interrupt is already being served.

As you may understand, the ISR is critical to correct code execution in a microcontroller. As a rule of thumb, we will try to keep our ISRs as simple as possible and leave all heavyweight processing outside of them, in the main loop of our code.
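The four trigger modes described above can be checked on a desktop compiler before touching the board. What follows is only a minimal sketch in plain C++, not Arduino code: the TriggerMode enum and the wouldTrigger() and countTriggers() functions are hypothetical names invented for this illustration, and the signal is assumed to be a list of already sampled levels (0 for LOW, 1 for HIGH).

```cpp
// Host-side model of the four trigger modes; this is not Arduino code.
// TriggerMode, wouldTrigger() and countTriggers() are hypothetical names
// that only mirror the LOW, CHANGE, RISING and FALLING constants.
enum class TriggerMode { Low, Change, Rising, Falling };

// Returns true if an interrupt configured with `mode` would fire when the
// pin level goes from `previous` to `current` (0 = LOW, 1 = HIGH).
bool wouldTrigger(TriggerMode mode, int previous, int current) {
  switch (mode) {
    case TriggerMode::Low:     return current == 0;
    case TriggerMode::Change:  return previous != current;
    case TriggerMode::Rising:  return previous == 0 && current == 1;
    case TriggerMode::Falling: return previous == 1 && current == 0;
  }
  return false;
}

// Counts how many times `mode` fires over a sequence of sampled levels.
int countTriggers(TriggerMode mode, const int* levels, int length) {
  int count = 0;
  for (int i = 1; i < length; ++i) {
    if (wouldTrigger(mode, levels[i - 1], levels[i])) {
      ++count;
    }
  }
  return count;
}
```

Feeding it the sequence 0, 1, 1, 0, 1, 0 gives two RISING triggers, two FALLING triggers, and four CHANGE triggers. In other words, a pulse that goes up and comes back down is seen once by RISING or FALLING but twice by CHANGE, which is worth keeping in mind when the goal is to count events.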
The tachograph project

To understand and manage interrupts in our projects, I would like to offer you a very particular one: a tachograph, a device that is present in all our cars and whose mission is to count revolutions, normally those of the engine, but also in brake systems such as the Anti-lock Brake System (ABS) and others.

Mechanical considerations

Well, calling it mechanical is perhaps too much, but let's consider how we are going to make our project count revolutions. For this example project, I have used a small DC motor driven through a small transistor and, as in lots of industrial applications, an encoder wheel is a perfect mechanism to read the number of revolutions. It is very easy to build one by simply attaching a small cardboard disc perpendicularly to your motor shaft. By using our old friend, the optocoupler, we can sense when something passes between its two parts, even if it is just a piece of cardboard with a small slot on one side of its surface.

Here, you can see the template I made for such a disc. The cross in the middle will help you position the disc as accurately as possible, that is, the cross should be as close as possible to the motor shaft. The slot has to be cut out of the black rectangle, as shown in the following image:

[Figure: The template for the motor encoder]

Once I had printed it, I glued it to another piece of cardboard to make it more resistant and glued the whole thing to the crown already attached to my motor shaft. If your motor doesn't have a surface big enough to glue the encoder disc to its shaft, then perhaps you can find a solution by using a small piece of dough or a similar material.

Once the encoder disc is fixed to the motor and spins with the motor shaft, we have to find a way to place the optocoupler so that it can read through the encoder disc slot.
In my case, just a couple of drops of glue did the trick, but if your optocoupler or motor doesn't allow you to apply this solution, I'm sure that a couple of zip ties or a small piece of dough can give you another way to fix it to the motor. In the following image, you can see my final assembled motor with its encoder disc and optocoupler, ready to be connected to the breadboard through alligator clips:

[Figure: The complete assembly for the motor encoder]

Once we have prepared our motor encoder, let's perform some tests to see it working and begin to write code to deal with interrupts.

A simple interrupt tester

Before going deep into the whole project code, let's perform some tests to confirm that our encoder assembly is working fine and that we can correctly trigger an interrupt whenever the motor spins and the cardboard slot passes through the optocoupler.

The only thing you have to connect to your Arduino at the moment is the optocoupler; for now we will operate our motor by hand, and in a later section, we will control its speed from the computer.

The test's circuit schematic is as follows:

[Figure: A simple circuit to test the encoder]

There is nothing new in this circuit. It is almost the same as the one used in the optical coin detector, with the only important and necessary difference being that the wire coming from the detector side of the optocoupler is connected to pin 2 of our Arduino board because, as said earlier, interrupt 0 is available only through that pin.

For this first test, we will make the encoder disc spin by hand, which allows us to clearly perceive when the interrupt triggers. For the rest of this example, we will use the LED included on the Arduino board, connected to pin 13, as a way to visually indicate that the interrupts have been triggered.

Our first interrupt and its ISR

Once we have connected the optocoupler to the Arduino and prepared things to trigger some interrupts, let's see the code that we will use to test our assembly.
The objective of this simple sketch is to toggle the status of an LED every time an interrupt occurs. In the proposed tester circuit, the LED status variable will be changed every time the slot passes through the optocoupler:

/*
 Chapter 09 - Dealing with interrupts
 A simple tester
 By Francis Perea for Packt Publishing
*/

// A LED will be used to notify the change
#define ledPin 13

// Global variables we will use
// A variable to be used inside ISR
volatile int status = LOW;

// A function to be called when the interrupt occurs
void revolution(){
  // Invert LED status
  status=!status;
}

// Configuration of the board: just one output
void setup() {
  pinMode(ledPin, OUTPUT);
  // Assign the revolution() function as an ISR of interrupt 0
  // Interrupt will be triggered when the signal goes from
  // LOW to HIGH
  attachInterrupt(0, revolution, RISING);
}

// Sketch execution loop
void loop(){
  // Set LED status
  digitalWrite(ledPin, status);
}

Let's take a look at its most important aspects. Apart from the LED pin, we declare a variable to keep track of the changes that occur. It will be updated in the ISR of our interrupt; so, as I told you earlier, we declare it as follows:

volatile int status = LOW;

After that, we declare the ISR function, revolution(), which, as we already know, neither receives any parameter nor returns any value. And as we said earlier, it must be as simple as possible. In our test case, the ISR simply inverts the value of the global volatile variable, that is, from LOW to HIGH and from HIGH to LOW.
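The toggling behaviour just described can also be tried without any hardware. The following is a rough simulation in plain C++ rather than Arduino code: attachHandler() and signalEdge() are hypothetical stand-ins for attachInterrupt() and for the hardware raising interrupt 0, and LOW_LEVEL and HIGH_LEVEL are assumed replacements for the LOW and HIGH constants.

```cpp
// Host-side simulation of the tester sketch; this is not Arduino code.
// attachHandler() and signalEdge() are hypothetical stand-ins for
// attachInterrupt() and for the hardware raising interrupt 0.
const int LOW_LEVEL = 0;
const int HIGH_LEVEL = 1;

// Shared between the "ISR" and the main loop, hence declared volatile.
volatile int status = LOW_LEVEL;

// An ISR takes no parameters and returns nothing.
using Handler = void (*)();
Handler attachedHandler = nullptr;

// The ISR itself: as short as possible, it only inverts the status.
void revolution() {
  status = (status == LOW_LEVEL) ? HIGH_LEVEL : LOW_LEVEL;
}

// Registers the function to call; note it is passed without parentheses.
void attachHandler(Handler handler) {
  attachedHandler = handler;
}

// Simulates the slot passing through the optocoupler once.
void signalEdge() {
  if (attachedHandler != nullptr) {
    attachedHandler();
  }
}

// The value that loop() would write to the LED pin.
int ledLevel() {
  return status;
}
```

After calling attachHandler(revolution), each call to signalEdge() flips the value returned by ledLevel(), which is the same LED toggling the real sketch should show each time the slot passes.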
To allow our ISR to be called whenever interrupt 0 occurs, in the setup() function, we make a call to the attachInterrupt() function by passing three parameters to it:
  • Interrupt: The interrupt number to assign the ISR to
  • ISR: The name, without the parentheses, of the function that will act as the ISR for this interrupt
  • Mode: One of the modes explained earlier, which defines exactly when the interrupt will be triggered

In our case, the actual statement is as follows:

attachInterrupt(0, revolution, RISING);

This makes the revolution() function the ISR of interrupt 0, which will be triggered when the signal goes from LOW to HIGH.

Finally, in our main loop there is little to do: simply update the LED based on the current value of the status variable, which is updated inside the ISR.

If everything went right, you should see the LED toggle every time the slot passes through the optocoupler, as a consequence of the interrupt being triggered and the revolution() function inverting the value of the status variable that is used in the main loop to set the LED accordingly.

A dial tachograph

For a more complete example in this section, we will build a tachograph, a device that will present the current revolutions per minute of the motor in a visual manner by using a dial. The motor speed will be commanded serially from our computer by reusing some of the code from our previous projects.

It is not going to be very complicated if we also include some way to warn about an excessive number of revolutions and even cut the engine in an extreme case to protect it, is it?

The complete schematic of such a big circuit is shown in the following image.
Don't get scared by the number of components, as we have already seen them all in action before:

[Figure: The tachograph circuit]

As you may see, we will use a total of five pins of our Arduino board to sense and command this set of peripherals:
  • Pin 2: This is the interrupt 0 pin, and thus it will be used to connect the output of the optocoupler.
  • Pin 3: It will be used to drive the servo that moves the dial.
  • Pin 4: We will use this pin to activate a sound alarm once the engine current has been cut off to prevent overload.
  • Pin 6: This pin will be used to drive the motor transistor, which allows us to vary the motor speed based on the commands we receive serially. Remember to use a PWM pin if you choose to use another one.
  • Pin 13: Used to indicate with an LED an excessive number of revolutions per minute prior to cutting the engine off.

There are also two more pins which, although not wired to any component, will be used: pins 0 and 1, given that we are going to talk to the device serially from the computer.

Breadboard connections diagram

There are some crossed wires in the previous schematic, and perhaps you can see the connections better in the following breadboard connection image:

[Figure: Breadboard connection diagram for the tachograph]

The complete tachograph code

This is going to be a project full of features, and that is why it has such a number of devices to interact with.
Let's summarize the features of the dial tachograph:
  • The motor speed is commanded from the computer via serial communication with up to five commands:
      • Increase motor speed (+)
      • Decrease motor speed (-)
      • Totally stop the motor (0)
      • Put the motor at full throttle (*)
      • Reset the motor after a stall (R)
  • Motor revolutions will be detected and counted by using an encoder and an optocoupler
  • Current revolutions per minute will be visually presented with a dial operated by a servomotor
  • It gives a visual indication via an LED of a high number of revolutions
  • In case a maximum number of revolutions is reached, the motor current will be cut off and an acoustic alarm will sound

With such a number of features, it is normal that the code for this project is going to be a bit longer than our previous sketches. Here is the code:

/*
 Chapter 09 - Dealing with interrupts
 Complete tachograph system
 By Francis Perea for Packt Publishing
*/

#include <Servo.h>

// The pins that will be used
#define ledPin 13
#define motorPin 6
#define buzzerPin 4
#define servoPin 3

#define NOTE_A4 440
// Milliseconds between every sample
#define sampleTime 500
// Motor speed increment
#define motorIncrement 10
// Range of valid RPMs, alarm and stop
#define minRPM 0
#define maxRPM 10000
#define alarmRPM 8000
#define stopRPM 9000

// Global variables we will use
// A variable to be used inside ISR
volatile unsigned long revolutions = 0;
// Total number of revolutions in every sample
long lastSampleRevolutions = 0;
// A variable to convert revolutions per sample to RPM
int rpm = 0;
// LED Status
int ledStatus = LOW;
// An instance of the Servo class
Servo myServo;
// A flag to know if the motor has been stalled
boolean motorStalled = false;
// The current dial angle
int dialAngle = 0;
// A variable to store serial data
int dataReceived;
// The current motor speed
int speed = 0;
// A time variable to compare in every sample
unsigned long lastCheckTime;

// A function to be called when the interrupt occurs
void revolution(){
  // Increment the total number of
  // revolutions in the current sample
  revolutions++;
}

// Configuration of the board
void setup() {
  // Set output pins
  pinMode(motorPin, OUTPUT);
  pinMode(ledPin, OUTPUT);
  pinMode(buzzerPin, OUTPUT);
  // Set revolution() as ISR of interrupt 0
  // Note: CHANGE fires on both edges of the slot, so each pass
  // is counted twice; RISING would count each pass once
  attachInterrupt(0, revolution, CHANGE);
  // Init serial communication
  Serial.begin(9600);
  // Initialize the servo
  myServo.attach(servoPin);
  // Set the dial
  myServo.write(dialAngle);
  // Initialize the counter for sample time
  lastCheckTime = millis();
}

// Sketch execution loop
void loop(){
  // If we have received serial data
  if (Serial.available()) {
    // Read the next char
    dataReceived = Serial.read();
    // Act depending on it
    switch (dataReceived){
      // Increment speed
      case '+':
        if (speed<250) {
          speed += motorIncrement;
        }
        break;
      // Decrement speed
      case '-':
        if (speed>5) {
          speed -= motorIncrement;
        }
        break;
      // Stop motor
      case '0':
        speed = 0;
        break;
      // Full throttle
      case '*':
        speed = 255;
        break;
      // Reactivate motor after stall
      case 'R':
        speed = 0;
        motorStalled = false;
        break;
    }
    // Only if motor is active set new motor speed
    if (motorStalled == false){
      // Set the motor speed
      analogWrite(motorPin, speed);
    }
  }
  // If a sample time has passed
  // We have to take another sample
  if (millis() - lastCheckTime > sampleTime){
    // Pause interrupts so the ISR cannot modify the
    // multi-byte counter while we copy and reset it
    noInterrupts();
    // Store current revolutions
    lastSampleRevolutions = revolutions;
    // Reset the global variable
    // So the ISR can begin to count again
    revolutions = 0;
    interrupts();
    // Calculate revolutions per minute
    rpm = lastSampleRevolutions * (1000 / sampleTime) * 60;
    // Update last sample time
    lastCheckTime = millis();
    // Set the dial according to the new reading
    dialAngle = map(rpm,minRPM,maxRPM,180,0);
    myServo.write(dialAngle);
  }
  // If the motor is running in the red zone
  if (rpm > alarmRPM){
    // Turn on LED
    digitalWrite(ledPin, HIGH);
  }
  else{
    // Otherwise turn it off
    digitalWrite(ledPin, LOW);
  }
  // If the motor has exceeded maximum RPM
  if (rpm > stopRPM){
    // Stop the motor
    speed = 0;
    analogWrite(motorPin, speed);
    // Disable it until an 'R' command is received
    motorStalled = true;
    // Make alarm sound
    tone(buzzerPin, NOTE_A4, 1000);
  }
  // Send data back to the computer
  Serial.print("RPM: ");
  Serial.print(rpm);
  Serial.print(" SPEED: ");
  Serial.print(speed);
  Serial.print(" STALL: ");
  Serial.println(motorStalled);
}

I think this is the first time in this article that there is nothing in the code that hasn't already been explained before. I have commented everything so that the code can be easily read and understood.

In general terms, the code declares the constants and global variables that will be used, as well as the ISR for the interrupt.

In the setup section, all the subsystems that need to be set up before use are initialized: pins, interrupts, serial communication, and the servo.

The main loop begins by looking for serial commands; it basically updates the speed value and clears the stall flag if the R command is received. The motor speed is only actually set if the stall flag is off; the flag is turned on when the motor exceeds the stopRPM value.

Next in the main loop, the code checks whether a sample time has passed, in which case the revolutions are stored to compute the real revolutions per minute (rpm), and the global revolutions counter incremented inside the ISR is set to 0 to begin again.

The current rpm value is mapped to an angle to be presented by the dial, and the servo is set accordingly.
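The sampling arithmetic and the dial mapping can be verified on a desktop compiler with a few concrete numbers. The following is only a sketch in plain C++, not Arduino code: the constants reuse the values from the sketch, while rpmFromCount(), mapRange(), dialAngleFor(), and zoneFor() are hypothetical helper names, and mapRange() is assumed to follow the integer formula documented for Arduino's map().

```cpp
// Host-side model of the sampling arithmetic; this is not Arduino code.
// Constant values follow the sketch; the function names are hypothetical.
const long sampleTime = 500;  // milliseconds per sample
const long minRPM = 0;
const long maxRPM = 10000;
const long alarmRPM = 8000;
const long stopRPM = 9000;

// Same arithmetic as the sketch: samples per second, times 60 seconds.
long rpmFromCount(long countInSample) {
  return countInSample * (1000 / sampleTime) * 60;
}

// Integer linear mapping, assumed to mirror the formula documented
// for Arduino's map().
long mapRange(long x, long inMin, long inMax, long outMin, long outMax) {
  return (x - inMin) * (outMax - outMin) / (inMax - inMin) + outMin;
}

// Dial angle: 180 degrees at minRPM down to 0 degrees at maxRPM.
long dialAngleFor(long rpm) {
  return mapRange(rpm, minRPM, maxRPM, 180, 0);
}

enum class Zone { Normal, Alarm, Stop };

// Classifies a reading using the same thresholds as the sketch.
Zone zoneFor(long rpm) {
  if (rpm > stopRPM) {
    return Zone::Stop;
  }
  if (rpm > alarmRPM) {
    return Zone::Alarm;
  }
  return Zone::Normal;
}
```

With a 500 ms sample, every counted interrupt is worth 120 RPM, so a count of 50 gives 6000 RPM and a dial angle of 72 degrees. A count of 70 (8400 RPM) lands in the alarm zone and a count of 76 (9120 RPM) in the stop zone. Unlike this model, the real sketch keeps the LED on in the stop zone as well, because both of its conditions are true there.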
Next, a pair of checks is made:
  • One to see if the motor is getting into the red zone by exceeding the alarmRPM value, in which case the alarm LED is turned on
  • Another to check if the stopRPM value has been exceeded, in which case the motor is automatically cut off, the motorStalled flag is set to true, and the acoustic alarm is triggered

When the motor has been stalled, it won't accept changes in its speed until it has been reset by issuing an R command via serial communication.

As its last action, the code sends some information back to the Serial Monitor as another form of feedback for the operator at the computer, and this should look something like the following screenshot:

[Figure: Serial Monitor showing the tachograph in action]

Modular development

It has been quite a complex project in that it incorporates up to six different subsystems: optocoupler, motor, LED, buzzer, servo, and serial, but it has also helped us to understand that projects need to be developed by using a modular approach.

We have worked on and tested every one of these subsystems before, and that is the way it should usually be done. By developing your projects in such a modular way, it will be easy to assemble and program the whole system.

As you may see in the following screenshot, only by using such a modular way of working will you be able to connect and understand such a mess of wires:

[Figure: A working desktop may get a bit messy]

Summary

I'm sure you have got the point regarding interrupts with all the things we have seen in this article.

We have learned what an interrupt is and how the CPU attends to it by running an ISR, and we have even learned about the special characteristics and restrictions of ISRs and that we should keep them as short as possible.

On the programming side, the only thing necessary to work with interrupts is to correctly attach the ISR with a call to the attachInterrupt() function.
From the point of view of hardware, we have assembled an encoder that has been attached to a spinning motor to count its revolutions.

Finally, we have the code. We have seen a relatively long sketch, which is a sign that we are beginning to master the platform and are able to deal with a larger number of peripherals. It also shows that our projects require more complex software every time we have to deal with these peripherals and accomplish all the other tasks needed to meet the project specifications.

Resources for Article:

Further resources on this subject:
  • The Arduino Mobile Robot? [article]
  • Using the Leap Motion Controller with Arduino [article]
  • Android and Udoo Home Automation [article]