How-To Tutorials

26 Mar 2015

35 min read

Cassandra Architecture

26 Mar 2015

In this article by Nishant Neeraj, the author of the book Mastering Apache Cassandra - Second Edition, aims to set you into a perspective where you will be able to see the evolution of the NoSQL paradigm. It will start with a discussion of common problems that an average developer faces when the application starts to scale up and software components cannot keep up with it. Then, we'll see what can be assumed as a thumb rule in the NoSQL world: the CAP theorem that says to choose any two out of consistency, availability, and partition-tolerance. As we discuss this further, we will realize how much more important it is to serve the customers (availability), than to be correct (consistency) all the time. However, we cannot afford to be wrong (inconsistent) for a long time. The customers wouldn't like to see that the items are in stock, but that the checkout is failing. Cassandra comes into picture with its tunable consistency. (For more resources related to this topic, see here.) Problems in the RDBMS world RDBMS is a great approach. It keeps data consistent, it's good for OLTP (http://en.wikipedia.org/wiki/Online_transaction_processing), it provides access to good grammar, and manipulates data supported by all the popular programming languages. It has been tremendously successful in the last 40 years (the relational data model was proposed in its first incarnation by Codd, E.F. (1970) in his research paper A Relational Model of Data for Large Shared Data Banks). However, in early 2000s, big companies such as Google and Amazon, which have a gigantic load on their databases to serve, started to feel bottlenecked with RDBMS, even with helper services such as Memcache on top of them. As a response to this, Google came up with BigTable (http://research.google.com/archive/bigtable.html), and Amazon with Dynamo (http://www.cs.ucsb.edu/~agrawal/fall2009/dynamo.pdf). If you have ever used RDBMS for a complicated web application, you must have faced problems such as slow queries due to complex joins, expensive vertical scaling, and problems in horizontal scaling. Due to these problems, indexing takes a long time. At some point, you may have chosen to replicate the database, but there was still some locking, and this hurts the availability of the system. This means that under a heavy load, locking will cause the user's experience to deteriorate. Although replication gives some relief, a busy slave may not catch up with the master (or there may be a connectivity glitch between the master and the slave). Consistency of such systems cannot be guaranteed. Consistency, the property of a database to remain in a consistent state before and after a transaction, is one of the promises made by relational databases. It seems that one may need to make compromises on consistency in a relational database for the sake of scalability. With the growth of the application, the demand to scale the backend becomes more pressing, and the developer teams may decide to add a caching layer (such as Memcached) at the top of the database. This will alleviate some load off the database, but now the developers will need to maintain the object states in two places: the database, and the caching layer. Although some Object Relational Mappers (ORMs) provide a built-in caching mechanism, they have their own issues, such as larger memory requirement, and often mapping code pollutes application code. In order to achieve more from RDBMS, we will need to start to denormalize the database to avoid joins, and keep the aggregates in the columns to avoid statistical queries. Sharding or horizontal scaling is another way to distribute the load. Sharding in itself is a good idea, but it adds too much manual work, plus the knowledge of sharding creeps into the application code. Sharded databases make the operational tasks (backup, schema alteration, and adding index) difficult. To find out more about the hardships of sharding, visit http://www.mysqlperformanceblog.com/2009/08/06/why-you-dont-want-to-shard/. There are ways to loosen up consistency by providing various isolation levels, but concurrency is just one part of the problem. Maintaining relational integrity, difficulties in managing data that cannot be accommodated on one machine, and difficult recovery, were all making the traditional database systems hard to be accepted in the rapidly growing big data world. Companies needed a tool that could support hundreds of terabytes of data on the ever-failing commodity hardware reliably. This led to the advent of modern databases like Cassandra, Redis, MongoDB, Riak, HBase, and many more. These modern databases promised to support very large datasets that were hard to maintain in SQL databases, with relaxed constrains on consistency and relation integrity. Enter NoSQL NoSQL is a blanket term for the databases that solve the scalability issues which are common among relational databases. This term, in its modern meaning, was first coined by Eric Evans. It should not be confused with the database named NoSQL (http://www.strozzi.it/cgi-bin/CSA/tw7/I/en_US/nosql/Home%20Page). NoSQL solutions provide scalability and high availability, but may not guarantee ACID: atomicity, consistency, isolation, and durability in transactions. Many of the NoSQL solutions, including Cassandra, sit on the other extreme of ACID, named BASE, which stands for basically available, soft-state, eventual consistency. Wondering about where the name, NoSQL, came from? Read Eric Evans' blog at http://blog.sym-link.com/2009/10/30/nosql_whats_in_a_name.html. The CAP theorem In 2000, Eric Brewer (http://en.wikipedia.org/wiki/Eric_Brewer_%28scientist%29), in his keynote speech at the ACM Symposium, said, "A distributed system requiring always-on, highly-available operations cannot guarantee the illusion of coherent, consistent single-system operation in the presence of network partitions, which cut communication between active servers". This was his conjecture based on his experience with distributed systems. This conjecture was later formally proved by Nancy Lynch and Seth Gilbert in 2002 (Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services, published in ACMSIGACT News, Volume 33 Issue 2 (2002), page 51 to 59 available at http://webpages.cs.luc.edu/~pld/353/gilbert_lynch_brewer_proof.pdf). Let's try to understand this. Let's say we have a distributed system where data is replicated at two distinct locations and two conflicting requests arrive, one at each location, at the time of communication link failure between the two servers. If the system (the cluster) has obligations to be highly available (a mandatory response, even when some components of the system are failing), one of the two responses will be inconsistent with what a system with no replication (no partitioning, single copy) would have returned. To understand it better, let's take an example to learn the terminologies. Let's say you are planning to read George Orwell's book Nineteen Eighty-Four over the Christmas vacation. A day before the holidays start, you logged into your favorite online bookstore to find out there is only one copy left. You add it to your cart, but then you realize that you need to buy something else to be eligible for free shipping. You start to browse the website for any other item that you might buy. To make the situation interesting, let's say there is another customer who is trying to buy Nineteen Eighty-Four at the same time. Consistency A consistent system is defined as one that responds with the same output for the same request at the same time, across all the replicas. Loosely, one can say a consistent system is one where each server returns the right response to each request. In our example, we have only one copy of Nineteen Eighty-Four. So only one of the two customers is going to get the book delivered from this store. In a consistent system, only one can check out the book from the payment page. As soon as one customer makes the payment, the number of Nineteen Eighty-Four books in stock will get decremented by one, and one quantity of Nineteen Eighty-Four will be added to the order of that customer. When the second customer tries to check out, the system says that the book is not available any more. Relational databases are good for this task because they comply with the ACID properties. If both the customers make requests at the same time, one customer will have to wait till the other customer is done with the processing, and the database is made consistent. This may add a few milliseconds of wait to the customer who came later. An eventual consistent database system (where consistency of data across the distributed servers may not be guaranteed immediately) may have shown availability of the book at the time of check out to both the customers. This will lead to a back order, and one of the customers will be paid back. This may or may not be a good policy. A large number of back orders may affect the shop's reputation and there may also be financial repercussions. Availability Availability, in simple terms, is responsiveness; a system that's always available to serve. The funny thing about availability is that sometimes a system becomes unavailable exactly when you need it the most. In our example, one day before Christmas, everyone is buying gifts. Millions of people are searching, adding items to their carts, buying, and applying for discount coupons. If one server goes down due to overload, the rest of the servers will get even more loaded now, because the request from the dead server will be redirected to the rest of the machines, possibly killing the service due to overload. As the dominoes start to fall, eventually the site will go down. The peril does not end here. When the website comes online again, it will face a storm of requests from all the people who are worried that the offer end time is even closer, or those who will act quickly before the site goes down again. Availability is the key component for extremely loaded services. Bad availability leads to bad user experience, dissatisfied customers, and financial losses. Partition-tolerance Network partitioning is defined as the inability to communicate between two or more subsystems in a distributed system. This can be due to someone walking carelessly in a data center and snapping the cable that connects the machine to the cluster, or may be network outage between two data centers, dropped packages, or wrong configuration. Partition-tolerance is a system that can operate during the network partition. In a distributed system, a network partition is a phenomenon where, due to network failure or any other reason, one part of the system cannot communicate with the other part(s) of the system. An example of network partition is a system that has some nodes in a subnet A and some in subnet B, and due to a faulty switch between these two subnets, the machines in subnet A will not be able to send and receive messages from the machines in subnet B. The network will be allowed to lose many messages arbitrarily sent from one node to another. This means that even if the cable between the two nodes is chopped, the system will still respond to the requests. The following figure shows the database classification based on the CAP theorem: An example of a partition-tolerant system is a system with real-time data replication with no centralized master(s). So, for example, in a system where data is replicated across two data centers, the availability will not be affected, even if a data center goes down. The significance of the CAP theorem Once you decide to scale up, the first thing that comes to mind is vertical scaling, which means using beefier servers with a bigger RAM, more powerful processor(s), and bigger disks. For further scaling, you need to go horizontal. This means adding more servers. Once your system becomes distributed, the CAP theorem starts to play, which means, in a distributed system, you can choose only two out of consistency, availability, and partition-tolerance. So, let's see how choosing two out of the three options affects the system behavior as follows: CA system: In this system, you drop partition-tolerance for consistency and availability. This happens when you put everything related to a transaction on one machine or a system that fails like an atomic unit, like a rack. This system will have serious problems in scaling. CP system: The opposite of a CA system is a CP system. In a CP system, availability is sacrificed for consistency and partition-tolerance. What does this mean? If the system is available to serve the requests, data will be consistent. In an event of a node failure, some data will not be available. A sharded database is an example of such a system. AP system: An available and partition-tolerant system is like an always-on system that is at risk of producing conflicting results in an event of network partition. This is good for user experience, your application stays available, and inconsistency in rare events may be alright for some use cases. In our example, it may not be such a bad idea to back order a few unfortunate customers due to inconsistency of the system than having a lot of users return without making any purchases because of the system's poor availability. Eventual consistent (also known as BASE system): The AP system makes more sense when viewed from an uptime perspective—it's simple and provides a good user experience. But, an inconsistent system is not good for anything, certainly not good for business. It may be acceptable that one customer for the book Nineteen Eighty-Four gets a back order. But if it happens more often, the users would be reluctant to use the service. It will be great if the system could fix itself (read: repair) as soon as the first inconsistency is observed; or, maybe there are processes dedicated to fixing the inconsistency of a system when a partition failure is fixed or a dead node comes back to life. Such systems are called eventual consistent systems. The following figure shows the life of an eventual consistent system: Quoting Wikipedia, "[In a distributed system] given a sufficiently long time over which no changes [in system state] are sent, all updates can be expected to propagate eventually through the system and the replicas will be consistent". (The page on eventual consistency is available at http://en.wikipedia.org/wiki/Eventual_consistency.) Eventual consistent systems are also called BASE, a made-up term to represent that these systems are on one end of the spectrum, which has traditional databases with ACID properties on the opposite end. Cassandra is one such system that provides high availability and partition-tolerance at the cost of consistency, which is tunable. The preceding figure shows a partition-tolerant eventual consistent system. Cassandra Cassandra is a distributed, decentralized, fault tolerant, eventually consistent, linearly scalable, and column-oriented data store. This means that Cassandra is made to easily deploy over a cluster of machines located at geographically different places. There is no central master server, so no single point of failure, no bottleneck, data is replicated, and a faulty node can be replaced without any downtime. It's eventually consistent. It is linearly scalable, which means that with more nodes, the requests served per second per node will not go down. Also, the total throughput of the system will increase with each node being added. And finally, it's column oriented, much like a map (or better, a map of sorted maps) or a table with flexible columns where each column is essentially a key-value pair. So, you can add columns as you go, and each row can have a different set of columns (key-value pairs). It does not provide any relational integrity. It is up to the application developer to perform relation management. So, if Cassandra is so good at everything, why doesn't everyone drop whatever database they are using and jump start with Cassandra? This is a natural question. Some applications require strong ACID compliance, such as a booking system. If you are a person who goes by statistics, you'd ask how Cassandra fares with other existing data stores. TilmannRabl and others in their paper, Solving Big Data Challenges for Enterprise Application Performance Management (http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf), said that, "In terms of scalability, there is a clear winner throughout our experiments. Cassandra achieves the highest throughput for the maximum number of nodes in all experiments with a linear increasing throughput from one to 12 nodes. This comes at the price of a high write and read latency. Cassandra's performance is best for high insertion rates". If you go through the paper, Cassandra wins in almost all the criteria. Equipped with proven concepts of distributed computing, made to reliably serve from commodity servers, and simple and easy maintenance, Cassandra is one of the most scalable, fastest, and very robust NoSQL database. So, the next natural question is, "What makes Cassandra so blazing fast?". Let's dive deeper into the Cassandra architecture. Understanding the architecture of Cassandra Cassandra is a relative latecomer in the distributed data-store war. It takes advantage of two proven and closely similar data-store mechanisms, namely Bigtable: A Distributed Storage System for Structured Data, 2006 (http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/bigtable-osdi06.pdf) and Amazon Dynamo: Amazon's Highly Available Key-value Store, 2007 (http://www.read.seas.harvard.edu/~kohler/class/cs239-w08/decandia07dynamo.pdf). The following diagram displays the read throughputs that show linear scaling of Cassandra: Like BigTable, it has a tabular data presentation. It is not tabular in the strictest sense. It is rather a dictionary-like structure where each entry holds another sorted dictionary/map. This model is more powerful than the usual key-value store and it is named a table, formerly known as a column family. The properties such as eventual consistency and decentralization are taken from Dynamo. Now, assume a column family is a giant spreadsheet, such as MS Excel. But unlike spreadsheets, each row is identified by a row key with a number (token), and unlike spreadsheets, each cell may have its own unique name within the row. In Cassandra, the columns in the rows are sorted by this unique column name. Also, since the number of partitions is allowed to be very large (1.7*1038), it distributes the rows almost uniformly across all the available machines by dividing the rows in equal token groups. Tables or column families are contained within a logical container or name space called keyspace. A keyspace can be assumed to be more or less similar to database in RDBMS. A word on max number of cells, rows, and partitions A cell in a partition can be assumed as a key-value pair. The maximum number of cells per partition is limited by the Java integer's max value, which is about 2 billion. So, one partition can hold a maximum of 2 billion cells. A row, in CQL terms, is a bunch of cells with predefined names. When you define a table with a primary key that has just one column, the primary key also serves as the partition key. But when you define a composite primary key, the first column in the definition of the primary key works as the partition key. So, all the rows (bunch of cells) that belong to one partition key go into one partition. This means that every partition can have a maximum of X rows, where X = (2*109/number_of_columns_in_a_row). Essentially, rows * columns cannot exceed 2 billion per partition. Finally, how many partitions can Cassandra hold for each table or column family? As we know, column families are essentially distributed hashmaps. The keys or row keys or partition keys are generated by taking a consistent hash of the string that you pass. So, the number of partitioned keys is bounded by the number of hashes these functions generate. This means that if you are using the default Murmur3 partitioner (range -263 to +263), the maximum number of partitions that you can have is 1.85*1019. If you use the Random partitioner, the number of partitions that you can have is 1.7*1038. Ring representation A Cassandra cluster is called a ring. The terminology is taken from Amazon Dynamo. Cassandra 1.1 and earlier versions used to have a token assigned to each node. Let's call this value the initial token. Each node is responsible for storing all the rows with token values (a token is basically a hash value of a row key) ranging from the previous node's initial token (exclusive) to the node's initial token (inclusive). This way, the first node, the one with the smallest initial token, will have a range from the token value of the last node (the node with the largest initial token) to the first token value. So, if you jump from node to node, you will make a circle, and this is why a Cassandra cluster is called a ring. Let's take an example. Assume that there is a hashing algorithm (partitioner) that generates tokens from 0 to 127 and you have four Cassandra machines to create a cluster. To allocate equal load, we need to assign each of the four nodes to bear an equal number of tokens. So, the first machine will be responsible for tokens one to 32, the second will hold 33 to 64, third 65 to 96, and fourth 97 to 127 and 0. If you mark each node with the maximum token number that it can hold, the cluster looks like a ring, as shown in the following figure: Token ownership and distribution in a balanced Cassandra ring Virtual nodes In Cassandra 1.1 and previous versions, when you create a cluster or add a node, you manually assign its initial token. This is extra work that the database should handle internally. Apart from this, adding and removing nodes requires manual resetting token ownership for some or all nodes. This is called rebalancing. Yet another problem was replacing a node. In the event of replacing a node with a new one, the data (rows that the to-be-replaced node owns) is required to be copied to the new machine from a replica of the old machine. For a large database, this could take a while because we are streaming from one machine. To solve all these problems, Cassandra 1.2 introduced virtual nodes (vnodes). The following figure shows 16 vnodes distributed over four servers: In the preceding figure, each node is responsible for a single continuous range. In the case of a replication factor of 2 or more, the data is also stored on other machines than the one responsible for the range. (Replication factor (RF) represents the number of copies of a table that exist in the system. So, RF=2, means there are two copies of each record for the table.) In this case, one can say one machine, one range. With vnodes, each machine can have multiple smaller ranges and these ranges are automatically assigned by Cassandra. How does this solve those issues? Let's see. If you have a 30 ring cluster and a node with 256 vnodes had to be replaced. If nodes are well-distributed randomly across the cluster, each physical node in remaining 29 nodes will have 8 or 9 vnodes (256/29) that are replicas of vnodes on the dead node. In older versions, with a replication factor of 3, the data had to be streamed from three replicas (10 percent utilization). In the case of vnodes, all the nodes can participate in helping the new node get up. The other benefit of using vnodes is that you can have a heterogeneous ring where some machines are more powerful than others, and change the vnodes ' settings such that the stronger machines will take proportionally more data than others. This was still possible without vnodes but it needed some tricky calculation and rebalancing. So, let's say you have a cluster of machines with similar hardware specifications and you have decided to add a new server that is twice as powerful as any machine in the cluster. Ideally, you would want it to work twice as harder as any of the old machines. With vnodes, you can achieve this by setting twice as many num_tokens as on the old machine in the new machine's cassandra.yaml file. Now, it will be allotted double the load when compared to the old machines. Yet another benefit of vnodes is faster repair. Node repair requires the creation of a Merkle tree for each range of data that a node holds. The data gets compared with the data on the replica nodes, and if needed, data re-sync is done. Creation of a Merkle tree involves iterating through all the data in the range followed by streaming it. For a large range, the creation of a Merkle tree can be very time consuming while the data transfer might be much faster. With vnodes, the ranges are smaller, which means faster data validation (by comparing with other nodes). Since the Merkle tree creation process is broken into many smaller steps (as there are many small nodes that exist in a physical node), the data transmission does not have to wait till the whole big range finishes. Also, the validation uses all other machines instead of just a couple of replica nodes. As of Cassandra 2.0.9, the default setting for vnodes is "on" with default vnodes per machine as 256. If for some reason you do not want to use vnodes and want to disable this feature, comment out the num_tokens variable and uncomment and set the initial_token variable in cassandra.yaml. If you are starting with a new cluster or migrating an old cluster to the latest version of Cassandra, vnodes are highly recommended. The number of vnodes that you specify on a Cassandra node represents the number of vnodes on that machine. So, the total vnodes on a cluster is the sum total of all the vnodes across all the nodes. One can always imagine a Cassandra cluster as a ring of lots of vnodes. How Cassandra works Diving into various components of Cassandra without having any context is a frustrating experience. It does not make sense why you are studying SSTable, MemTable, and log structured merge (LSM) trees without being able to see how they fit into the functionality and performance guarantees that Cassandra gives. So first we will see Cassandra's write and read mechanism. It is possible that some of the terms that we encounter during this discussion may not be immediately understandable. A rough overview of the Cassandra components is as shown in the following figure: Main components of the Cassandra service The main class of Storage Layer is StorageProxy. It handles all the requests. The messaging layer is responsible for inter-node communications, such as gossip. Apart from this, process-level structures keep a rough idea about the actual data containers and where they live. There are four data buckets that you need to know. MemTable is a hash table-like structure that stays in memory. It contains actual cell data. SSTable is the disk version of MemTables. When MemTables are full, they are persisted to hard disk as SSTable. Commit log is an append only log of all the mutations that are sent to the Cassandra cluster. Mutations can be thought of as update commands. So, insert, update, and delete operations are mutations, since they mutate the data. Commit log lives on the disk and helps to replay uncommitted changes. These three are basically core data. Then there are bloom filters and index. The bloom filter is a probabilistic data structure that lives in the memory. They both live in memory and contain information about the location of data in the SSTable. Each SSTable has one bloom filter and one index associated with it. The bloom filter helps Cassandra to quickly detect which SSTable does not have the requested data, while the index helps to find the exact location of the data in the SSTable file. With this primer, we can start looking into how write and read works in Cassandra. We will see more explanation later. Write in action To write, clients need to connect to any of the Cassandra nodes and send a write request. This node is called the coordinator node. When a node in a Cassandra cluster receives a write request, it delegates the write request to a service called StorageProxy. This node may or may not be the right place to write the data. StorageProxy's job is to get the nodes (all the replicas) that are responsible for holding the data that is going to be written. It utilizes a replication strategy to do this. Once the replica nodes are identified, it sends the RowMutation message to them, the node waits for replies from these nodes, but it does not wait for all the replies to come. It only waits for as many responses as are enough to satisfy the client's minimum number of successful writes defined by ConsistencyLevel. ConsistencyLevel is basically a fancy way of saying how reliable a read or write you want to be. Cassandra has tunable consistency, which means you can define how much reliability is wanted. Obviously, everyone wants a hundred percent reliability, but it comes with latency as the cost. For instance, in a thrice-replicated cluster (replication factor = 3), a write time consistency level TWO, means the write will become successful only if it is written to at least two replica nodes. This request will be faster than the one with the consistency level THREE or ALL, but slower than the consistency level ONE or ANY. The following figure is a simplistic representation of the write mechanism. The operations on node N2 at the bottom represent the node-local activities on receipt of the write request: The following steps show everything that can happen during a write mechanism: If the failure detector detects that there aren't enough live nodes to satisfy ConsistencyLevel, the request fails. If the failure detector gives a green signal, but writes time-out after the request is sent due to infrastructure problems or due to extreme load, StorageProxy writes a local hint to replay when the failed nodes come back to life. This is called hinted hand off. One might think that hinted handoff may be responsible for Cassandra's eventual consistency. But it's not entirely true. If the coordinator node gets shut down or dies due to hardware failure and hints on this machine cannot be forwarded, eventual consistency will not occur. The anti-entropy mechanism is responsible for consistency, rather than hinted hand-off. Anti-entropy makes sure that all replicas are in sync. If the replica nodes are distributed across data centers, it will be a bad idea to send individual messages to all the replicas in other data centers. Rather, it sends the message to one replica in each data center with a header, instructing it to forward the request to other replica nodes in that data center. Now the data is received by the node which should actually store that data. The data first gets appended to the commit log, and pushed to a MemTable appropriate column family in the memory. When the MemTable becomes full, it gets flushed to the disk in a sorted structure named SSTable. With lots of flushes, the disk gets plenty of SSTables. To manage SSTables, a compaction process runs. This process merges data from smaller SSTables to one big sorted file. Read in action Similar to a write case, when StorageProxy of the node that a client is connected to gets the request, it gets a list of nodes containing this key based on the replication strategy. The node's StorageProxy then sorts the nodes based on their proximity to itself. The proximity is determined by the snitch function that is set up for this cluster. Basically, the following types of snitches exist: SimpleSnitch: A closer node is the one that comes first when moving clockwise in the ring. (A ring is when all the machines in the cluster are placed in a circular fashion with each machine having a token number. When you walk clockwise, the token value increases. At the end, it snaps back to the first node.) PropertyFileSnitch: This snitch allows you to specify how you want your machines' location to be interpreted by Cassandra. You do this by assigning a data center name and rack name for all the machines in the cluster in the $CASSANDRA_HOME/conf/cassandra-topology.properties file. Each node has a copy of this file and you need to alter this file each time you add or remove a node. This is what the file looks like: # Cassandra Node IP=Data Center:Rack 192.168.1.100=DC1:RAC1 192.168.2.200=DC2:RAC2 10.0.0.10=DC1:RAC1 10.0.0.11=DC1:RAC1 10.0.0.12=DC1:RAC2 10.20.114.10=DC2:RAC1 10.20.114.11=DC2:RAC1 GossipingPropertyFileSnitch: The PropertyFileSnitch is kind of a pain, even when you think about it. Each node has the locations of all nodes manually written and updated every time a new node joins or an old node retires. And then, we need to copy it on all the servers. Wouldn't it be better if we just specify each node's data center and rack on just that one machine, and then have Cassandra somehow collect this information to understand the topology? This is exactly what GossipingPropertyFileSnitch does. Similar to PropertyFileSnitch, you have a file called $CASSANDRA_HOME/conf/cassandra-rackdc.properties, and in this file you specify the data center and the rack name for that machine. The gossip protocol makes sure that this information gets spread to all the nodes in the cluster (and you do not have to edit properties of files on all the nodes when a new node joins or leaves). Here is what a cassandra-rackdc.properties file looks like: # indicate the rack and dc for this node dc=DC13 rack=RAC42 RackInferringSnitch: This snitch infers the location of a node based on its IP address. It uses the third octet to infer rack name, and the second octet to assign data center. If you have four nodes 10.110.6.30, 10.110.6.4, 10.110.7.42, and 10.111.3.1, this snitch will think the first two live on the same rack as they have the same second octet (110) and the same third octet (6), while the third lives in the same data center but on a different rack as it has the same second octet but the third octet differs. Fourth, however, is assumed to live in a separate data center as it has a different second octet than the three. EC2Snitch: This is meant for Cassandra deployments on Amazon EC2 service. EC2 has regions and within regions, there are availability zones. For example, us-east-1e is an availability zone in the us-east region with availability zone named 1e. This snitch infers the region name (us-east, in this case) as the data center and availability zone (1e) as the rack. EC2MultiRegionSnitch: The multi-region snitch is just an extension of EC2Snitch where data centers and racks are inferred the same way. But you need to make sure that broadcast_address is set to the public IP provided by EC2 and seed nodes must be specified using their public IPs so that inter-data center communication can be done. DynamicSnitch: This Snitch determines closeness based on a recent performance delivered by a node. So, a quick responding node is perceived as being closer than a slower one, irrespective of its location closeness, or closeness in the ring. This is done to avoid overloading a slow performing node. DynamicSnitch is used by all the other snitches by default. You can disable it, but it is not advisable. Now, with knowledge about snitches, we know the list of the fastest nodes that have the desired row keys, it's time to pull data from them. The coordinator node (the one that the client is connected to) sends a command to the closest node to perform a read (we'll discuss local reads in a minute) and return the data. Now, based on ConsistencyLevel, other nodes will send a command to perform a read operation and send just the digest of the result. If we have read repairs (discussed later) enabled, the remaining replica nodes will be sent a message to compute the digest of the command response. Let's take an example. Let's say you have five nodes containing a row key K (that is, RF equals five), your read ConsistencyLevel is three; then the closest of the five nodes will be asked for the data and the second and third closest nodes will be asked to return the digest. If there is a difference in the digests, full data is pulled from the conflicting node and the latest of the three will be sent. These replicas will be updated to have the latest data. We still have two nodes left to be queried. If read repairs are not enabled, they will not be touched for this request. Otherwise, these two will be asked to compute the digest. Depending on the read_repair_chance setting, the request to the last two nodes is done in the background, after returning the result. This updates all the nodes with the most recent value, making all replicas consistent. Let's see what goes on within a node. Take a simple case of a read request looking for a single column within a single row. First, the attempt is made to read from MemTable, which is rapid fast, and since there exists only one copy of data, this is the fastest retrieval. If all required data is not found there, Cassandra looks into SSTable. Now, remember from our earlier discussion that we flush MemTables to disk as SSTables and later when the compaction mechanism wakes up, it merges those SSTables. So, our data can be in multiple SSTables. The following figure represents a simplified representation of the read mechanism. The bottom of the figure shows processing on the read node. The numbers in circles show the order of the event. BF stands for bloom filter. Each SSTable is associated with its bloom filter built on the row keys in the SSTable. Bloom filters are kept in the memory and used to detect if an SSTable may contain (false positive) the row data. Now, we have the SSTables that may contain the row key. The SSTables get sorted in reverse chronological order (latest first). Apart from the bloom filter for row keys, there exists one bloom filter for each row in the SSTable. This secondary bloom filter is created to detect whether the requested column names exist in the SSTable. Now, Cassandra will take SSTables one by one from younger to older, and use the index file to locate the offset for each column value for that row key and the bloom filter associated with the row (built on the column name). On the bloom filter being positive for the requested column, it looks into the SSTable file to read the column value. Note that we may have a column value in other yet-to-be-read SSTables, but that does not matter, because we are reading the most recent SSTables first, and any value that was written earlier to it does not matter. So, the value gets returned as soon as the first column in the most recent SSTable is allocated. Summary By now, you should be familiar with all the nuts and bolts of Cassandra. We have discussed how the pressure to make data stores to web scale inspired a rather not-so-common database mechanism to become mainstream, and how the CAP theorem governs the behavior of such databases. We have seen that Cassandra shines out among its peers. Then, we dipped our toes into the big picture of Cassandra read and write mechanisms. This left us with lots of fancy terms. It is understandable that it may be a lot to take in for someone new to NoSQL systems. It is okay if you do not have complete clarity at this point. As you start working with Cassandra, tweaking it, experimenting with it, and going through the Cassandra mailing list discussions or talks, you will start to come across stuff that you have read in this article and it will start to make sense, and perhaps you may want to come back and refer to this article to improve your clarity. It is not required that you understand this article fully to be able to write queries, set up clusters, maintain clusters, or do anything else related to Cassandra. A general sense of this article will take you far enough to work extremely well with Cassandra-based projects. How does this knowledge help us in building an application? Isn't it just about learning Thrift or CQL API and getting going? You might be wondering why you need to know about the compaction and storage mechanism, when all you need to do is to deliver an application that has a fast backend. It may not be obvious at this point why you are learning this, but as we move ahead with developing an application, we will come to realize that knowledge about underlying storage mechanism helps. Later, if you will deploy a cluster, performance tuning, maintenance, and integrating with other tools such as Apache Hadoop, you may find this article useful. Resources for Article: Further resources on this subject: About Cassandra [article] Replication [article] Getting Up and Running with Cassandra [article]

0
0
5947

Packt

26 Mar 2015

6 min read

Subscribing to a report

Packt

26 Mar 2015

6 min read

In this article by Johan Yu, the author of Salesforce Reporting and Dashboards, we get acquainted to the components used when working with reports on the Salesforce platform. Subscribing to a report is a new feature in Salesforce introduced in the Spring 2015 release. When you subscribe to a report, you will get a notification on weekdays, daily, or weekly, when the reports meet the criteria defined. You just need to subscribe to the report that you most care about. (For more resources related to this topic, see here.) Subscribing to a report is not the same as the report's Schedule Future Run option, where scheduling a report for a future run will keep e-mailing you the report content at a specified frequency defined, without specifying any conditions. But when you subscribe to a report, you will receive notifications when the report output meets the criteria you have defined. Subscribing to a report will not send you the e-mail content, but just an alert that the report you subscribed to meets the conditions specified. To subscribe to a report, you do not need additional permission as our administrator is able to control to enable or disable this feature for the entire organization. By default, this feature will be turned on for customers using the Salesforce Spring 2015 release. If you are an administrator for the organization, you can check out this feature by navigating to Setup | Customize | Reports & Dashboards | Report Notification | Enable report notification subscriptions for all users. Besides receiving notifications via e-mail, you also can opt for Salesforce1 notifications and posts to Chatter feeds, and execute a custom action. Report Subscription To subscribe to a report, you need to define a set of conditions to trigger the notifications. Here is what you need to understand before you subscribe to a report: When: Everytime conditions are met or only the first time conditions are met. Conditions: An aggregate can be a record count or a summarize field. Then define the operator and value you want the aggregate to be compared to. The summarize field means a field that you use in that report to summarize its data as average, smallest, largest, or sum. You can add multiple conditions, but at this moment, you only have the AND condition. Schedule frequency: Schedule weekday, daily, weekly, and the time the report will be run. Actions: E-mail notifications: You will get e-mail alerts when conditions are met. Posts to Chatter feeds: Alerts will be posted to your Chatter feed. Salesforce1 notifications: Alerts in your Salesforce1 app. Execute a custom action: This will trigger a call to the apex class. You will need a developer to write apex code for this. Active: This is a checkbox used to activate or disable subscription. You may just need to disable it when you need to unsubscribe temporarily; otherwise, deleting will remove all the settings defined. The following screenshot shows the conditions set in order to subscribe to a report: Monitoring a report subscription How can you know whether you have subscribed to a report? When you open the report and see the Subscribe button, it means you are not subscribed to that report: Once you configure the report to subscribe, the button label will turn to Edit Subscription. But, do not get it wrong that not all reports with Edit Subscription, you will get alerts when the report meets the criteria, because the setting may just not be active, remember step above when you subscribe a report. To know all the reports you subscribe to at a glance, as long as you have View Setup and Configuration permissions, navigate to Setup | Jobs | Scheduled Jobs, and look for Type as Reporting Notification, as shown in this screenshot: Hands-on – subscribing to a report Here is our next use case: you would like to get a notification in your Salesforce1 app—an e-mail notification—and also posts on your Chatter feed once the Closed Won opportunity for the month has reached $50,000. Salesforce should check the report daily, but instead of getting this notification daily, you want to get it only once a week or month; otherwise, it will be disturbing. Creating reports Make sure you set the report with the correct filter, set Close Date as This Month, and summarize the Amount field, as shown in the following screenshot: Subscribing Click on the Subscribe button and fill in the following details: Type as Only the first time conditions are met Conditions: Aggregate as Sum of Amount Operator as Greater Than or Equal Value as 50000 Schedule: Frequency as Every Weekday Time as 7AM In Actions, select: Send Salesforce1 Notification Post to Chatter Feed Send Email Notification In Active, select the checkbox Testing and saving The good thing of this feature is the ability to test without waiting until the scheduled date or time. Click on the Save & Run Now button. Here is the result: Salesforce1 notifications Open your Salesforce1 mobile app, look for the notification icon, and notice a new alert from the report you subscribed to, as shown in this screenshot: If you click on the notification, it will take you to the report that is shown in the following screenshot: Chatter feed Since you selected the Post to Chatter Feed action, the same alert will go to your Chatter feed as well. Clicking on the link in the Chatter feed will open the same report in your Salesforce1 mobile app or from the web browser, as shown in this screenshot: E-mail notification The last action we've selected for this exercise is to send an e-mail notification. The following screenshot shows how the e-mail notification would look: Limitations The following limitations are observed while subscribing to a report: You can set up to five conditions per report, and no OR logic conditions are possible You can subscribe for up to five reports, so use it wisely Summary In this article, you became familiar with components when working with reports on the Salesforce platform. We saw different report formats and the uniqueness of each format. We continued discussions on adding various types of charts to the report with point-and-click effort and no code; all of this can be done within minutes. We saw how to add filters to reports to customize our reports further, including using Filter Logic, Cross Filter, and Row Limit for tabular reports. We walked through managing and customizing custom report types, including how to hide unused report types and report type adoption analysis. In the last part of this article, we saw how easy it is to subscribe to a report and define criteria. Resources for Article: Further resources on this subject: Salesforce CRM – The Definitive Admin Handbook - Third Edition [article] Salesforce.com Customization Handbook [article] Developing Applications with Salesforce Chatter [article]

0
0
2584

Packt

25 Mar 2015

35 min read

Fun with Sprites – Sky Defense

Packt

25 Mar 2015

35 min read

0
0
6889

Packt

25 Mar 2015

7 min read

Geolocating photos on the map

Packt

25 Mar 2015

7 min read

In this article by Joel Lawhead, author of the book, QGIS Python Programming Cookbook uses the tags to create locations on a map for some photos and provide links to open them. (For more resources related to this topic, see here.) Getting ready You will need to download some sample geotagged photos from https://github.com/GeospatialPython/qgis/blob/gh-pages/photos.zip?raw=true and place them in a directory named photos in your qgis_data directory. How to do it... QGIS requires the Python Imaging Library (PIL), which should already be included with your installation. PIL can parse EXIF tags. We will gather the filenames of the photos, parse the location information, convert it to decimal degrees, create the point vector layer, add the photo locations, and add an action link to the attributes. To do this, we need to perform the following steps: In the QGIS Python Console, import the libraries that we'll need, including k for parsing image data and the glob module for doing wildcard file searches: import globimport Imagefrom ExifTags import TAGS Next, we'll create a function that can parse the header data: def exif(img): exif_data = {} try: i = Image.open(img) tags = i._getexif() for tag, value in tags.items(): decoded = TAGS.get(tag, tag) exif_data[decoded] = value except: passreturn exif_data Now, we'll create a function that can convert degrees-minute-seconds to decimal degrees, which is how coordinates are stored in JPEG images: def dms2dd(d, m, s, i): sec = float((m * 60) + s) dec = float(sec / 3600) deg = float(d + dec) if i.upper() == 'W': deg = deg * -1 elif i.upper() == 'S': deg = deg * -1 return float(deg) Next, we'll define a function to parse the location data from the header data: def gps(exif): lat = None lon = None if exif['GPSInfo']: # Lat coords = exif['GPSInfo'] i = coords[1] d = coords[2][0][0] m = coords[2][1][0] s = coords[2][2][0] lat = dms2dd(d, m ,s, i) # Lon i = coords[3] d = coords[4][0][0] m = coords[4][1][0] s = coords[4][2][0] lon = dms2dd(d, m ,s, i)return lat, lon Next, we'll loop through the photos directory, get the filenames, parse the location information, and build a simple dictionary to store the information, as follows: photos = {}photo_dir = "/Users/joellawhead/qgis_data/photos/"files = glob.glob(photo_dir + "*.jpg")for f in files: e = exif(f) lat, lon = gps(e) photos[f] = [lon, lat] Now, we'll set up the vector layer for editing: lyr_info = "Point?crs=epsg:4326&field=photo:string(75)"vectorLyr = QgsVectorLayer(lyr_info, "Geotagged Photos" , "memory")vpr = vectorLyr.dataProvider() We'll add the photo details to the vector layer: features = []for pth, p in photos.items(): lon, lat = p pnt = QgsGeometry.fromPoint(QgsPoint(lon,lat)) f = QgsFeature() f.setGeometry(pnt) f.setAttributes([pth]) features.append(f)vpr.addFeatures(features)vectorLyr.updateExtents() Now, we can add the layer to the map and make the active layer: QgsMapLayerRegistry.instance().addMapLayer(vectorLyr)iface.setActiveLayer(vectorLyr)activeLyr = iface.activeLayer() Finally, we'll add an action that allows you to click on it and open the photo: actions = activeLyr.actions()actions.addAction(QgsAction.OpenUrl, "Photos", '[% "photo" %]') How it works... Using the included PIL EXIF parser, getting location information and adding it to a vector layer is relatively straightforward. This action is a default option for opening a URL. However, you can also use Python expressions as actions to perform a variety of tasks. The following screenshot shows an example of the data visualization and photo popup: There's more... Another plugin called Photo2Shape is available, but it requires you to install an external EXIF tag parser. Image change detection Change detection allows you to automatically highlight the differences between two images in the same area if they are properly orthorectified. We'll do a simple difference change detection on two images, which are several years apart, to see the differences in urban development and the natural environment. Getting ready You can download the two images from https://github.com/GeospatialPython/qgis/blob/gh-pages/change-detection.zip?raw=true and put them in a directory named change-detection in the rasters directory of your qgis_data directory. Note that the file is 55 megabytes, so it may take several minutes to download. How to do it... We'll use the QGIS raster calculator to subtract the images in order to get the difference, which will highlight significant changes. We'll also add a color ramp shader to the output in order to visualize the changes. To do this, we need to perform the following steps: First, we need to import the libraries that we need in to the QGIS console: from PyQt4.QtGui import *from PyQt4.QtCore import *from qgis.analysis import * Now, we'll set up the path names and raster names for our images: before = "/Users/joellawhead/qgis_data/rasters/change-detection/before.tif"|after = "/Users/joellawhead/qgis_data/rasters/change-detection/after.tif"beforeName = "Before"afterName = "After" Next, we'll establish our images as raster layers: beforeRaster = QgsRasterLayer(before, beforeName)afterRaster = QgsRasterLayer(after, afterName) Then, we can build the calculator entries: beforeEntry = QgsRasterCalculatorEntry()afterEntry = QgsRasterCalculatorEntry()beforeEntry.raster = beforeRasterafterEntry.raster = afterRasterbeforeEntry.bandNumber = 1afterEntry.bandNumber = 2beforeEntry.ref = beforeName + "@1"afterEntry.ref = afterName + "@2"entries = [afterEntry, beforeEntry] Now, we'll set up the simple expression that does the math for remote sensing: exp = "%s - %s" % (afterEntry.ref, beforeEntry.ref) Then, we can set up the output file path, the raster extent, and pixel width and height: output = "/Users/joellawhead/qgis_data/rasters/change-detection/change.tif"e = beforeRaster.extent()w = beforeRaster.width()h = beforeRaster.height() Now, we perform the calculation: change = QgsRasterCalculator(exp, output, "GTiff", e, w, h, entries)change.processCalculation() Finally, we'll load the output as a layer, create the color ramp shader, apply it to the layer, and add it to the map, as shown here: lyr = QgsRasterLayer(output, "Change")algorithm = QgsContrastEnhancement.StretchToMinimumMaximumlimits = QgsRaster.ContrastEnhancementMinMaxlyr.setContrastEnhancement(algorithm, limits)s = QgsRasterShader()c = QgsColorRampShader()c.setColorRampType(QgsColorRampShader.INTERPOLATED)i = []qri = QgsColorRampShader.ColorRampItemi.append(qri(0, QColor(0,0,0,0), 'NODATA'))i.append(qri(-101, QColor(123,50,148,255), 'Significant Itensity Decrease'))i.append(qri(-42.2395, QColor(194,165,207,255), 'Minor Itensity Decrease'))i.append(qri(16.649, QColor(247,247,247,0), 'No Change'))i.append(qri(75.5375, QColor(166,219,160,255), 'Minor Itensity Increase'))i.append(qri(135, QColor(0,136,55,255), 'Significant Itensity Increase'))c.setColorRampItemList(i)s.setRasterShaderFunction(c)ps = QgsSingleBandPseudoColorRenderer(lyr.dataProvider(), 1, s)lyr.setRenderer(ps)QgsMapLayerRegistry.instance().addMapLayer(lyr) How it works... If a building is added in the new image, it will be brighter than its surroundings. If a building is removed, the new image will be darker in that area. The same holds true for vegetation, to some extent. Summary The concept is simple. We subtract the older image data from the new image data. Concentrating on urban areas tends to be highly reflective and results in higher image pixel values. Resources for Article: Further resources on this subject: Prototyping Arduino Projects using Python [article] Python functions – Avoid repeating code [article] Pentesting Using Python [article]

0
0
3254

Packt

25 Mar 2015

21 min read

WLAN Encryption Flaws

Packt

25 Mar 2015

21 min read

0
0
11009

How-To Tutorials

article-image-cross-browser-tests-using-selenium-webdriver

Packt

25 Mar 2015

18 min read

Cross-browser Tests using Selenium WebDriver

Packt

25 Mar 2015

18 min read

0
0
6603

Packt

25 Mar 2015

6 min read

Prerequisites

Packt

25 Mar 2015

6 min read

In this article by Deepak Vohra, author of the book, Advanced Java® EE Development with WildFly® you will see how to create a Java EE project and its pre-requisites. (For more resources related to this topic, see here.) The objective of the EJB 3.x specification is to simplify its development by improving the EJB architecture. This simplification is achieved by providing metadata annotations to replace XML configuration. It also provides default configuration values by making entity and session beans POJOs (Plain Old Java Objects) and by making component and home interfaces redundant. The EJB 2.x entity beans is replaced with EJB 3.x entities. EJB 3.0 also introduced the Java Persistence API (JPA) for object-relational mapping of Java objects. WildFly 8.x supports EJB 3.2 and the JPA 2.1 specifications from Java EE 7. The sample application is based on Java EE 6 and EJB 3.1. The configuration of EJB 3.x with Java EE 7 is also discussed and the sample application can be used or modified to run on a Java EE 7 project. We have used a Hibernate 4.3 persistence provider. Unlike some of the other persistence providers, the Hibernate persistence provider supports automatic generation of relational database tables including the joining of tables. In this article, we will create an EJB 3.x project. This article has the following topics: Setting up the environment Creating a WildFly runtime Creating a Java EE project Setting up the environment We need to download and install the following software: WildFly 8.1.0.Final: Download wildfly-8.1.0.Final.zip from http://wildfly.org/downloads/. MySQL 5.6 Database-Community Edition: Download this edition from http://dev.mysql.com/downloads/mysql/. When installing MySQL, also install Connector/J. Eclipse IDE for Java EE Developers: Download Eclipse Luna from https://www.eclipse.org/downloads/packages/release/Luna/SR1. JBoss Tools (Luna) 4.2.0.Final: Install this as a plug-in to Eclipse from the Eclipse Marketplace (http://tools.jboss.org/downloads/installation.html). The latest version from Eclipse Marketplace is likely to be different than 4.2.0. Apache Maven: Download version 3.05 or higher from http://maven.apache.org/download.cgi. Java 7: Download Java 7 from http://www.oracle.com/technetwork/java/javase/downloads/index.html?ssSourceSiteId=ocomcn. Set the environment variables: JAVA_HOME, JBOSS_HOME, MAVEN_HOME, and MYSQL_HOME. Add %JAVA_HOME%/bin, %MAVEN_HOME%/bin, %JBOSS_HOME%/bin, and %MYSQL_HOME%/bin to the PATH environment variable. The environment settings used are C:wildfly-8.1.0.Final for JBOSS_HOME, C:Program FilesMySQLMySQL Server 5.6.21 for MYSQL_HOME, C:mavenapache-maven-3.0.5 for MAVEN_HOME, and C:Program FilesJavajdk1.7.0_51 for JAVA_HOME. Run the add-user.bat script from the %JBOSS_HOME%/bin directory to create a user for the WildFly administrator console. When prompted What type of user do you wish to add?, select a) Management User. The other option is b) Application User. Management User is used to log in to Administration Console, and Application User is used to access applications. Subsequently, specify the Username and Password for the new user. When prompted with the question, Is this user going to be used for one AS process to connect to another AS..?, enter the answer as no. When installing and configuring the MySQL database, specify a password for the root user (the password mysql is used in the sample application). Creating a WildFly runtime As the application is run on WildFly 8.1, we need to create a runtime environment for WildFly 8.1 in Eclipse. Select Window | Preferences in Eclipse. In Preferences, select Server | Runtime Environment. Click on the Add button to add a new runtime environment, as shown in the following screenshot: In New Server Runtime Environment, select JBoss Community | WildFly 8.x Runtime. Click on Next: In WildFly Application Server 8.x, which appears below New Server Runtime Environment, specify a Name for the new runtime or choose the default name, which is WildFly 8.x Runtime. Select the Home Directory for the WildFly 8.x server using the Browse button. The Home Directory is the directory where WildFly 8.1 is installed. The default path is C:wildfly-8.1.0.Final. Select the Runtime JRE as JavaSE-1.7. If the JDK location is not added to the runtime list, first add it from the JRE preferences screen in Eclipse. In Configuration base directory, select standalone as the default setting. In Configuration file, select standalone.xml as the default setting. Click on Finish: A new server runtime environment for WildFly 8.x Runtime gets created, as shown in the following screenshot. Click on OK: Creating a Server Runtime Environment for WildFly 8.x is a prerequisite for creating a Java EE project in Eclipse. In the next topic, we will create a new Java EE project for an EJB 3.x application. Creating a Java EE project JBoss Tools provides project templates for different types of JBoss projects. In this topic, we will create a Java EE project for an EJB 3.x application. Select File | New | Other in Eclipse IDE. In the New wizard, select the JBoss Central | Java EE EAR Project wizard. Click on the Next button: The Java EE EAR Project wizard gets started. By default, a Java EE 6 project is created. A Java EE EAR Project is a Maven project. The New Project Example window lists the requirements and runs a test for the requirements. The JBoss AS runtime is required and some plugins (including the JBoss Maven Tools plugin) are required for a Java EE project. Select Target Runtime as WildFly 8.x Runtime, which was created in the preceding topic. Then, check the Create a blank project checkbox. Click on the Next button: Specify Project name as jboss-ejb3, Package as org.jboss.ejb3, and tick the Use default Workspace location box. Click on the Next button: Specify Group Id as org.jboss.ejb3, Artifact Id as jboss-ejb3, Version as 1.0.0, and Package as org.jboss.ejb3.model. Click on Finish: A Java EE project gets created, as shown in the following Project Explorer window. The jboss-ejb3 project consists of three subprojects: jboss-ejb3-ear, jboss-ejb3-ejb, and jboss-ejb3-web. Each subproject consists of a pom.xml file for Maven. The jboss-ejb3-ejb subproject consists of a META-INF/persistence.xml file within the src/main/resources source folder for the JPA database persistence configuration. Summary In this article, we learned how to create a Java EE project and its prerequisites. Resources for Article: Further resources on this subject: Common performance issues [article] Running our ﬁrst web application [article] Various subsystem configurations [article]

0
0
5560

article-image-handling-image-and-video-files

Packt

24 Mar 2015

18 min read

Handling Image and Video Files

Packt

24 Mar 2015

18 min read

This article written by Gloria Bueno García, Oscar Deniz Suarez, José Luis Espinosa Aranda, Jesus Salido Tercero, Ismael Serrano Gracia, and Noelia Vállez Enanois, the authors of Learning Image Processing with OpenCV, is intended as a first contact with OpenCV, its installation, and first basic programs. We will cover the following topics: A brief introduction to OpenCV for the novice, followed by an easy step-by-step guide to the installation of the library A quick tour of OpenCV's structure after the installation in the user's local disk Quick recipes to create projects using the library with some common programming frameworks How to use the functions to read and write images and videos (For more resources related to this topic, see here.) An introduction to OpenCV Initially developed by Intel, OpenCV (Open Source Computer Vision) is a free cross-platform library for real-time image processing that has become a de facto standard tool for all things related to Computer Vision. The first version was released in 2000 under BSD license and since then, its functionality has been very much enriched by the scientific community. In 2012, the nonprofit foundation OpenCV.org took on the task of maintaining a support site for developers and users. At the time of writing this article, a new major version of OpenCV (Version 3.0) is available, still on beta status. Throughout the article, we will present the most relevant changes brought with this new version. OpenCV is available for the most popular operating systems, such as GNU/Linux, OS X, Windows, Android, iOS, and some more. The first implementation was in the C programming language; however, its popularity grew with its C++ implementation as of Version 2.0. New functions are programmed with C++. However, nowadays, the library has a full interface for other programming languages, such as Java, Python, and MATLAB/Octave. Also, wrappers for other languages (such as C#, Ruby, and Perl) have been developed to encourage adoption by programmers. In an attempt to maximize the performance of computing intensive vision tasks, OpenCV includes support for the following: Multithreading on multicore computers using Threading Building Blocks (TBB)—a template library developed by Intel. A subset of Integrated Performance Primitives (IPP) on Intel processors to boost performance. Thanks to Intel, these primitives are freely available as of Version 3.0 beta. Interfaces for processing on Graphic Processing Unit (GPU) using Compute Unified Device Architecture (CUDA) and Open Computing Language (OpenCL). The applications for OpenCV cover areas such as segmentation and recognition, 2D and 3D feature toolkits, object identification, facial recognition, motion tracking, gesture recognition, image stitching, high dynamic range (HDR) imaging, augmented reality, and so on. Moreover, to support some of the previous application areas, a module with statistical machine learning functions is included. Downloading and installing OpenCV OpenCV is freely available for download at http://opencv.org. This site provides the last version for distribution (currently, 3.0 beta) and older versions. Special care should be taken with possible errors when the downloaded version is a nonstable release, for example, the current 3.0 beta version. On http://opencv.org/releases.html, suitable versions of OpenCV for each platform can be found. The code and information of the library can be obtained from different repositories depending on the final purpose: The main repository (at http://sourceforge.net/projects/opencvlibrary), devoted to final users. It contains binary versions of the library and ready-to-compile sources for the target platform. The test data repository (at https://github.com/itseez/opencv_extra) with sets of data to test purposes of some library modules. The contributions repository (at http://github.com/itseez/opencv_contrib) with the source code corresponding to extra and cutting-edge features supplied by contributors. This code is more error-prone and less tested than the main trunk. With the last version, OpenCV 3.0 beta, the extra contributed modules are not included in the main package. They should be downloaded separately and explicitly included in the compilation process through the proper options. Be cautious if you include some of those contributed modules, because some of them have dependencies on third-party software not included with OpenCV. The documentation site (at http://docs.opencv.org/master/) for each of the modules, including the contributed ones. The development repository (at https://github.com/Itseez/opencv) with the current development version of the library. It is intended for developers of the main features of the library and the "impatient" user who wishes to use the last update even before it is released. Rather than GNU/Linux and OS X, where OpenCV is distributed as source code only, in the Windows distribution, one can find precompiled (with Microsoft Visual C++ v10, v11, and v12) versions of the library. Each precompiled version is ready to be used with Microsoft compilers. However, if the primary intention is to develop projects with a different compiler framework, we need to compile the library for that specific compiler (for example, GNU GCC). The fastest route to working with OpenCV is to use one of the precompiled versions included with the distribution. Then, a better choice is to build a fine-tuned version of the library with the best settings for the local platform used for software development. This article provides the information to build and install OpenCV on Windows. Further information to set the library on Linux can be found at http://docs.opencv.org/doc/tutorials/introduction/linux_install and https://help.ubuntu.com/community/OpenCV. Getting a compiler and setting CMake A good choice for cross-platform development with OpenCV is to use the GNU toolkit (including gmake, g++, and gdb). The GNU toolkit can be easily obtained for the most popular operating systems. Our preferred choice for a development environment consists of the GNU toolkit and the cross-platform Qt framework, which includes the Qt library and the Qt Creator Integrated Development Environment(IDE). The Qt framework is freely available at http://qt-project.org/. After installing the compiler on Windows, remember to properly set the Path environment variable, adding the path for the compiler's executable, for example, C:QtQt5.2.15.2.1mingw48_32bin for the GNU/compilers included with the Qt framework. On Windows, the free Rapid Environment Editor tool (available at http://www.rapidee.com) provides a convenient way to change Path and other environment variables. To manage the build process for the OpenCV library in a compiler-independent way, CMake is the recommended tool. CMake is a free and open source cross-platform tool available at http://www.cmake.org/. Configuring OpenCV with CMake Once the sources of the library have been downloaded into the local disk, it is required that you configure the makefiles for the compilation process of the library. CMake is the key tool for an easy configuration of OpenCV's installation process. It can be used from the command line or in a more user-friendly way with its Graphical User Interface (GUI) version. The steps to configure OpenCV with CMake can be summarized as follows: Choose the source (let's call it OPENCV_SRC in what follows) and target (OPENCV_BUILD) directories. The target directory is where the compiled binaries will be located. Mark the Grouped and Advanced checkboxes and click on the Configure button. Choose the desired compiler (for example, GNU default compilers, MSVC, and so on). Set the preferred options and unset those not desired. Click on the Configure button and repeat steps 4 and 5 until no errors are obtained. Click on the Generate button and close CMake. The following screenshot shows you the main window of CMake with the source and target directories and the checkboxes to group all the available options: The main window of CMake after the preconfiguration step For brevity, we use OPENCV_BUILD and OPENCV_SRC in this text to denote the target and source directories of the OpenCV local setup, respectively. Keep in mind that all directories should match your current local configuration. During the preconfiguration process, CMake detects the compilers present and many other local properties to set the build process of OpenCV. The previous screenshot displays the main CMake window after the preconfiguration process, showing the grouped options in red. It is possible to leave the default options unchanged and continue the configuration process. However, some convenient options can be set: BUILD_EXAMPLES: This is set to build some examples using OpenCV. BUILD_opencv_<module_name>: This is set to include the module (module_name) in the build process. OPENCV_EXTRA_MODULES_PATH: This is used when you need some extra contributed module; set the path for the source code of the extra modules here (for example, C:/opencv_contrib-master/modules). WITH_QT: This is turned on to include the Qt functionality into the library. WITH_IPP: This option is turned on by default. The current OpenCV 3.0 version includes a subset of the Intel Integrated Performance Primitives (IPP) that speed up the execution time of the library. If you compile the new OpenCV 3.0 (beta), be cautious because some unexpected errors have been reported related to the IPP inclusion (that is, with the default value of this option). We recommend that you unset the WITH_IPP option. If the configuration steps with CMake (loop through steps 4 and 5) don't produce any further errors, it is possible to generate the final makefiles for the build process. The following screenshot shows you the main window of CMake after a generation step without errors: Compiling and installing the library The next step after the generation process of makefiles with CMake is the compilation with the proper make tool. This tool is usually executed on the command line (the console) from the target directory (the one set at the CMake configuration step). For example, in Windows, the compilation should be launched from the command line as follows: OPENCV_BUILD>mingw32-make This command launches a build process using the makefiles generated by CMake. The whole compilation typically takes several minutes. If the compilation ends without errors, the installation continues with the execution of the following command: OPENCV_BUILD>mingw32-make install This command copies the OpenCV binaries to the OPENCV_BUILDinstall directory. If something went wrong during the compilation, we should run CMake again to change the options selected during the configuration. Then, we should regenerate the makefiles. The installation ends by adding the location of the library binaries (for example, in Windows, the resulting DLL files are located at OPENCV_BUILDinstallx64mingwbin) to the Path environment variable. Without this directory in the Path field, the execution of every OpenCV executable will give an error as the library binaries won't be found. To check the success of the installation process, it is possible to run some of the examples compiled along with the library (if the BUILD_EXAMPLES option was set using CMake). The code samples (written in C++) can be found at OPENCV_BUILDinstallx64mingwsamplescpp. The short instructions given to install OpenCV apply to Windows. A detailed description with the prerequisites for Linux can be read at http://docs.opencv.org/doc/tutorials/introduction/linux_install/linux_install.html. Although the tutorial applies to OpenCV 2.0, almost all the information is still valid for Version 3.0. The structure of OpenCV Once OpenCV is installed, the OPENCV_BUILDinstall directory will be populated with three types of files: Header files: These are located in the OPENCV_BUILDinstallinclude subdirectory and are used to develop new projects with OpenCV. Library binaries: These are static or dynamic libraries (depending on the option selected with CMake) with the functionality of each of the OpenCV modules. They are located in the bin subdirectory (for example, x64mingwbin when the GNU compiler is used). Sample binaries: These are executables with examples that use the libraries. The sources for these samples can be found in the source package (for example, OPENCV_SRCsourcessamples). OpenCV has a modular structure, which means that the package includes a static or dynamic (DLL) library for each module. The official documentation for each module can be found at http://docs.opencv.org/master/. The main modules included in the package are: core: This defines the basic functions used by all the other modules and the fundamental data structures including the important multidimensional array Mat. highgui: This provides simple user interface (UI) capabilities. Building the library with Qt support (the WITH_QT CMake option) allows UI compatibility with such a framework. imgproc: These are image processing functions that include filtering (linear and nonlinear), geometric transformations, color space conversion, histograms, and so on. imgcodecs: This is an easy-to-use interface to read and write images. Pay attention to the changes in modules since OpenCV 3.0 as some functionality has been moved to a new module (for example, reading and writing images functions were moved from highgui to imgcodecs). photo: This includes Computational Photography including inpainting, denoising, High Dynamic Range (HDR) imaging, and some others. stitching: This is used for image stitching. videoio: This is an easy-to-use interface for video capture and video codecs. video: This supplies the functionality of video analysis (motion estimation, background extraction, and object tracking). features2d: These are functions for feature detection (corners and planar objects), feature description, feature matching, and so on. objdetect: These are functions for object detection and instances of predefined detectors (such as faces, eyes, smile, people, cars, and so on). Some other modules are calib3d (camera calibration), flann (clustering and search), ml (machine learning), shape (shape distance and matching), superres (super resolution), video (video analysis), and videostab (video stabilization). As of Version 3.0 beta, the new contributed modules are distributed in a separate package (opencv_contrib-master.zip) that can be downloaded from https://github.com/itseez/opencv_contrib. These modules provide extra features that should be fully understood before using them. For a quick overview of the new functionality in the new release of OpenCV (Version 3.0), refer to the document at http://opencv.org/opencv-3-0-beta.html. Creating user projects with OpenCV In this article, we assume that C++ is the main language for programming image processing applications, although interfaces and wrappers for other programming languages are actually provided (for instance, Python, Java, MATLAB/Octave, and some more). In this section, we explain how to develop applications with OpenCV's C++ API using an easy-to-use cross-platform framework. General usage of the library To develop an OpenCV application with C++, we require our code to: Include the OpenCV header files with definitions Link the OpenCV libraries (binaries) to get the final executable The OpenCV header files are located in the OPENCV_BUILDinstallincludeopencv2 directory where there is a file (*.hpp) for each of the modules. The inclusion of the header file is done with the #include directive, as shown here: #include <opencv2/<module_name>/<module_name>.hpp>// Including the header file for each module used in the code With this directive, it is possible to include every header file needed by the user program. On the other hand, if the opencv.hpp header file is included, all the header files will be automatically included as follows: #include <opencv2/opencv.hpp>// Including all the OpenCV's header files in the code Remember that all the modules installed locally are defined in the OPENCV_BUILDinstallincludeopencv2opencv_modules.hpp header file, which is generated automatically during the building process of OpenCV. The use of the #include directive is not always a guarantee for the correct inclusion of the header files, because it is necessary to tell the compiler where to find the include files. This is achieved by passing a special argument with the location of the files (such as I<location> for GNU compilers). The linking process requires you to provide the linker with the libraries (dynamic or static) where the required OpenCV functionality can be found. This is usually done with two types of arguments for the linker: the location of the library (such as -L<location> for GNU compilers) and the name of the library (such as -l<module_name>). You can find a complete list of available online documentation for GNU GCC and Make at https://gcc.gnu.org/onlinedocs/ and https://www.gnu.org/software/make/manual/. Tools to develop new projects The main prerequisites to develop our own OpenCV C++ applications are: OpenCV header files and library binaries: Of course we need to compile OpenCV, and the auxiliary libraries are prerequisites for such a compilation. The package should be compiled with the same compiler used to generate the user application. A C++ compiler: Some associate tools are convenient as the code editor, debugger, project manager, build process manager (for instance CMake), revision control system (such as Git, Mercurial, SVN, and so on), and class inspector, among others. Usually, these tools are deployed together in a so-called Integrated Development Environment (IDE). Any other auxiliary libraries: Optionally, any other auxiliary libraries needed to program the final application, such as graphical, statistical, and so on will be required. The most popular available compiler kits to program OpenCV C++ applications are: Microsoft Visual C (MSVC): This is only supported on Windows and it is very well integrated with the IDE Visual Studio, although it can be also integrated with other cross-platform IDEs, such as Qt Creator or Eclipse. Versions of MSVC that currently compatible with the latest OpenCV release are VC 10, VC 11, and VC 12 (Visual Studio 2010, 2012, and 2013). GNU Compiler Collection GNU GCC: This is a cross-platform compiler system developed by the GNU project. For Windows, this kit is known as MinGW (Minimal GNU GCC). The version compatible with the current OpenCV release is GNU GCC 4.8. This kit may be used with several IDEs, such as Qt Creator, Code::Blocks, Eclipse, among others. For the examples presented in this article, we used the MinGW 4.8 compiler kit for Windows plus the Qt 5.2.1 library and the Qt Creator IDE (3.0.1). The cross-platform Qt library is required to compile OpenCV with the new UI capabilities provided by such a library. For Windows, it is possible to download a Qt bundle (including Qt library, Qt Creator, and the MinGW kit) from http://qt-project.org/. The bundle is approximately 700 MB. Qt Creator is a cross-platform IDE for C++ that integrates the tools we need to code applications. In Windows, it may be used with MinGW or MSVC. The following screenshot shows you the Qt Creator main window with the different panels and views for an OpenCV C++ project: The main window of Qt Creator with some views from an OpenCV C++ project Creating an OpenCV C++ program with Qt Creator Next, we explain how to create a code project with the Qt Creator IDE. In particular, we apply this description to an OpenCV example. We can create a project for any OpenCV application using Qt Creator by navigating to File | New File or File | Project… and then navigating to Non-Qt Project | Plain C++ Project. Then, we have to choose a project name and the location at which it will be stored. The next step is to pick a kit (that is, the compiler) for the project (in our case, Desktop Qt 5.2.1 MinGW 32 bit) and the location for the binaries generated. Usually, two possible build configurations (profiles) are used: debug and release. These profiles set the appropriate flags to build and run the binaries. When a project is created using Qt Creator, two special files (with .pro and .pro.user extensions) are generated to configure the build and run processes. The build process is determined by the kit chosen during the creation of the project. With the Desktop Qt 5.2.1 MinGW 32 bit kit, this process relies on the qmake and mingw32-make tools. Using the *.pro file as the input, qmake generates the makefile that drives the build process for each profile (that is, release and debug). The qmake tool is used from the Qt Creator IDE as an alternative to CMake to simplify the build process of software projects. It automates the generation of makefiles from a few lines of information. The following lines represent an example of a *.pro file (for example, showImage.pro): TARGET: showImage TEMPLATE = app CONFIG += console CONFIG -= app_bundle CONFIG -= qt SOURCES += showImage.cpp INCLUDEPATH += C:/opencv300-buildQt/install/include LIBS += -LC:/opencv300-buildQt/install/x64/mingw/lib -lopencv_core300.dll -lopencv_imgcodecs300.dll -lopencv_highgui300.dll -lopencv_imgproc300.dll The preceding file illustrates the options that qmake needs to generate the appropriate makefiles to build the binaries for our project. Each line starts with a tag indicating an option (TARGET, CONFIG, SOURCES, INCLUDEPATH, and LIBS) followed with a mark to add (+=) or remove (-=) the value of the option. In this sample project, we use the non-Qt console application. The executable file is showImage.exe (TARGET) and the source file is showImage.cpp (SOURCES). As this project is an OpenCV-based application, the two last tags indicate the location of the header files (INCLUDEPATH) and the OpenCV libraries (LIBS) used by this particular project (core, imgcodecs, highgui, and imgproc). Note that a backslash at the end of the line denotes continuation in the next line. For a detailed description of the tools (including Qt Creator and qmake) developed within the Qt project, visit http://doc.qt.io/. Summary This article has now covered an introduction to OpenCV, how to download and install OpenCV, the structure of OpenCV, and how to create user projects with OpenCV. Resources for Article: Further resources on this subject: Camera Calibration [article] OpenCV: Segmenting Images [article] New functionality in OpenCV 3.0 [article]

0
0
2084

How-To Tutorials

article-image-creating-custom-reports-and-notifications-vsphere

Packt

24 Mar 2015

34 min read

Creating Custom Reports and Notifications for vSphere

Packt

24 Mar 2015

34 min read

In this article by Philip Sellers, the author of PowerCLI Cookbook, you will cover the following topics: Getting alerts from a vSphere environment Basics of formatting output from PowerShell objects Sending output to CSV and HTML Reporting VM objects created during a predefined time period from VI Events object Setting custom properties to add useful context to your virtual machines Using PowerShell native capabilities to schedule scripts (For more resources related to this topic, see here.) This article is all about leveraging the information available to you in PowerCLI. As much as any other topic, figuring out how to tap into the data that PowerCLI offers is as important as understanding the cmdlets and syntax of the language. However, once you obtain your data, you will need to alter the formatting and how it's returned to be used. PowerShell, and by extension PowerCLI, offers a big set of ways to control the formatting and the display of information returned by its cmdlets and data objects. You will explore all of these topics with this article. Getting alerts from a vSphere environment Discovering the data available to you is the most difficult thing that you will learn and adopt in PowerCLI after learning the initial cmdlets and syntax. There is a large amount of data available to you through PowerCLI, but there are techniques to extract the data in a way that you can use. The Get-Member cmdlet is a great tool for discovering the properties that you can use. Sometimes, just listing the data returned by a cmdlet is enough; however, when the property contains other objects, Get-Member can provide context to know that the Alarm property is a Managed Object Reference (MoRef) data type. As your returned objects have properties that contain other objects, you can have multiple layers of data available for you to expose using PowerShell dot notation ($variable.property.property). The ExtensionData property found on most objects has a lot of related data and objects to the primary data. Sometimes, the data found in the property is an object identifier that doesn't mean much to an administrator but represents an object in vSphere. In these cases, the Get-View cmdlet can refer to that identifier and return human-readable data. This topic will walk you through the methods of accessing data and converting it to usable, human-readable data wherever needed so that you can leverage it in scripts. To explore these methods, we will take a look at vSphere's built-in alert system. While PowerCLI has native cmdlets to report on the defined alarm states and actions, it doesn't have a native cmdlet to retrieve the triggered alarms on a particular object. To do this, you must get the datacenter, VMhost, VM, and other objects directly and look at data from the ExtensionData property. Getting ready To begin this topic, you will need a PowerCLI window and an active connection to vCenter. You should also check the vSphere Web Client or the vSphere Windows Client to see whether you have any active alarms and to know what to expect. If you do not have any active VM alarms, you can simulate an alarm condition using a utility such as HeavyLoad. For more information on generating an alarm, see the There's more... section of this topic. How to do it… In order to access data and convert it to usable, human-readable data, perform the following steps: The first step is to retrieve all of the VMs on the system. A simple Get-VM cmdlet will return all VMs on the vCenter you're connected to. Within the VM object returned by Get-VM, one of the properties is ExtensionData. This property is an object that contains many additional properties and objects. One of the properties is TriggeredAlarmState: Get-VM | Where {$_.ExtensionData.TriggeredAlarmState -ne $null} To dig into TriggeredAlarmState more, take the output of the previous cmdlet and store it into a variable. This will allow you to enumerate the properties without having to wait for the Get-VM cmdlet to run. Add a Select -First 1 cmdlet to the command string so that only a single object is returned. This will help you look inside without having to deal with multiple VMs in the variable: $alarms = Get-VM | Where {$_.ExtensionData.TriggeredAlarmState -ne $null} | Select -First 1 Now that you have extracted an alarm, how do you get useful data about what type of alarm it is and which vSphere object has a problem? In this case, you have VM objects since you used Get-VM to find the alarms. To see what is in the TriggeredAlarmState property, output the contents of TriggeredAlarmState and pipe it to Get-Member or its shortcut GM: $alarms.ExtensionData.TriggeredAlarmState | GM The following screenshot shows the output of the preceding command line: List the data in the $alarms variable without the Get-Member cmdlet appended and view the data in a real alarm. The data returned does tell you the time when the alarm was triggered, the OverallStatus property or severity of the alarm, and whether the alarm has been acknowledged by an administrator, who acknowledged it and at what time. You will see that the Entity property contains a reference to a virtual machine. You can use the Get-View cmdlet on a reference to a VM, in this case, the Entity property, and return the virtual machine name and other properties. You will also see that Alarm is referred to in a similar way and we can extract usable information using Get-View also: Get-View $alarms.ExtensionData.TriggeredAlarmState.Entity Get-View $alarms.ExtensionData.TriggeredAlarmState.Alarm You can see how the output from these two views differs. The Entity view provides the name of the VM. You don't really need this data since the top-level object contains the VM name, but it's good to understand how to use Get-View with an entity. On the other hand, the data returned by the Alarm view does not show the name or type of the alarm, but it does include an Info property. Since this is the most likely property with additional information, you should list its contents. To do so, enclose the Get-View cmdlet in parenthesis and then use dot notation to access the Info variable: (Get-View $alarms.ExtensionData.TriggeredAlarmState.Alarm).Info In the output from the Info property, you can see that the example alarm in the screenshot is a Virtual Machine CPU usage alarm. Your alarm can be different, but it should appear similar to this. After retrieving PowerShell objects that contain the data that you need, the easiest way to return the data is to use calculated expressions. Since the Get-VM cmdlet was the source for all lookup data, you will need to use this object with the calculated expressions to display the data. To do this, you will append a Select statement after the Get-VM and Where statement. Notice that you use the same Get-View statement, except that you change your variable to $_, which is the current object being passed into Select: Get-VM | Where {$_.ExtensionData.TriggeredAlarmState -ne $null} | Select Name, @{N="AlarmName";E={(Get-View $_.ExtensionData. TriggeredAlarmState.Alarm).Info.Name}}, @{N="AlarmDescription";E={(Get-View $_.ExtensionData. TriggeredAlarmState.Alarm).Info.Description}}, @ {N="TimeTriggered"; E={$_.ExtensionData.TriggeredAlarmState. Time}}, @{N="AlarmOverallStatus"; E={$_.ExtensionData. TriggeredAlarmState. OverallStatus}} How it works When the data you really need is several levels below the top-level properties of a data object, you need to use calculated expressions to return these at the top level. There are other techniques where you can build your own object with only the data you want returned, but in a large environment with thousands of objects in vSphere, the method in this topic will execute faster than looping through many objects to build a custom object. Calculated expressions are extremely powerful since nearly anything can be done with expressions. More than that, you explored techniques to discover the data you want. Data exploration can provide you with incredible new capabilities. The point is you need to know where the data is and how to pull that data back to the top level. There's more… It is likely that your test environment has no alarms and in this case, it might be up to you to create an alarm situation. One of the easiest to control and create is heavy CPU load with a CPU load-testing tool. JAM Software created software named HeavyLoad that is a stress-testing tool. This utility can be loaded into any Windows VM on your test systems and can consume all of the available CPU that the VM is configured with. To be safe, configure the VM with a single vCPU and the utility will consume all of the available CPU. Once you install the utility, go to the Test Options menu and you can uncheck the Stress GPU option, ensure that Stress CPU and Allocate Memory are checked. The utility also has shortcut buttons on the Menu bar to allow you to set these options. Click on the Start button (which looks like a Play button) and the utility begins to stress the VM. For users who wish to do the same test, but utilize Linux, StressLinux is a great option. StressLinux is a minimal distribution designed to create high load on an operating system. See also You can read more about the HeavyLoad Utility available under the JAM Software page at http://www.jam-software.com/heavyload/ You can read more about StressLinux at http://www.stresslinux.org/sl/ Basics of formatting output from PowerShell objects Anything that exists in a PowerShell object can be output as a report, e-mail, or editable file. Formatting the output is a simple task in PowerShell. Sometimes, the information you receive in the object is in a long decimal number format, but to make it more readable, you want to truncate the output to just a couple decimal places. In this section, you will take a look at the Format-Table, Format-Wide, and Format-List cmdlets. You will dig into the Format-Custom cmdlet and also take a look at the -f format operator. The truth is that native cmdlets do a great job returning data using default formatting. When we start changing and adding our own data to the list of properties returned, the formatting can become unoptimized. Even in the returned values of a native cmdlet, some columns might be too narrow to display all of the information. Getting ready To begin this topic, you will need the PowerShell ISE. How to do it… In order to format the output from PowerShell objects, perform the following steps: Run Add-PSSnapIn VMware.VimAutomation.Core in the PowerShell ISE to initialize a PowerCLI session and bring in the VMware cmdlet. Connect to your vCenter server. Start with a simple object from a Get-VM cmdlet. The default output is in a table format. If you pipe the object to Format-Wide, it will change the default output into a multicolumn with a single property, just like running a dir /w command at the Windows Command Prompt. You can also use FW, an alias for Format-Wide: Get-VM | Format-Wide Get-VM | FW If you take the same object and pipe it to Format-Table or its alias FT, you will receive the same output if you use the default output for Get-VM: Get-VM Get-VM | Format-Table However, as soon as you begin to select a different order of properties, the default formatting disappears. Select the same four properties and watch the formatting change. The default formatting disappears. Get-VM | Select Name, PowerState, NumCPU, MemoryGB | FT To restore formatting to table output, you have a few choices. You can change the formatting on the data in the object using the Select statement and calculated expressions. You can also pass formatting through the Format-Table cmdlet. While setting formatting in the Select statement changes the underlying data, using Format-Table doesn't change the data, but only its display. The formatting looks essentially like a calculated expression in a Select statement. You provide Label, Expression, and formatting commands: Get-VM | Select * | FT Name, PowerState, NumCPU, @{Label="MemoryGB"; Expression={$_.MemoryGB}; FormatString="N2"; Alignment="left"} If you have data in a number data type, you can convert it into a string using the ToString() method on the object. You can try this method on NumCPU: Get-VM | Select * | FT Name, PowerState, @{Label="Num CPUs"; Expression={($_.NumCpu).ToString()}; Alignment="left"}, @{Label="MemoryGB"; Expression={$_.MemoryGB}; FormatString="N2"; Alignment="left"} The other method is to format with the -f operator, which is basically a .NET derivative. To better understand the formatting and string, the structure is {<index>[,<alignment>][:<formatString>]}. Index sets that are a part of the data being passed, will be transformed. The alignment is a numeric value. A positive number will right-align those number of characters. A negative number will left-align those number of characters. The formatString parameter is the part that defines the format to apply. In this example, let's take a datastore and compute the percentage of free disk space. The format for percent is p: Get-Datastore | Select Name, @{N="FreePercent";E={"{0:p} -f ($_.FreeSpaceGB / $_.CapacityGB)}} To make the FreePercent column 15 characters wide, you add 0,15:p to the format string: Get-Datastore | Select Name, @{N="FreePercent";E={"{0,15:p} -f ($_.FreeSpaceGB / $_.CapacityGB)}} How it works… With the Format-Table, Format-List, and Format-Wide cmdlets, you can change the display of data coming from a PowerCLI object. These cmdlets all apply basic transformations without changing the data in the object. This is important to note because once the data is changed, it can prevent you from making changes. For instance, if you take the percentage example, after transforming the FreePercent property, it is stored as a string and no longer as a number, which means that you can't reformat it again. Applying a similar transformation from the Format-Table cmdlet would not alter your data. This doesn't matter when you're performing a one-liner, but in a more complex script or in a routine, where you might need to not only output the data but also reuse it, changing the data in the object is a big deal. There's more… This topic only begins to tap the full potential of PowerShell's native -f format operator. There are hundreds of blog posts about this topic, and there are use cases and examples of how to produce the formatting that you are looking for. The following link also gives you more details about the operator and formatting strings that you can use in your own code. See also For more information, refer to the PowerShell -f Format operator page available at http://ss64.com/ps/syntax-f-operator.html Sending output to CSV and HTML On the screen the output is great, but there are many times when you need to share your results with other people. When looking at sharing information, you want to choose a format that is easy to view and interpret. You might also want a format that is easy to manipulate and change. Comma Separated Values (CSV) files allow the user to take the output you generate and use it easily within a spreadsheet software. This allows you the ability to compare the results from vSphere versus internal tracking databases or other systems easily to find differences. It can also be useful to compare against service contracts for physical hosts as examples. HTML is a great choice for displaying information for reading, but not manipulation. Since e-mails can be in an HTML format, converting the output from PowerCLI (or PowerShell) into an e-mail is an easy way to assemble an e-mail to other areas of the business. What's even better about these cmdlets is the ease of use. If you have a data object in PowerCLI, all that you need to do is pipe that data object into the ConvertTo-CSV or ConvertTo-HTML cmdlets and you instantly get the formatted data. You might not be satisfied with the HTML-generated version alone, but like any other HTML, you can transform the look and formatting of the HTML using CSS by adding a header. In this topic, you will examine the conversion cmdlets with a simple set of Get- cmdlets. You will also take a look at trimming results using the Select statements and formatting HTML results with CSS. This topic will pull a list of virtual machines and their basic properties to send to a manager who can reconcile it against internal records or system monitoring. It will export to a CSV file that will be attached to the e-mail and you will use the HTML to format a list in an e-mail to send to the manager. Getting ready To begin this topic, you will need to use the PowerShell ISE. How to do it… In order to examine the conversion cmdlets using Get- cmdlets, trim results using the Select statements, and format HTML results with CSS, perform the following steps: Open the PowerShell ISE and run Add-PSSnapIn VMware.VimAutomation.Core to initialize a PowerCLI session within the ISE. Again, you will use the Get-VM cmdlet as the base for this topic. The fields that we care about are the name of the VM, the number of CPUs, the amount of memory, and the description: $VMs = Get-VM | Select Name, NumCPU, MemoryGB, Description In addition to the top-level data, you also want to provide the IP address, hostname, and the operating system. These are all available from the ExtensionData.Guest property: $VMs = Get-VM | Select Name, NumCPU, MemoryGB, Description, @{N="Hostname";E={$_.ExtensionData.Guest.HostName}}, @{N="IP";E={$_.ExtensionData.Guest.IPAddress}}, @{N="OS";E={$_.ExtensionData.Guest.GuestFullName}} The next step is to take this data and format it to be sent as an HTML e-mail. Converting the information to HTML is actually easy. Pipe the variable you created with the data into ConvertTo-HTML and store in a new variable. You will need to reuse the data to convert it to a CSV file to attach: $HTMLBody = $VMs | ConvertTo-HTML If you were to output the contents of $HTMLBody, you will see that it is very plain, inheriting the defaults of the browser or e-mail program used to display it. To dress this up, you need to define some basic CSS to add some style for the <body>, <table>, <tr>, <td>, and <th> tags. You can add this by running the ConvertTo-HTML cmdlet again with the -PreContent parameter: $css = "<style> body { font-family: Verdana, sans-serif; fontsize: 14px; color: #666; background: #FFF; } table{ width:100%; border-collapse:collapse; } table td, table th { border:1px solid #333; padding: 4px; } table th { text-align:left; padding: 4px; background-color:#BBB; color:#FFF;} </style>" $HTMLBody = $VMs | ConvertTo-HTML -PreContent $css It might also be nice to add the date and time generated to the end of the file. You can use the -PostContent parameter to add this: $HTMLBody = $VMs | ConvertTo-HTML -PreContent $css -PostContent "<div><strong>Generated:</strong> $(Get-Date)</div>" Now, you have the HTML body of your message. To take the same data from $VMs and save it to a CSV file that you can use, you will need a writable directory, and a good choice is to use your My Documents folder. You can obtain this using an environment variable: $tempdir = [environment]::getfolderpath("mydocuments") Now that you have a temp directory, you can perform your export. Pipe $VMs to Export-CSV and specify the path and filename: $VMs | Export-CSV $tempdirVM_Inventory.csv At this point, you are ready to assemble an e-mail and send it along with your attachment. Most of the cmdlets are straightforward. You set up a $msg variable that is a MailMessage object. You create an Attachment object and populate it with your temporary filename and then create an SMTP server with the server name: $msg = New-Object Net.Mail.MailMessage $attachment = new-object Net.Mail.Attachment("$tempdirVM_Inventory.csv") $smtpServer = New-Object Net.Mail.SmtpClient("hostname") You set the From, To, and Subject parameters of the message variable. All of these are set with dot notation on the $msg variable: $msg.From = "fromaddress@yourcompany.com" $msg.To.Add("admin@yourcompany.com") $msg.Subject = "Weekly VM Report" You set the body you created earlier, as $HTMLBody, but you need to run it through Out-String to convert any other data types to a pure string for e-mailing. This prevents an error where System.String[] appears instead of your content in part of the output: $msg.Body = $HTMLBody | Out-String You need to take the attachment and add it to the message: $msg.Attachments.Add($attachment) You need to set the message to an HTML format; otherwise, the HTML will be sent as plain text and not displayed as an HTML message: $msg.IsBodyHtml = $true Finally, you are ready to send the message using the $smtpServer variable that contains the mail server object. Pass in the $msg variable to the server object using the Send method and it transmits the message via SMTP to the mail server: $smtpServer.Send($msg) Don't forget to clean up the temporary CSV file you generated. To do this, use the PowerShell Remove-Item cmdlet that will remove the file from the filesystem. Add a -Confirm parameter to suppress any prompts: Remove-Item $tempdirVM_Inventory.csv -Confirm:$false How it works… Most of this topic relies on native PowerShell and less on the PowerCLI portions of the language. This is the beauty of PowerCLI. Since it is based on PowerShell and only an extension, you lose none of the functions of PowerShell, a very powerful set of commands in its own right. The ConvertTo-HTML cmdlet is very easy to use. It requires no parameters to produce HTML, but the HTML it produces isn't the most legible if you display it. However, a bit of CSS goes a long way to improve the look of the output. Add some colors and style to the table and it becomes a really easy and quick way to format a mail message of data to be sent to a manager on a weekly basis. The Export-CSV cmdlet lets you take the data returned by a cmdlet and convert that into an editable format for use. You can place this onto a file share for use or you can e-mail it along, as you did in this topic. This topic takes you step by step through the process of creating a mail message, formatting it in HTML, and making sure that it's relayed as an HTML message. You also looked at how to attach a file. To send a mail, you define a mail server as an object and store it in a variable for reuse. You create a message object and store it in a variable and then set all of the appropriate configuration on the message. For an attachment, you create a third object and define a file to be attached. That is set as a property on the message object and then finally, the message object is sent using the server object. There's more… ConvertTo-HTML is just one of four conversion cmdlets in PowerShell. In addition to ConvertTo-HTML, you can convert data objects into XML. ConvertTo-JSON allows you to convert a data object into an XML format specific for web applications. ConvertTo-CSV is identical to Export-CSV except that it doesn't save the content immediately to a defined file. If you had a use case to manipulate the CSV before saving it, such as stripping the double quotes or making other alternations to the contents, you can use ConvertTo-CSV and then save it to a file at a later point in your script. Reporting VM objects created during a predefined time period from VI Events object An important auditing tool in your environment can be a report of when virtual machines were created, cloned, or deleted. Unlike snapshots, that store a created date on the snapshot, virtual machines don't have this property associated with them. Instead, you have to rely on the events log in vSphere to let you know when virtual machines were created. PowerCLI has the Get-VIEvents cmdlet that allows you to retrieve the last 1,000 events on the vCenter, by default. The cmdlet can accept a parameter to include more than the last 1,000 events. The cmdlet also allows you to specify a start date, and this can allow you to search for things within the past week or the past month. At a high level, this topic works the same in both PowerCLI and the vSphere SDK for Perl (VIPerl). They both rely on getting the vSphere events and selecting the specific events that match your criteria. Even though you are looking for VM creation events in this topic, you will see that the code can be easily adapted to look for many other types of events. Getting ready To begin this topic, you will need a PowerCLI window and an active connection to a vCenter server. How to do it… In order to report VM objects created during a predefined time period from VI Events object, perform the following steps: You will use the Get-VIEvent cmdlet to retrieve the VM creation events for this topic. To begin, get a list of the last 50 events from the vCenter host using the -MaxSamples parameter: Get-VIEvent -MaxSamples 50 If you pipe the output from the preceding cmdlet to Get-Member, you will see that this cmdlet can return a lot of different objects. However, the type of object isn't really what you need to find the VM's created events. Looking through the objects, they all include a GetType() method that returns the type of event. Inside the type, there is a name parameter. Create a calculated expression using GetType() and then group it based on this expression, you will get a usable list of events you can search for. This list is also good for tracking the number of events your systems have encountered or generated: Get-VIEvent -MaxSamples 2000 | Select @{N="Type";E={$_.GetType().Name}} | Group Type In the preceding screenshot, you will see that there are VMClonedEvent, VmRemovedEvent, and VmCreatedEvent listed. All of these have to do with creating or removing virtual machines in vSphere. Since you are looking for created events, VMClonedEvent and VmCreatedEvent are the two needed for this script. Write a Where statement to return only these events. To do this, we can use a regular expression with both the event names and the -match PowerShell comparison parameter: Get-VIEvent -MaxSamples 2000 | Where {$_.GetType().Name -match "(VmCreatedEvent|VmClonedEvent)"} Next, you want to select just the properties that you want in your output. To do this, add a Select statement and you will reuse the calculated expression from Step 3. If you want to return the VM name, which is in a Vm property with the type of VMware.Vim.VVmeventArgument, you can create a calculated expression to return the VM name. To round out the output, you can include the FullFormattedMessage, CreatedTime, and UserName properties: Get-VIEvent -MaxSamples 2000 | Where {$_.GetType().Name -match "(VmCreatedEvent|VmClonedEvent)"} | Select @{N="Type",E={$_.GetType().Name}}, @{N="VMName",E={$_.Vm.Name}},FullFormattedMessage, CreatedTime, UserName The last thing you will want to do is go back and add a time frame to the Get-VIEvent cmdlet. You can do this by specifying the -Start parameter along with (Get-Date).AddMonths(-1) to return the last month's events: Get-VIEvent -MaxSamples 2000 | Where {$_.GetType().Name -match "(VmCreatedEvent|VmClonedEvent)"} | Select @{N="Type",E={$_.GetType().Name}}, @{N="VMName",E={$_.Vm.Name}},FullFormattedMessage, CreatedTime, UserName How it works… The Get-VIEvent cmdlet drives a majority of this topic, but in this topic you only scratched the surface of the information you can unearth with Get-VIEvent. As you saw in the screenshot, there are so many different types of events that can be reported, queried, and acted upon from the vCenter server. Once you discover and know which events you are looking for specifically, then it's a matter of scoping down the results with a Where statement. Last, you use calculated expressions to pull data that is several levels deep in the returned data object. One of the primary things employed here is a regular expression used to search for the types of events you were interested in: VmCreatedEvent and VmClonedEvent. By combining a regular expression with the -match operator, you were able to use a quick and very understandable bit of code to find more than one type of object you needed to return. There's more… Regular Expressions (RegEx) are big topics on their own. These types of searches can match any type of pattern that you can establish or in the case of this topic, a number of defined values that you are searching for. RegEx are beyond the scope of this article, but they can be a big help anytime you have a pattern you need to search for and match, or perhaps more importantly, replace. You can use the -replace operator instead of –match to not only to find things that match your pattern, but also change them. See also For more information on Regular Expressions refer to http://ss64.com/ps/syntax-regex.html The PowerShell.com page: Text and Regular Expressions is available at http://powershell.com/cs/blogs/ebookv2/archive/2012/03/20/chapter-13-text-and-regular-expressions.aspx Setting custom properties to add useful context to your virtual machines Building on the use case for the Get-VIEvent cmdlet, Alan Renouf of VMware's PowerCLI team has a useful script posted on his personal blog (refer to the See also section) that helps you pull the created date and the user who created a virtual machine and populate this into a custom attribute. This is a great use for a custom attribute on virtual machines and makes some useful information available that is not normally visible. This is a process that needs to be run often to pick up details for virtual machines that have been created. Rather than looking specifically at a VM and trying to go back and find its creation date as Alan's script does, in this article, you will take a different approach building on the previous article and populate the information from the found creation events. Maintenance in this form would be easier by finding creation events for the last week, running the script weekly, and updating the VMs with the data in the object rather than looking for VMs with missing data and searching through all of the events. This article assumes that you are using a Windows system that is joined to AD on the same domain as your vCenter. It also assumes that you have loaded the Remote Server Administration Tools for Windows so that the Active Directory PowerShell modules are available. This is a separate download for Windows 7. The Active Directory Module for PowerShell can be enabled on Windows 7, Windows 8, Windows Server 2008, and Windows Server 2012 in the Programs and Features control panel under Turn Windows features on or off. Getting ready To begin this script, you will need the PowerShell ISE. How to do it… I order to set custom properties to add useful context to your virtual machines, perform the following steps: Open the PowerShell ISE and run Add-PSSnapIn VMware.VimAutomation.Core to initialize a PowerCLI session within the ISE. The first step is to create a custom attribute in vCenter for the CreatedBy and CreateDate attributes: New-CustomAttribute -TargetType VirtualMachine -Name CreatedBy New-CustomAttribute -TargetType VirtualMachine -Name CreateDate Before you begin the scripting, you will need to run ImportSystemModules to bring in the Active Directory cmdlets that you will use later to lookup the username and reference it back to a display name: ImportSystemModules Next, you need to locate and pull out all of the creation events with the same code as the Reporting VM objects created during a predefined time period from VI Events object topic. You will assign the events to a variable for processing in a loop in this case; however, you will also want to change the period to 1 week (7 days) instead of 1 month: $Events = Get-VIEvent -Start (Get-Date).AddDays(-7) -MaxSamples25000 | Where {$_.GetType().Name -match "(VmCreatedEvent|VmClonedEvent)"} The next step is to begin a ForEach loop to pull the data and populate it into a custom attribute: ForEach ($Event in $Events) { The first thing to do in the loop is to look up the VM referenced in the Event's Vm parameter by name using Get-VM: $VM = Get-VM -Name $Event.Vm.Name Next, you can use the CreatedTime parameter on the event and set this as a custom attribute on the VM using the Set-Annotation cmdlet: $VM | Set-Annotation -CustomAttribute "CreateDate" -Value $Event.CreatedTime Next, you can use the Username parameter to lookup the display name of the user account who created the VM using Active Directory cmdlets. For the Active Directory cmdlets to be available, your client system or server needs to have the Microsoft Remote Server Administration Tools (RSAT) installed to make the Active Directory cmdlets available. The data coming from $Event.Username is in DOMAINusername format. You need just the username to perform a lookup with Get-AdUser, so that you can split on the backslash and return only the second item in the array resulting from the split command. After the lookup, the display name that you will want to use is in the Name property. You can retrieve it with dot notation: $User = (($Event.UserName.split(""))[1]) $DisplayName = (Get-AdUser $User).Name To do this, you need to use a built-in on the event and set this as a custom attribute on the VM using the Set-Annotation cmdlet: $VM | Set-Annotation -CustomAttribute "CreatedBy" -Value $DisplayName Finally, close the ForEach loop. } <# End ForEach #> How it works… This topic works by leveraging the Get-VIEvent cmdlet to search for events in the log from the last number of days. In larger environments, you might need to expand the -MaxSamples cmdlet well beyond the number in this example. There might be thousands of events per day in larger environments. The topic looks through the log and the Where statement returns only the creation events. Once you have the object with all of the creation events, you can loop through this and pull out the username of the person who created each virtual machine and the time they were created. Then, you just need to populate the data into the custom attributes created. There's more… Combine this script with the next topic and you have a great solution for scheduling this routine to run on a daily basis. Running it daily would certainly cut down on the number of events you need to process through to find and update the virtual machines that have been created with the information. You should absolutely go and read Alan Renouf's blog post on which this topic is based. This primary difference between this topic and the one Alan presents is the use of native Windows Active Directory PowerShell lookups in this topic instead of the Quest Active Directory PowerShell cmdlets. See also Virtu-Al.net: Who created that VM? is available at http://www.virtu-al.net/2010/02/23/who-created-that-vm/ Using PowerShell native capabilities to schedule scripts There is potentially a better and easier way to schedule your processes to run from PowerShell and PowerCLI and those are known as Scheduled Jobs. Scheduled Jobs were introduced in PowerShell 3.0 and distributed as part of the Windows Management Framework 3.0 and higher. While Scheduled Tasks can execute any Windows batch file or executable, Scheduled Jobs are specific to PowerShell and are used to generate and create background jobs that run once or on a specified schedule. Scheduled Jobs appear in the Windows Task Scheduler and can be managed with the scheduled task cmdlets of PowerShell. The only difference is that the scheduled jobs cmdlets cannot manage scheduled tasks. These jobs are stored in the MicrosoftWindowsPowerShellScheduledJobs path of the Windows Task Scheduler. You can see and edit them through the management console in Windows after creation. What's even greater about Scheduled Jobs in PowerShell is that you are not forced into creating a .ps1 file for every new job you need to run. If you have a PowerCLI one-liner that provides all of the functionality you need, you can simply include it in a job creation cmdlet without ever needing to save it anywhere. Getting ready To being this topic, you will need a PowerCLI window with an active connection to a vCenter server. How to do it… In order to schedule scripts using the native capabilities of PowerShell, perform the following steps: If you are running PowerCLI on systems lower than Windows 8 or Windows Server 2012, there's a chance that you are running PowerShell 2.0 and you will need to upgrade in order to use this. To check, run Get-PSVersion to see which version is installed on your system. If less than version 3.0, upgrade before continuing this topic. Throw back a script you have already written, the script to find and remove snapshots over 30 days old: Get-Snapshot -VM * | Where {$_.Created -LT (Get-Date).AddDays(-30)} | Remove-Snapshot -Confirm:$false To schedule a new job, the first thing you need to think about is what triggers your job to run. To define a new trigger, you use the New-JobTrigger cmdlet: $WeeklySundayAt6AM = New-JobTrigger -Weekly -At "6:00 AM" -DaysOfWeek Sunday –WeeksInterval 1 Like scheduled tasks, there are some options that can be set for a scheduled job. These include whether to wake the system to run: $Options = New-ScheduledJobOption –WakeToRun –StartIfIdle –MultipleInstancePolicy Queue Next, you will use the Register-ScheduledJob cmdlet. This cmdlet accepts a parameter named ScriptBlock and this is where you will specify the script that you have written. This method works best with one-liners, or scripts that execute in a single line of piped cmdlets. Since this is PowerCLI and not just PowerShell, you will need to add the VMware cmdlets and connect to vCenter at the beginning of the script block. You also need to specify the -Trigger and -ScheduledJobOption parameters that are defined in the previous two steps: Register-ScheduledJob -Name "Cleanup 30 Day Snapshots"-ScriptBlock { Add-PSSnapIn VMware.VimAutomation.Core; Connect-VIServer servers; Get-Snapshot -VM * | Where {$_.Created -LT (Get-Date).AddDays(-30)} | Remove-Snapshot -Confirm:$false} -Trigger$WeeklySundayAt6AM-ScheduledJobOption $Options You are not restricted to only running a script block. If you have a routine in a .ps1 file, you can easily run it from ScheduledJob also. For illustration, if you have a .ps1 file stored in c:Scripts named 30DaySnaps.ps1, you can use the following cmdlet to register a job: Register-ScheduledJob -Name "Cleanup 30 Day Snapshots"–FilePath c:Scripts30DaySnaps.ps1 -Trigger $WeeklySundayAt6AM-ScheduledJobOption $Options Rather than scheduling the scheduled job and defining the PowerShell in the job, a more maintainable method can be to write the module and then call the function from the scheduled job. One other requirement is that Single Sign-On should be configured so that the Connect-VIServer works correctly in the script: Register-ScheduledJob -Name "Cleanup 30 Day Snapshots"-ScriptBlock {Add-PSSnapIn VMware.VimAutomation.Core; Connect-ViServer server; Import-Module 30DaySnaps; Remove-30DaySnaps -VM*} - How it works… This topic leverages the scheduled jobs framework developed specifically for running PowerShell as scheduled tasks. It doesn't require you to configure all of the extra settings as you have seen in previous examples of scheduled tasks. These are PowerShell native cmdlets that know how to implement PowerShell on a schedule. One thing to keep in mind is that these jobs will begin with a normal PowerShell session—one that knows nothing about PowerCLI, by default. You will need to include Add-PSSnapIn VMware.VimAutomation.Core in each script block or the .ps1 file that you use with a scheduled job. There's more… There is a full library of cmdlets to implement and maintain scheduled jobs. You have Set-ScheduleJob that allows you to change the settings of a registered scheduled job on a Windows system. You can disable and enable scheduled jobs using the Disable-ScheduledJob and Enable-Scheduled job cmdlets. This allows you to pause the execution of a job during maintenance, or for other reasons, without needing to remove and resetup the job. This is especially helpful since the script blocks are inside the job and not saved in a separate .ps1 file. You can also configure remote scheduled jobs on other systems using the Invoke-Command PowerShell cmdlet. This concept is shown in examples on Microsoft TechNet in the documentation for the Register-ScheduledJob cmdlet. In addition to scheduling new jobs, you can remove jobs using the Unregister-ScheduledJob cmdlet. This cmdlet requires one of three identifying properties to unschedule a job. You can pass -Name with a string, -ID with the number identifying the job, or an object reference to the scheduled job with -InputObject. You can combine the Get-ScheduledJob cmdlet to find and pass the object by pipeline. See also To read more about Microsoft TechNet: PSScheduledJob Cmdlets, refer to http://technet.microsoft.com/en-us/library/hh849778.aspx Summary This article was all about leveraging the information and data from PowerCLI as well as how we can format and display this information. Resources for Article: Further resources on this subject: Introduction to vSphere Distributed switches [article] Creating an Image Profile by cloning an existing one [article] VMware View 5 Desktop Virtualization [article]

0
0
8874

article-image-classifying-real-world-examples

Packt

24 Mar 2015

32 min read

Classifying with Real-world Examples

Packt

24 Mar 2015

32 min read

This article by the authors, Luis Pedro Coelho and Willi Richert, of the book, Building Machine Learning Systems with Python - Second Edition, focuses on the topic of classification. (For more resources related to this topic, see here.) You have probably already used this form of machine learning as a consumer, even if you were not aware of it. If you have any modern e-mail system, it will likely have the ability to automatically detect spam. That is, the system will analyze all incoming e-mails and mark them as either spam or not-spam. Often, you, the end user, will be able to manually tag e-mails as spam or not, in order to improve its spam detection ability. This is a form of machine learning where the system is taking examples of two types of messages: spam and ham (the typical term for "non spam e-mails") and using these examples to automatically classify incoming e-mails. The general method of classification is to use a set of examples of each class to learn rules that can be applied to new examples. This is one of the most important machine learning modes and is the topic of this article. Working with text such as e-mails requires a specific set of techniques and skills. For the moment, we will work with a smaller, easier-to-handle dataset. The example question for this article is, "Can a machine distinguish between flower species based on images?" We will use two datasets where measurements of flower morphology are recorded along with the species for several specimens. We will explore these small datasets using a few simple algorithms. At first, we will write classification code ourselves in order to understand the concepts, but we will quickly switch to using scikit-learn whenever possible. The goal is to first understand the basic principles of classification and then progress to using a state-of-the-art implementation. The Iris dataset The Iris dataset is a classic dataset from the 1930s; it is one of the first modern examples of statistical classification. The dataset is a collection of morphological measurements of several Iris flowers. These measurements will enable us to distinguish multiple species of the flowers. Today, species are identified by their DNA fingerprints, but in the 1930s, DNA's role in genetics had not yet been discovered. The following four attributes of each plant were measured: sepal length sepal width petal length petal width In general, we will call the individual numeric measurements we use to describe our data features. These features can be directly measured or computed from intermediate data. This dataset has four features. Additionally, for each plant, the species was recorded. The problem we want to solve is, "Given these examples, if we see a new flower out in the field, could we make a good prediction about its species from its measurements?" This is the supervised learning or classification problem: given labeled examples, can we design a rule to be later applied to other examples? A more familiar example to modern readers who are not botanists is spam filtering, where the user can mark e-mails as spam, and systems use these as well as the non-spam e-mails to determine whether a new, incoming message is spam or not. For the moment, the Iris dataset serves our purposes well. It is small (150 examples, four features each) and can be easily visualized and manipulated. Visualization is a good first step Datasets will grow to thousands of features. With only four in our starting example, we can easily plot all two-dimensional projections on a single page. We will build intuitions on this small example, which can then be extended to large datasets with many more features. Visualizations are excellent at the initial exploratory phase of the analysis as they allow you to learn the general features of your problem as well as catch problems that occurred with data collection early. Each subplot in the following plot shows all points projected into two of the dimensions. The outlying group (triangles) are the Iris Setosa plants, while Iris Versicolor plants are in the center (circle) and Iris Virginica are plotted with x marks. We can see that there are two large groups: one is of Iris Setosa and another is a mixture of Iris Versicolor and Iris Virginica. In the following code snippet, we present the code to load the data and generate the plot: >>> from matplotlib import pyplot as plt >>> import numpy as np >>> # We load the data with load_iris from sklearn >>> from sklearn.datasets import load_iris >>> data = load_iris() >>> # load_iris returns an object with several fields >>> features = data.data >>> feature_names = data.feature_names >>> target = data.target >>> target_names = data.target_names >>> for t in range(3): ... if t == 0: ... c = 'r' ... marker = '>' ... elif t == 1: ... c = 'g' ... marker = 'o' ... elif t == 2: ... c = 'b' ... marker = 'x' ... plt.scatter(features[target == t,0], ... features[target == t,1], ... marker=marker, ... c=c) Building our first classification model If the goal is to separate the three types of flowers, we can immediately make a few suggestions just by looking at the data. For example, petal length seems to be able to separate Iris Setosa from the other two flower species on its own. We can write a little bit of code to discover where the cut-off is: >>> # We use NumPy fancy indexing to get an array of strings: >>> labels = target_names[target] >>> # The petal length is the feature at position 2 >>> plength = features[:, 2] >>> # Build an array of booleans: >>> is_setosa = (labels == 'setosa') >>> # This is the important step: >>> max_setosa =plength[is_setosa].max() >>> min_non_setosa = plength[~is_setosa].min() >>> print('Maximum of setosa: {0}.'.format(max_setosa)) Maximum of setosa: 1.9. >>> print('Minimum of others: {0}.'.format(min_non_setosa)) Minimum of others: 3.0. Therefore, we can build a simple model: if the petal length is smaller than 2, then this is an Iris Setosa flower; otherwise it is either Iris Virginica or Iris Versicolor. This is our first model and it works very well in that it separates Iris Setosa flowers from the other two species without making any mistakes. In this case, we did not actually do any machine learning. Instead, we looked at the data ourselves, looking for a separation between the classes. Machine learning happens when we write code to look for this separation automatically. The problem of recognizing Iris Setosa apart from the other two species was very easy. However, we cannot immediately see what the best threshold is for distinguishing Iris Virginica from Iris Versicolor. We can even see that we will never achieve perfect separation with these features. We could, however, look for the best possible separation, the separation that makes the fewest mistakes. For this, we will perform a little computation. We first select only the non-Setosa features and labels: >>> # ~ is the boolean negation operator >>> features = features[~is_setosa] >>> labels = labels[~is_setosa] >>> # Build a new target variable, is_virginica >>> is_virginica = (labels == 'virginica') Here we are heavily using NumPy operations on arrays. The is_setosa array is a Boolean array and we use it to select a subset of the other two arrays, features and labels. Finally, we build a new boolean array, virginica, by using an equality comparison on labels. Now, we run a loop over all possible features and thresholds to see which one results in better accuracy. Accuracy is simply the fraction of examples that the model classifies correctly. >>> # Initialize best_acc to impossibly low value >>> best_acc = -1.0 >>> for fi in range(features.shape[1]): ... # We are going to test all possible thresholds ... thresh = features[:,fi] ... for t in thresh: ... # Get the vector for feature `fi` ... feature_i = features[:, fi] ... # apply threshold `t` ... pred = (feature_i > t) ... acc = (pred == is_virginica).mean() ... rev_acc = (pred == ~is_virginica).mean() ... if rev_acc > acc: ... reverse = True ... acc = rev_acc ... else: ... reverse = False ... ... if acc > best_acc: ... best_acc = acc ... best_fi = fi ... best_t = t ... best_reverse = reverse We need to test two types of thresholds for each feature and value: we test a greater than threshold and the reverse comparison. This is why we need the rev_acc variable in the preceding code; it holds the accuracy of reversing the comparison. The last few lines select the best model. First, we compare the predictions, pred, with the actual labels, is_virginica. The little trick of computing the mean of the comparisons gives us the fraction of correct results, the accuracy. At the end of the for loop, all the possible thresholds for all the possible features have been tested, and the variables best_fi, best_t, and best_reverse hold our model. This is all the information we need to be able to classify a new, unknown object, that is, to assign a class to it. The following code implements exactly this method: def is_virginica_test(fi, t, reverse, example): "Apply threshold model to a new example" test = example[fi] > t if reverse: test = not test return test What does this model look like? If we run the code on the whole data, the model that is identified as the best makes decisions by splitting on the petal width. One way to gain intuition about how this works is to visualize the decision boundary. That is, we can see which feature values will result in one decision versus the other and exactly where the boundary is. In the following screenshot, we see two regions: one is white and the other is shaded in grey. Any datapoint that falls on the white region will be classified as Iris Virginica, while any point that falls on the shaded side will be classified as Iris Versicolor. In a threshold model, the decision boundary will always be a line that is parallel to one of the axes. The plot in the preceding screenshot shows the decision boundary and the two regions where points are classified as either white or grey. It also shows (as a dashed line) an alternative threshold, which will achieve exactly the same accuracy. Our method chose the first threshold it saw, but that was an arbitrary choice. Evaluation – holding out data and cross-validation The model discussed in the previous section is a simple model; it achieves 94 percent accuracy of the whole data. However, this evaluation may be overly optimistic. We used the data to define what the threshold will be, and then we used the same data to evaluate the model. Of course, the model will perform better than anything else we tried on this dataset. The reasoning is circular. What we really want to do is estimate the ability of the model to generalize to new instances. We should measure its performance in instances that the algorithm has not seen at training. Therefore, we are going to do a more rigorous evaluation and use held-out data. For this, we are going to break up the data into two groups: on one group, we'll train the model, and on the other, we'll test the one we held out of training. The full code, which is an adaptation of the code presented earlier, is available on the online support repository. Its output is as follows: Training accuracy was 96.0%. Testing accuracy was 90.0% (N = 50). The result on the training data (which is a subset of the whole data) is apparently even better than before. However, what is important to note is that the result in the testing data is lower than that of the training error. While this may surprise an inexperienced machine learner, it is expected that testing accuracy will be lower than the training accuracy. To see why, look back at the plot that showed the decision boundary. Consider what would have happened if some of the examples close to the boundary were not there or that one of them between the two lines was missing. It is easy to imagine that the boundary will then move a little bit to the right or to the left so as to put them on the wrong side of the border. The accuracy on the training data, the training accuracy, is almost always an overly optimistic estimate of how well your algorithm is doing. We should always measure and report the testing accuracy, which is the accuracy on a collection of examples that were not used for training. These concepts will become more and more important as the models become more complex. In this example, the difference between the accuracy measured on training data and on testing data is not very large. When using a complex model, it is possible to get 100 percent accuracy in training and do no better than random guessing on testing! One possible problem with what we did previously, which was to hold out data from training, is that we only used half the data for training. Perhaps it would have been better to use more training data. On the other hand, if we then leave too little data for testing, the error estimation is performed on a very small number of examples. Ideally, we would like to use all of the data for training and all of the data for testing as well, which is impossible. We can achieve a good approximation of this impossible ideal by a method called cross-validation. One simple form of cross-validation is leave-one-out cross-validation. We will take an example out of the training data, learn a model without this example, and then test whether the model classifies this example correctly. This process is then repeated for all the elements in the dataset. The following code implements exactly this type of cross-validation: >>> correct = 0.0 >>> for ei in range(len(features)): # select all but the one at position `ei`: training = np.ones(len(features), bool) training[ei] = False testing = ~training model = fit_model(features[training], is_virginica[training]) predictions = predict(model, features[testing]) correct += np.sum(predictions == is_virginica[testing]) >>> acc = correct/float(len(features)) >>> print('Accuracy: {0:.1%}'.format(acc)) Accuracy: 87.0% At the end of this loop, we will have tested a series of models on all the examples and have obtained a final average result. When using cross-validation, there is no circularity problem because each example was tested on a model which was built without taking that datapoint into account. Therefore, the cross-validated estimate is a reliable estimate of how well the models would generalize to new data. The major problem with leave-one-out cross-validation is that we are now forced to perform many times more work. In fact, you must learn a whole new model for each and every example and this cost will increase as our dataset grows. We can get most of the benefits of leave-one-out at a fraction of the cost by using x-fold cross-validation, where x stands for a small number. For example, to perform five-fold cross-validation, we break up the data into five groups, so-called five folds. Then you learn five models: each time you will leave one fold out of the training data. The resulting code will be similar to the code given earlier in this section, but we leave 20 percent of the data out instead of just one element. We test each of these models on the left-out fold and average the results. The preceding figure illustrates this process for five blocks: the dataset is split into five pieces. For each fold, you hold out one of the blocks for testing and train on the other four. You can use any number of folds you wish. There is a trade-off between computational efficiency (the more folds, the more computation is necessary) and accurate results (the more folds, the closer you are to using the whole of the data for training). Five folds is often a good compromise. This corresponds to training with 80 percent of your data, which should already be close to what you will get from using all the data. If you have little data, you can even consider using 10 or 20 folds. In the extreme case, if you have as many folds as datapoints, you are simply performing leave-one-out cross-validation. On the other hand, if computation time is an issue and you have more data, 2 or 3 folds may be the more appropriate choice. When generating the folds, you need to be careful to keep them balanced. For example, if all of the examples in one fold come from the same class, then the results will not be representative. We will not go into the details of how to do this, because the machine learning package scikit-learn will handle them for you. We have now generated several models instead of just one. So, "What final model do we return and use for new data?" The simplest solution is now to train a single overall model on all your training data. The cross-validation loop gives you an estimate of how well this model should generalize. A cross-validation schedule allows you to use all your data to estimate whether your methods are doing well. At the end of the cross-validation loop, you can then use all your data to train a final model. Although it was not properly recognized when machine learning was starting out as a field, nowadays, it is seen as a very bad sign to even discuss the training accuracy of a classification system. This is because the results can be very misleading and even just presenting them marks you as a newbie in machine learning. We always want to measure and compare either the error on a held-out dataset or the error estimated using a cross-validation scheme. Building more complex classifiers In the previous section, we used a very simple model: a threshold on a single feature. Are there other types of systems? Yes, of course! Many others. To think of the problem at a higher abstraction level, "What makes up a classification model?" We can break it up into three parts: The structure of the model: How exactly will a model make decisions? In this case, the decision depended solely on whether a given feature was above or below a certain threshold value. This is too simplistic for all but the simplest problems. The search procedure: How do we find the model we need to use? In our case, we tried every possible combination of feature and threshold. You can easily imagine that as models get more complex and datasets get larger, it rapidly becomes impossible to attempt all combinations and we are forced to use approximate solutions. In other cases, we need to use advanced optimization methods to find a good solution (fortunately, scikit-learn already implements these for you, so using them is easy even if the code behind them is very advanced). The gain or loss function: How do we decide which of the possibilities tested should be returned? Rarely do we find the perfect solution, the model that never makes any mistakes, so we need to decide which one to use. We used accuracy, but sometimes it will be better to optimize so that the model makes fewer errors of a specific kind. For example, in spam filtering, it may be worse to delete a good e-mail than to erroneously let a bad e-mail through. In that case, we may want to choose a model that is conservative in throwing out e-mails rather than the one that just makes the fewest mistakes overall. We can discuss these issues in terms of gain (which we want to maximize) or loss (which we want to minimize). They are equivalent, but sometimes one is more convenient than the other. We can play around with these three aspects of classifiers and get different systems. A simple threshold is one of the simplest models available in machine learning libraries and only works well when the problem is very simple, such as with the Iris dataset. In the next section, we will tackle a more difficult classification task that requires a more complex structure. In our case, we optimized the threshold to minimize the number of errors. Alternatively, we might have different loss functions. It might be that one type of error is much costlier than the other. In a medical setting, false negatives and false positives are not equivalent. A false negative (when the result of a test comes back negative, but that is false) might lead to the patient not receiving treatment for a serious disease. A false positive (when the test comes back positive even though the patient does not actually have that disease) might lead to additional tests to confirm or unnecessary treatment (which can still have costs, including side effects from the treatment, but are often less serious than missing a diagnostic). Therefore, depending on the exact setting, different trade-offs can make sense. At one extreme, if the disease is fatal and the treatment is cheap with very few negative side-effects, then you want to minimize false negatives as much as you can. What the gain/cost function should be is always dependent on the exact problem you are working on. When we present a general-purpose algorithm, we often focus on minimizing the number of mistakes, achieving the highest accuracy. However, if some mistakes are costlier than others, it might be better to accept a lower overall accuracy to minimize the overall costs. A more complex dataset and a more complex classifier We will now look at a slightly more complex dataset. This will motivate the introduction of a new classification algorithm and a few other ideas. Learning about the Seeds dataset We now look at another agricultural dataset, which is still small, but already too large to plot exhaustively on a page as we did with Iris. This dataset consists of measurements of wheat seeds. There are seven features that are present, which are as follows: area A perimeter P compactness C = 4pA/P² length of kernel width of kernel asymmetry coefficient length of kernel groove There are three classes, corresponding to three wheat varieties: Canadian, Koma, and Rosa. As earlier, the goal is to be able to classify the species based on these morphological measurements. Unlike the Iris dataset, which was collected in the 1930s, this is a very recent dataset and its features were automatically computed from digital images. This is how image pattern recognition can be implemented: you can take images, in digital form, compute a few relevant features from them, and use a generic classification system. For the moment, we will work with the features that are given to us. UCI Machine Learning Dataset Repository The University of California at Irvine (UCI) maintains an online repository of machine learning datasets (at the time of writing, they list 233 datasets). Both the Iris and the Seeds dataset used in this article were taken from there. The repository is available online at http://archive.ics.uci.edu/ml/. Features and feature engineering One interesting aspect of these features is that the compactness feature is not actually a new measurement, but a function of the previous two features, area and perimeter. It is often very useful to derive new combined features. Trying to create new features is generally called feature engineering. It is sometimes seen as less glamorous than algorithms, but it often matters more for performance (a simple algorithm on well-chosen features will perform better than a fancy algorithm on not-so-good features). In this case, the original researchers computed the compactness, which is a typical feature for shapes. It is also sometimes called roundness. This feature will have the same value for two kernels, one of which is twice as big as the other one, but with the same shape. However, it will have different values for kernels that are very round (when the feature is close to one) when compared to kernels that are elongated (when the feature is closer to zero). The goals of a good feature are to simultaneously vary with what matters (the desired output) and be invariant with what does not. For example, compactness does not vary with size, but varies with the shape. In practice, it might be hard to achieve both objectives perfectly, but we want to approximate this ideal. You will need to use background knowledge to design good features. Fortunately, for many problem domains, there is already a vast literature of possible features and feature-types that you can build upon. For images, all of the previously mentioned features are typical and computer vision libraries will compute them for you. In text-based problems too, there are standard solutions that you can mix and match. When possible, you should use your knowledge of the problem to design a specific feature or to select which ones from the literature are more applicable to the data at hand. Even before you have data, you must decide which data is worthwhile to collect. Then, you hand all your features to the machine to evaluate and compute the best classifier. A natural question is whether we can select good features automatically. This problem is known as feature selection. There are many methods that have been proposed for this problem, but in practice very simple ideas work best. For the small problems we are currently exploring, it does not make sense to use feature selection, but if you had thousands of features, then throwing out most of them might make the rest of the process much faster. Nearest neighbor classification For use with this dataset, we will introduce a new classifier: the nearest neighbor classifier. The nearest neighbor classifier is very simple. When classifying a new element, it looks at the training data for the object that is closest to it, its nearest neighbor. Then, it returns its label as the answer. Notice that this model performs perfectly on its training data! For each point, its closest neighbor is itself, and so its label matches perfectly (unless two examples with different labels have exactly the same feature values, which will indicate that the features you are using are not very descriptive). Therefore, it is essential to test the classification using a cross-validation protocol. The nearest neighbor method can be generalized to look not at a single neighbor, but to multiple ones and take a vote amongst the neighbors. This makes the method more robust to outliers or mislabeled data. Classifying with scikit-learn We have been using handwritten classification code, but Python is a very appropriate language for machine learning because of its excellent libraries. In particular, scikit-learn has become the standard library for many machine learning tasks, including classification. We are going to use its implementation of nearest neighbor classification in this section. The scikit-learn classification API is organized around classifier objects. These objects have the following two essential methods: fit(features, labels): This is the learning step and fits the parameters of the model predict(features): This method can only be called after fit and returns a prediction for one or more inputs Here is how we could use its implementation of k-nearest neighbors for our data. We start by importing the KneighborsClassifier object from the sklearn.neighbors submodule: >>> from sklearn.neighbors import KNeighborsClassifier The scikit-learn module is imported as sklearn (sometimes you will also find that scikit-learn is referred to using this short name instead of the full name). All of the sklearn functionality is in submodules, such as sklearn.neighbors. We can now instantiate a classifier object. In the constructor, we specify the number of neighbors to consider, as follows: >>> classifier = KNeighborsClassifier(n_neighbors=1) If we do not specify the number of neighbors, it defaults to 5, which is often a good choice for classification. We will want to use cross-validation (of course) to look at our data. The scikit-learn module also makes this easy: >>> from sklearn.cross_validation import KFold >>> kf = KFold(len(features), n_folds=5, shuffle=True) >>> # `means` will be a list of mean accuracies (one entry per fold) >>> means = [] >>> for training,testing in kf: ... # We fit a model for this fold, then apply it to the ... # testing data with `predict`: ... classifier.fit(features[training], labels[training]) ... prediction = classifier.predict(features[testing]) ... ... # np.mean on an array of booleans returns fraction ... # of correct decisions for this fold: ... curmean = np.mean(prediction == labels[testing]) ... means.append(curmean) >>> print("Mean accuracy: {:.1%}".format(np.mean(means))) Mean accuracy: 90.5% Using five folds for cross-validation, for this dataset, with this algorithm, we obtain 90.5 percent accuracy. As we discussed in the earlier section, the cross-validation accuracy is lower than the training accuracy, but this is a more credible estimate of the performance of the model. Looking at the decision boundaries We will now examine the decision boundary. In order to plot these on paper, we will simplify and look at only two dimensions. Take a look at the following plot: Canadian examples are shown as diamonds, Koma seeds as circles, and Rosa seeds as triangles. Their respective areas are shown as white, black, and grey. You might be wondering why the regions are so horizontal, almost weirdly so. The problem is that the x axis (area) ranges from 10 to 22, while the y axis (compactness) ranges from 0.75 to 1.0. This means that a small change in x is actually much larger than a small change in y. So, when we compute the distance between points, we are, for the most part, only taking the x axis into account. This is also a good example of why it is a good idea to visualize our data and look for red flags or surprises. If you studied physics (and you remember your lessons), you might have already noticed that we had been summing up lengths, areas, and dimensionless quantities, mixing up our units (which is something you never want to do in a physical system). We need to normalize all of the features to a common scale. There are many solutions to this problem; a simple one is to normalize to z-scores. The z-score of a value is how far away from the mean it is, in units of standard deviation. It comes down to this operation: In this formula, f is the old feature value, f' is the normalized feature value, μ is the mean of the feature, and σ is the standard deviation. Both μ and σ are estimated from training data. Independent of what the original values were, after z-scoring, a value of zero corresponds to the training mean, positive values are above the mean, and negative values are below it. The scikit-learn module makes it very easy to use this normalization as a preprocessing step. We are going to use a pipeline of transformations: the first element will do the transformation and the second element will do the classification. We start by importing both the pipeline and the feature scaling classes as follows: >>> from sklearn.pipeline import Pipeline >>> from sklearn.preprocessing import StandardScaler Now, we can combine them. >>> classifier = KNeighborsClassifier(n_neighbors=1) >>> classifier = Pipeline([('norm', StandardScaler()), ('knn', classifier)]) The Pipeline constructor takes a list of pairs (str,clf). Each pair corresponds to a step in the pipeline: the first element is a string naming the step, while the second element is the object that performs the transformation. Advanced usage of the object uses these names to refer to different steps. After normalization, every feature is in the same units (technically, every feature is now dimensionless; it has no units) and we can more confidently mix dimensions. In fact, if we now run our nearest neighbor classifier, we obtain 93 percent accuracy, estimated with the same five-fold cross-validation code shown previously! Look at the decision space again in two dimensions: The boundaries are now different and you can see that both dimensions make a difference for the outcome. In the full dataset, everything is happening on a seven-dimensional space, which is very hard to visualize, but the same principle applies; while a few dimensions are dominant in the original data, after normalization, they are all given the same importance. Binary and multiclass classification The first classifier we used, the threshold classifier, was a simple binary classifier. Its result is either one class or the other, as a point is either above the threshold value or it is not. The second classifier we used, the nearest neighbor classifier, was a natural multiclass classifier, its output can be one of the several classes. It is often simpler to define a simple binary method than the one that works on multiclass problems. However, we can reduce any multiclass problem to a series of binary decisions. This is what we did earlier in the Iris dataset, in a haphazard way: we observed that it was easy to separate one of the initial classes and focused on the other two, reducing the problem to two binary decisions: Is it an Iris Setosa (yes or no)? If not, check whether it is an Iris Virginica (yes or no). Of course, we want to leave this sort of reasoning to the computer. As usual, there are several solutions to this multiclass reduction. The simplest is to use a series of one versus the rest classifiers. For each possible label ℓ, we build a classifier of the type is this ℓ or something else? When applying the rule, exactly one of the classifiers will say yes and we will have our solution. Unfortunately, this does not always happen, so we have to decide how to deal with either multiple positive answers or no positive answers. Alternatively, we can build a classification tree. Split the possible labels into two, and build a classifier that asks, "Should this example go in the left or the right bin?" We can perform this splitting recursively until we obtain a single label. The preceding diagram depicts the tree of reasoning for the Iris dataset. Each diamond is a single binary classifier. It is easy to imagine that we could make this tree larger and encompass more decisions. This means that any classifier that can be used for binary classification can also be adapted to handle any number of classes in a simple way. There are many other possible ways of turning a binary method into a multiclass one. There is no single method that is clearly better in all cases. The scikit-learn module implements several of these methods in the sklearn.multiclass submodule. Some classifiers are binary systems, while many real-life problems are naturally multiclass. Several simple protocols reduce a multiclass problem to a series of binary decisions and allow us to apply the binary models to our multiclass problem. This means methods that are apparently only for binary data can be applied to multiclass data with little extra effort. Summary Classification means generalizing from examples to build a model (that is, a rule that can automatically be applied to new, unclassified objects). It is one of the fundamental tools in machine. In a sense, this was a very theoretical article, as we introduced generic concepts with simple examples. We went over a few operations with the Iris dataset. This is a small dataset. However, it has the advantage that we were able to plot it out and see what we were doing in detail. This is something that will be lost when we move on to problems with many dimensions and many thousands of examples. The intuitions we gained here will all still be valid. You also learned that the training error is a misleading, over-optimistic estimate of how well the model does. We must, instead, evaluate it on testing data that has not been used for training. In order to not waste too many examples in testing, a cross-validation schedule can get us the best of both worlds (at the cost of more computation). We also had a look at the problem of feature engineering. Features are not predefined for you, but choosing and designing features is an integral part of designing a machine learning pipeline. In fact, it is often the area where you can get the most improvements in accuracy, as better data beats fancier methods. Resources for Article: Further resources on this subject: Ridge Regression [article] The Spark programming model [article] Using cross-validation [article]

0
0
17882

Packt

24 Mar 2015

7 min read

Category Theory

Packt

24 Mar 2015

7 min read

In this article written by Dan Mantyla, author of the book Functional Programming in JavaScript, you would get to learn about some of the JavaScript functors such as Maybes, Promises and Lenses. (For more resources related to this topic, see here.) Maybes Maybes allow us to gracefully work with data that might be null and to have defaults. A maybe is a variable that either has some value or it doesn't. And it doesn't matter to the caller. On its own, it might seem like this is not that big a deal. Everybody knows that null-checks are easily accomplished with an if-else statement: if (getUsername() == null ) { username = 'Anonymous') { else { username = getUsername(); } But with functional programming, we're breaking away from the procedural, line-by-line way of doing things and instead working with pipelines of functions and data. If we had to break the chain in the middle just to check if the value existed or not, we would have to create temporary variables and write more code. Maybes are just tools to help us keep the logic flowing through the pipeline. To implement maybes, we'll first need to create some constructors. // the Maybe monad constructor, empty for now var Maybe = function(){}; // the None instance, a wrapper for an object with no value var None = function(){}; None.prototype = Object.create(Maybe.prototype); None.prototype.toString = function(){return 'None';}; // now we can write the `none` function // saves us from having to write `new None()` all the time var none = function(){return new None()}; // and the Just instance, a wrapper for an object with a value var Just = function(x){return this.x = x;}; Just.prototype = Object.create(Maybe.prototype); Just.prototype.toString = function(){return "Just "+this.x;}; var just = function(x) {return new Just(x)}; Finally, we can write the maybe function. It returns a new function that either returns nothing or a maybe. It is a functor. var maybe = function(m){ if (m instanceof None) { return m; } else if (m instanceof Just) { return just(m.x); } else { throw new TypeError("Error: Just or None expected, " + m.toString() + " given."); } } And we can also create a functor generator just like we did with arrays. var maybeOf = function(f){ return function(m) { if (m instanceof None) { return m; } else if (m instanceof Just) { return just(f(m.x)); } else { throw new TypeError("Error: Just or None expected, " + m.toString() + " given."); } } } So Maybe is a monad, maybe is a functor, and maybeOf returns a functor that is already assigned to a morphism. We'll need one more thing before we can move forward. We'll need to add a method to the Maybe monad object that helps us use it more intuitively. Maybe.prototype.orElse = function(y) { if (this instanceof Just) { return this.x; } else { return y; } } In its raw form, maybes can be used directly. maybe(just(123)).x; // Returns 123 maybeOf(plusplus)(just(123)).x; // Returns 124 maybe(plusplus)(none()).orElse('none'); // returns 'none' Anything that returns a method that is then executed is complicated enough to be begging for trouble. So we can make it a little cleaner by calling on our curry() function. maybePlusPlus = maybeOf.curry()(plusplus); maybePlusPlus(just(123)).x; // returns 123 maybePlusPlus(none()).orElse('none'); // returns none But the real power of maybes will become clear when the dirty business of directly calling the none() and just() functions is abstracted. We'll do this with an example object User, that uses maybes for the username. var User = function(){ this.username = none(); // initially set to `none` }; User.prototype.setUsername = function(name) { this.username = just(str(name)); // it's now a `just }; User.prototype.getUsernameMaybe = function() { var usernameMaybe = maybeOf.curry()(str); return usernameMaybe(this.username).orElse('anonymous'); }; var user = new User(); user.getUsernameMaybe(); // Returns 'anonymous' user.setUsername('Laura'); user.getUsernameMaybe(); // Returns 'Laura' And now we have a powerful and safe way to define defaults. Keep this User object in mind because we'll be using it later. Promises The nature of promises is that they remain immune to changing circumstances. - Frank Underwood, House of Cards In functional programming, we're often working with pipelines and data flows: chains of functions where each function produces a data type that is consumed by the next. However, many of these functions are asynchronous: readFile, events, AJAX, and so on. Instead of using a continuation-passing style and deeply nested callbacks, how can we modify the return types of these functions to indicate the result? By wrapping them in promises. Promises are like the functional equivalent of callbacks. Obviously, callbacks are not all that functional because, if more than one function is mutating the same data, then there can be race conditions and bugs. Promises solve that problem. You should use promises to turn this: fs.readFile("file.json", function(err, val) { if( err ) { console.error("unable to read file"); } else { try { val = JSON.parse(val); console.log(val.success); } catch( e ) { console.error("invalid json in file"); } } }); Into the following code snippet: fs.readFileAsync("file.json").then(JSON.parse) .then(function(val) { console.log(val.success); }) .catch(SyntaxError, function(e) { console.error("invalid json in file"); }) .catch(function(e){ console.error("unable to read file") }); The preceding code is from the README for bluebird: a full featured Promises/A+ implementation with exceptionally good performance. Promises/A+ is a specification for implementing promises in JavaScript. Given its current debate within the JavaScript community, we'll leave the implementations up to the Promises/A+ team, as it is much more complex than maybes. But here's a partial implementation: // the Promise monad var Promise = require('bluebird'); // the promise functor var promise = function(fn, receiver) { return function() { var slice = Array.prototype.slice, args = slice.call(arguments, 0, fn.length - 1), promise = new Promise(); args.push(function() { var results = slice.call(arguments), error = results.shift(); if (error) promise.reject(error); else promise.resolve.apply(promise, results); }); fn.apply(receiver, args); return promise; }; }; Now we can use the promise() functor to transform functions that take callbacks into functions that return promises. var files = ['a.json', 'b.json', 'c.json']; readFileAsync = promise(fs.readFile); var data = files .map(function(f){ readFileAsync(f).then(JSON.parse) }) .reduce(function(a,b){ return $.extend({}, a, b) }); Lenses Another reason why programmers really like monads is that they make writing libraries very easy. To explore this, let's extend our User object with more functions for getting and setting values but, instead of using getters and setters, we'll use lenses. Lenses are first-class getters and setters. They allow us to not just get and set variables, but also to run functions over it. But instead of mutating the data, they clone and return the new data modified by the function. They force data to be immutable, which is great for security and consistency as well for libraries. They're great for elegant code no matter what the application, so long as the performance-hit of introducing additional array copies is not a critical issue. Before we write the lens() function, let's look at how it works. var first = lens( function (a) { return arr(a)[0]; }, // get function (a, b) { return [b].concat(arr(a).slice(1)); } // set ); first([1, 2, 3]); // outputs 1 first.set([1, 2, 3], 5); // outputs [5, 2, 3] function tenTimes(x) { return x * 10 } first.modify(tenTimes, [1,2,3]); // outputs [10,2,3] And here's how the lens() function works. It returns a function with get, set and mod defined. The lens() function itself is a functor. var lens = fuction(get, set) { var f = function (a) {return get(a)}; f.get = function (a) {return get(a)}; f.set = set; f.mod = function (f, a) {return set(a, f(get(a)))}; return f; }; Let's try an example. We'll extend our User object from the previous example. // userName :: User -> str var userName = lens( function (u) {return u.getUsernameMaybe()}, // get function (u, v) { // set u.setUsername(v); return u.getUsernameMaybe(); } ); var bob = new User(); bob.setUsername('Bob'); userName.get(bob); // returns 'Bob' userName.set(bob, 'Bobby'); //return 'Bobby' userName.get(bob); // returns 'Bobby' userName.mod(strToUpper, bob); // returns 'BOBBY' strToUpper.compose(userName.set)(bob, 'robert'); // returns 'ROBERT' userName.get(bob); // returns 'robert' Summary In this article, we got to learn some of the different functors that are used in JavaScript such as Maybes, Promises and Lenses along with their corresponding examples, thus making it easy for us to understand the concept clearly. Resources for Article: Further resources on this subject: Alfresco Web Scrpits [article] Implementing Stacks using JavaScript [article] Creating a basic JavaScript plugin [article]

0
0
3556

Packt

24 Mar 2015

15 min read

REST – What You Didn't Know

Packt

24 Mar 2015

15 min read

Nowadays, topics such as cloud computing and mobile device service feeds, and other data sources being powered by cutting-edge, scalable, stateless, and modern technologies such as RESTful web services, leave the impression that REST has been invented recently. Well, to be honest, it is definitely not! In fact, REST was defined at the end of the 20th century. This article by Valentin Bojinov, author of the book RESTful Web API Design with Node.js, will walk you through REST's history and will teach you how REST couples with the HTTP protocol. You will look at the five key principles that need to be considered while turning an HTTP application into a RESTful-service-enabled application. You will also look at the differences between RESTful and SOAP-based services. Finally, you will learn how to utilize already existing infrastructure for your benefit. In this article, we will cover the following topics: A brief history of REST REST with HTTP RESTful versus SOAP-based services Taking advantage of existing infrastructure (For more resources related to this topic, see here.) A brief history of REST Let's look at a time when the madness around REST made everybody talk restlessly about it! This happened back in 1999, when a request for comments was submitted to the Internet Engineering Task Force (IETF: http://www.ietf.org/) via RFC 2616: "Hypertext Transfer Protocol - HTTP/1.1." One of its authors, Roy Fielding, later defined a set of principles built around the HTTP and URI standards. This gave birth to REST as we know it today. Let's look at the key principles around the HTTP and URI standards, sticking to which will make your HTTP application a RESTful-service-enabled application: Everything is a resource Each resource is identifiable by a unique identifier (URI) Use the standard HTTP methods Resources can have multiple representation Communicate statelessly Principle 1 – everything is a resource To understand this principle, one must conceive the idea of representing data by a specific format and not by a physical file. Each piece of data available on the Internet has a format that could be described by a content type. For example, JPEG Images; MPEG videos; html, xml, and text documents; and binary data are all resources with the following content types: image/jpeg, video/mpeg, text/html, text/xml, and application/octet-stream. Principle 2 – each resource is identifiable by a unique identifier Since the Internet contains so many different resources, they all should be accessible via URIs and should be identified uniquely. Furthermore, the URIs can be in a human-readable format (frankly I do believe they should be), despite the fact that their consumers are more likely to be software programmers rather than ordinary humans. The URI keeps the data self-descriptive and eases further development on it. In addition, using human-readable URIs helps you to reduce the risk of logical errors in your programs to a minimum. Here are a few sample examples of such URIs: http://www.mydatastore.com/images/vacation/2014/summer http://www.mydatastore.com/videos/vacation/2014/winter http://www.mydatastore.com/data/documents/balance?format=xml http://www.mydatastore.com/data/archives/2014 These human-readable URIs expose different types of resources in a straightforward manner. In the example, it is quite clear that the resource types are as follows: Images Videos XML documents Some kinds of binary archive documents Principle 3 – use the standard HTTP methods The native HTTP protocol (RFC 2616) defines eight actions, also known as verbs: GET POST PUT DELETE HEAD OPTIONS TRACE CONNECT The first four of them feel just natural in the context of resources, especially when defining actions for resource data manipulation. Let's make a parallel with relative SQL databases where the native language for data manipulation is CRUD (short for Create, Read, Update, and Delete) originating from the different types of SQL statements: INSERT, SELECT, UPDATE and DELETE respectively. In the same manner, if you apply the REST principles correctly, the HTTP verbs should be used as shown here: HTTP verb Action Response status code GET Request an existing resource "200 OK" if the resource exists, "404 Not Found" if it does not exist, and "500 Internal Server Error" for other errors PUT Create or update a resource "201 CREATED" if a new resource is created, "200 OK" if updated, and "500 Internal Server Error" for other errors POST Update an existing resource "200 OK" if the resource has been updated successfully, "404 Not Found" if the resource to be updated does not exist, and "500 Internal Server Error" for other errors DELETE Delete a resource "200 OK" if the resource has been deleted successfully, "404 Not Found" if the resource to be deleted does not exist, and "500 Internal Server Error" for other errors There is an exception in the usage of the verbs, however. I just mentioned that POST is used to create a resource. For instance, when a resource has to be created under a specific URI, then PUT is the appropriate request: PUT /data/documents/balance/22082014 HTTP/1.1 Content-Type: text/xml Host: www.mydatastore.com <?xml version="1.0" encoding="utf-8"?> <balance date="22082014"> <Item>Sample item</Item> <price currency="EUR">100</price> </balance> HTTP/1.1 201 Created Content-Type: text/xml Location: /data/documents/balance/22082014 However, in your application you may want to leave it up to the server REST application to decide where to place the newly created resource, and thus create it under an appropriate but still unknown or non-existing location. For instance, in our example, we might want the server to create the date part of the URI based on the current date. In such cases, it is perfectly fine to use the POST verb to the main resource URI and let the server respond with the location of the newly created resource: POST /data/documents/balance HTTP/1.1Content-Type: text/xmlHost: www.mydatastore.com<?xml version="1.0" encoding="utf-8"?><balance date="22082014"><Item>Sample item</Item><price currency="EUR">100</price></balance>HTTP/1.1 201 CreatedContent-Type: text/xmlLocation: /data/documents/balance Principle 4 – resources can have multiple representations A key feature of a resource is that they may be represented in a different form than the one it is stored. Thus, it can be requested or posted in different representations. As long as the specified format is supported, the REST-enabled endpoint should use it. In the preceding example, we posted an xml representation of a balance, but if the server supported the JSON format, the following request would have been valid as well: POST /data/documents/balance HTTP/1.1Content-Type: application/jsonHost: www.mydatastore.com{"balance": {"date": ""22082014"","Item": "Sample item","price": {"-currency": "EUR","#text": "100"}}}HTTP/1.1 201 CreatedContent-Type: application/jsonLocation: /data/documents/balance Principle 5 – communicate statelessly Resource manipulation operations through HTTP requests should always be considered atomic. All modifications of a resource should be carried out within an HTTP request in isolation. After the request execution, the resource is left in a final state, which implicitly means that partial resource updates are not supported. You should always send the complete state of the resource. Back to the balance example, updating the price field of a given balance would mean posting a complete JSON document that contains all of the balance data, including the updated price field. Posting only the updated price is not stateless, as it implies that the application is aware that the resource has a price field, that is, it knows its state. Another requirement for your RESTful application is to be stateless; the fact that once deployed in a production environment, it is likely that incoming requests are served by a load balancer, ensuring scalability and high availability. Once exposed via a load balancer, the idea of keeping your application state at server side gets compromised. This doesn't mean that you are not allowed to keep the state of your application. It just means that you should keep it in a RESTful way. For example, keep a part of the state within the URI. The statelessness of your RESTful API isolates the caller against changes at the server side. Thus, the caller is not expected to communicate with the same server in consecutive requests. This allows easy application of changes within the server infrastructure, such as adding or removing nodes. Remember that it is your responsibility to keep your RESTful APIs stateless, as the consumers of the API would expect it to be. Now that you know that REST is around 15 years old, a sensible question would be, "why has it become so popular just quite recently?" My answer to the question is that we as humans usually reject simple, straightforward approaches, and most of the time, we prefer spending more time on turning complex solutions into even more complex and sophisticated solutions. Take classical SOAP web services for example. Their various WS-* specifications are so many and sometimes loosely defined in order to make different solutions from different vendors interoperable. The WS-* specifications need to be unified by another specification, WS-BasicProfile. This mechanism defines extra interoperability rules in order to ensure that all WS-* specifications in SOAP-based web services transported over HTTP provide different means of transporting binary data. This is again described in other sets of specifications such as SOAP with Attachment References (SwaRef) and Message Transmission Optimisation Mechanism (MTOM), mainly because the initial idea of the web service was to execute business logic and return its response remotely, not to transport large amounts of data. Well, I personally think that when it comes to data transfer, things should not be that complex. This is where REST comes into place by introducing the concept of resources and standard means to manipulate them. The REST goals Now that we've covered the main REST principles, let's dive deeper into what can be achieved when they are followed: Separation of the representation and the resource Visibility Reliability Scalability Performance Separation of the representation and the resource A resource is just a set of information, and as defined by principle 4, it can have multiple representations. However, the state of the resource is atomic. It is up to the caller to specify the content-type header of the http request, and then it is up to the server application to handle the representation accordingly and return the appropriate HTTP status code: HTTP 200 OK in the case of success HTTP 400 Bad request if a unsupported content type is requested, or for any other invalid request HTTP 500 Internal Server error when something unexpected happens during the request processing For instance, let's assume that at the server side, we have balance resources stored in an XML file. We can have an API that allows a consumer to request the resource in various formats, such as application/json, application/zip, application/octet-stream, and so on. It would be up to the API itself to load the requested resource, transform it into the requested type (for example, json or xml), and either use zip to compress it or directly flush it to the HTTP response output. It is the Accept HTTP header that specifies the expected representation of the response data. So, if we want to request our balance data inserted in the previous section in XML format, the following request should be executed: GET /data/balance/22082014 HTTP/1.1Host: my-computer-hostnameAccept: text/xmlHTTP/1.1 200 OKContent-Type: text/xmlContent-Length: 140<?xml version="1.0" encoding="utf-8"?><balance date="22082014"><Item>Sample item</Item><price currency="EUR">100</price></balance> To request the same balance in JSON format, the Accept header needs to be set to application/json: GET /data/balance/22082014 HTTP/1.1Host: my-computer-hostnameAccept: application/jsonHTTP/1.1 200 OKContent-Type: application/jsonContent-Length: 120{"balance": {"date": "22082014","Item": "Sample item","price": {"-currency": "EUR","#text": "100"}}} Visibility REST is designed to be visible and simple. Visibility of the service means that every aspect of it should self-descriptive and should follow the natural HTTP language according to principles 3, 4, and 5. Visibility in the context of the outer world would mean that monitoring applications would be interested only in the HTTP communication between the REST service and the caller. Since the requests and responses are stateless and atomic, nothing more is needed to flow the behavior of the application and to understand whether anything has gone wrong. Remember that caching reduces the visibility of you restful applications and should be avoided. Reliability Before talking about reliability, we need to define which HTTP methods are safe and which are idempotent in the REST context. So let's first define what safe and idempotent methods are: An HTTP method is considered to be safe provided that when requested, it does not modify or cause any side effects on the state of the resource An HTTP method is considered to be idempotent if its response is always the same, no matter how many times it is requested The following table lists shows you which HTTP method is safe and which is idempotent: HTTP Method Safe Idempotent GET Yes Yes POST No No PUT No Yes DELETE No Yes Scalability and performance So far, I have often stressed on the importance of having stateless implementation and stateless behavior for a RESTful web application. The World Wide Web is an enormous universe, containing a huge amount of data and a few times more users eager to get that data. Its evolution has brought about the requirement that applications should scale easily as their load increases. Scaling applications that have a state is hardly possible, especially when zero or close-to-zero downtime is needed. That's why being stateless is crucial for any application that needs to scale. In the best-case scenario, scaling your application would require you to put another piece of hardware for a load balancer. There would be no need for the different nodes to sync between each other, as they should not care about the state at all. Scalability is all about serving all your clients in an acceptable amount of time. Its main idea is keep your application running and to prevent Denial of Service (DoS) caused by a huge amount of incoming requests. Scalability should not be confused with performance of an application. Performance is measured by the time needed for a single request to be processed, not by the total number of requests that the application can handle. The asynchronous non-blocking architecture and event-driven design of Node.js make it a logical choice for implementing a well-scalable application that performs well. Working with WADL If you are familiar with SOAP web services, you may have heard of the Web Service Definition Language (WSDL). It is an XML description of the interface of the service. It is mandatory for a SOAP web service to be described by such a WSDL definition. Similar to SOAP web services, RESTful services also offer a description language named WADL. WADL stands for Web Application Definition Language. Unlike WSDL for SOAP web services, a WADL description of a RESTful service is optional, that is, consuming the service has nothing to do with its description. Here is a sample part of a WADL file that describes the GET operation of our balance service: <application ><grammer><include href="balance.xsd"/><include href="error.xsd"/></grammer><resources base="http://localhost:8080/data/balance/"><resource path="{date}"><method name="GET"><request><param name="date" type="xsd:string" style="template"/></request><response status="200"><representation mediaType="application/xml"element="service:balance"/><representation mediaType="application/json" /></response><response status="404"><representation mediaType="application/xml"element="service:balance"/></response></method></resource></resources></application> This extract of a WADL file shows how application-exposing resources are described. Basically, each resource must be a part of an application. The resource provides the URI where it is located with the base attribute, and describes each of its supported HTTP methods in a method. Additionally, an optional doc element can be used at resource and application to provide additional documentation about the service and its operations. Though WADL is optional, it significantly reduces the efforts of discovering RESTful services. Taking advantage of the existing infrastructure The best part of developing and distributing RESTful applications is that the infrastructure needed is already out there waiting restlessly for you. As RESTful applications use the existing web space heavily, you need to do nothing more than following the REST principles when developing. In addition, there are a plenty of libraries available out there for any platform, and I do mean any given platform. This eases development of RESTful applications, so you just need to choose the preferred platform for you and start developing. Summary In this article, you learned about the history of REST, and we made a slight comparison between RESTful services and classical SOAP Web services. We looked at the five key principles that would turn our web application into a REST-enabled application, and finally took a look at how RESTful services are described and how we can simplify the discovery of the services we develop. Now that you know the REST basics, we are ready to dive into the Node.js way of implementing RESTful services. Resources for Article: Further resources on this subject: Creating a RESTful API [Article] So, what is Node.js? [Article] CreateJS – Performing Animation and Transforming Function [Article]

0
0
1393

How-To Tutorials

article-image-spritekit-framework-and-physics-simulation

Packt

24 Mar 2015

15 min read

SpriteKit Framework and Physics Simulation

Packt

24 Mar 2015

15 min read

In this article by Bhanu Birani, author of the book iOS Game Programming Cookbook, you will learn about the SpriteKit game framework and about the physics simulation. (For more resources related to this topic, see here.) Getting started with the SpriteKit game framework With the release of iOS 7.0, Apple has introduced its own native 2D game framework called SpriteKit. SpriteKit is a great 2D game engine, which has support for sprite, animations, filters, masking, and most important is the physics engine to provide a real-world simulation for the game. Apple provides a sample game to get started with the SpriteKit called Adventure Game. The download URL for this example project is http://bit.ly/Rqaeda. This sample project provides a glimpse of the capability of this framework. However, the project is complicated to understand and for learning you just want to make something simple. To have a deeper understanding of SpriteKit-based games, we will be building a bunch of mini games in this book. Getting ready To get started with iOS game development, you have the following prerequisites for SpriteKit: You will need the Xcode 5.x The targeted device family should be iOS 7.0+ You should be running OS X 10.8.X or later If all the above requisites are fulfilled, then you are ready to go with the iOS game development. So let's start with game development using iOS native game framework. How to do it... Let's start building the AntKilling game. Perform the following steps to create your new SpriteKit project: Start your Xcode. Navigate to File | New | Project.... Then from the prompt window, navigate to iOS | Application | SpriteKit Game and click on Next. Fill all the project details in the prompt window and provide AntKilling as the project name with your Organization Name, device as iPhone, and Class Prefix as AK. Click on Next. Select a location on the drive to save the project and click on Create. Then build the sample project to check the output of the sample project. Once you build and run the project with the play button, you should see the following on your device: How it works... The following are the observations of the starter project: As you have seen, the sample project of SpriteKit plays a label with a background color. SpriteKit works on the concept of scenes, which can be understood as the layers or the screens of the game. There can be multiple scenes working at the same time; for example, there can be a gameplay scene, hud scene, and the score scene running at the same time in the game. Now we can look into the project for more detail arrangements of the starter project. The following are the observations: In the main directory, you already have one scene created by default called AKMyScene. Now click on AKMyScene.m to explore the code to add the label on the screen. You should see something similar to the following screenshot: Now we will be updating this file with our code to create our AntKilling game in the next sections. We have to fulfill a few prerequisites to get started with the code, such as locking the orientation to landscape as we want a landscape orientation game. To change the orientation of the game, navigate to AntKilling project settings | TARGETS | General. You should see something similar to the following screenshot: Now in the General tab, uncheck Portrait from the device orientation so that the final settings should look similar to the following screenshot: Now build and run the project. You should be able to see the app running in landscape orientation. The bottom-right corner of the screen shows the number of nodes with the frame rate. Introduction to physics simulation We all like games that have realistic effects and actions. In this article we will learn about the ways to make our games more realistic. Have you ever wondered how to provide realistic effect to game objects? It is physics that provides a realistic effect to the games and their characters. In this article, we will learn how to use physics in our games. While developing the game using SpriteKit, you will need to change the world of your game frequently. The world is the main object in the game that holds all the other game objects and physics simulations. We can also update the gravity of the gaming world according to our need. The default world gravity is 9.8, which is also the earth's gravity, World gravity makes all bodies fall down to the ground as soon as they are created. More about SpriteKit can be explored using the following link: https://developer.apple.com/library/ios/documentation/GraphicsAnimation/Conceptual/SpriteKit_PG/Physics/Physics.html Getting ready The first task is to create the world and then add bodies to it, which can interact according to the principles of physics. You can create game objects in the form of sprites and associate physics bodies to them. You can also set various properties of the object to specify its behavior. How to do it... In this section, we will learn about the basic components that are used to develop games. We will also learn how to set game configurations, including the world settings such as gravity and boundary. The initial step is to apply the gravity to the scene. Every scene has a physics world associated with it. We can update the gravity of the physics world in our scene using the following line of code: self.physicsWorld.gravity = CGVectorMake(0.0f, 0.0f); Currently we have set the gravity of the scene to 0, which means the bodies will be in a state of free fall. They will not experience any force due to gravity in the world. In several games we also need to set a boundary to the games. Usually, the bounds of the view can serve as the bounds for our physics world. The following code will help us to set up the boundary for our game, which will be as per the bounds of our game scene: // 1 Create a physics body that borders the screenSKPhysicsBody* gameBorderBody = [SKPhysicsBody bodyWithEdgeLoopFromRect:self.frame];// 2 Set physicsBody of scene to gameBorderBodyself.physicsBody = gameBorderBody;// 3 Set the friction of that physicsBody to 0self.physicsBody.friction = 0.0f; In the first line of code we are initializing a SKPhysicsBody object. This object is used to add the physics simulation to any SKSpriteNode. We have created the gameBorderBody as a rectangle with the dimensions equal to the current scene frame. Then we assign that physics object to the physicsBody of our current scene (every SKSpriteNode object has the physicsBody property through which we can associate physics bodies to any node). After this we update the physicsBody.friction. This line of code updates the friction property of our world. The friction property defines the friction value of one physics body with another physics body. Here we have set this to 0, in order to make the objects move freely, without slowing down. Every game object is inherited from the SKSpriteNode class, which allows the physics body to hold on to the node. Let us take an example and create a game object using the following code: // 1SKSpriteNode* gameObject = [SKSpriteNode spriteNodeWithImageNamed: @"object.png"];gameObject.name = @"game_object";gameObject.position = CGPointMake(self.frame.size.width/3, self.frame.size.height/3);[self addChild:gameObject]; // 2gameObject.physicsBody = [SKPhysicsBody bodyWithCircleOfRadius:gameObject.frame.size.width/2];// 3gameObject.physicsBody.friction = 0.0f; We are already familiar with the first few lines of code wherein we are creating the sprite reference and then adding it to the scene. Now in the next line of code, we are associating a physics body with that sprite. We are initializing the circular physics body with radius and associating it with the sprite object. Then we can update various other properties of the physics body such as friction, restitution, linear damping, and so on. The physics body properties also allow you to apply force. To apply force you need to provide the direction where you want to apply force. [gameObject.physicsBody applyForce:CGVectorMake(10.0f, -10.0f)]; In the code we are applying force in the bottom-right corner of the world. To provide the direction coordinates we have used CGVectorMake, which accepts the vector coordinates of the physics world. You can also apply impulse instead of force. Impulse can be defined as a force that acts for a specific interval of time and is equal to the change in linear momentum produced over that interval. [gameObject.physicsBody applyImpulse:CGVectorMake(10.0f, -10.0f)]; While creating games, we frequently use static objects. To create a rectangular static object we can use the following code: SKSpriteNode* box = [[SKSpriteNode alloc] initWithImageNamed: @"box.png"];box.name = @"box_object";box.position = CGPointMake(CGRectGetMidX(self.frame), box.frame.size.height * 0.6f);[self addChild:box];box.physicsBody = [SKPhysicsBody bodyWithRectangleOfSize:box.frame.size];box.physicsBody.friction = 0.4f;// make physicsBody staticbox.physicsBody.dynamic = NO; So all the code is the same except one special property, which is dynamic. By default this property is set to YES, which means that all the physics bodies will be dynamic by default and can be converted to static after setting this Boolean to NO. Static bodies do not react to any force or impulse. Simply put, dynamic physics bodies can move while the static physics bodies cannot . Integrating physics engine with games From this section onwards, we will develop a mini game that will have a dynamic moving body and a static body. The basic concept of the game will be to create an infinite bouncing ball with a moving paddle that will be used to give direction to the ball. Getting ready... To develop a mini game using the physics engine, start by creating a new project. Open Xcode and go to File | New | Project and then navigate to iOS | Application | SpriteKit Game. In the pop-up screen, provide the Product Name as PhysicsSimulation, navigate to Devices | iPhone and click on Next as shown in the following screenshot: Click on Next and save the project on your hard drive. Once the project is saved, you should be able to see something similar to the following screenshot: In the project settings page, just uncheck the Portrait from Device Orientation section as we are supporting only landscape mode for this game. Graphics and games cannot be separated for long; you will also need some graphics for this game. Download the graphics folder, drag it and import it into the project. Make sure that the Copy items into destination group's folder (if needed) is checked and then click on Finish button. It should be something similar to the following screenshot: How to do it... Now your project template is ready for a physics-based mini game. We need to update the game template project to get started with code game logic. Take the following steps to integrate the basic physics object in the game. Open the file GameScene.m .This class creates a scene that will be plugged into the games. Remove all code from this class and just add the following function: -(id)initWithSize:(CGSize)size { if (self = [super initWithSize:size]) { SKSpriteNode* background = [SKSpriteNode spriteNodeWithImageNamed:@"bg.png"]; background.position = CGPointMake(self.frame.size.width/2, self.frame.size.height/2); [self addChild:background]; } } This initWithSize method creates an blank scene of the specified size. The code written inside the init function allows you to add the background image at the center of the screen in your game. Now when you compile and run the code, you will observe that the background image is not placed correctly on the scene. To resolve this, open GameViewController.m. Remove all code from this file and add the following function; -(void)viewWillLayoutSubviews { [super viewWillLayoutSubviews]; // Configure the view. SKView * skView = (SKView *)self.view; if (!skView.scene) { skView.showsFPS = YES; skView.showsNodeCount = YES; // Create and configure the scene. GameScene * scene = [GameScene sceneWithSize:skView.bounds.size]; scene.scaleMode = SKSceneScaleModeAspectFill; // Present the scene. [skView presentScene:scene]; }} To ensure that the view hierarchy is properly laid out, we have implemented the viewWillLayoutSubviews method. It does not work perfectly in viewDidLayoutSubviews method because the size of the scene is not known at that time. Now compile and run the app. You should be able to see the background image correctly. It will look something similar to the following screenshot: So now that we have the background image in place, let us add gravity to the world. Open GameScene.m and add the following line of code at the end of the initWithSize method: self.physicsWorld.gravity = CGVectorMake(0.0f, 0.0f); This line of code will set the gravity of the world to 0, which means there will be no gravity. Now as we have removed the gravity to make the object fall freely, it's important to create a boundary around the world, which will hold all the objects of the world and prevent them to go off the screen. Add the following line of code to add the invisible boundary around the screen to hold the physics objects: // 1 Create a physics body that borders the screenSKPhysicsBody* gameborderBody = [SKPhysicsBody bodyWithEdgeLoopFromRect:self.frame];// 2 Set physicsBody of scene to borderBodyself.physicsBody = gameborderBody;// 3 Set the friction of that physicsBody to 0self.physicsBody.friction = 0.0f; In the first line, we are are creating an edge-based physics boundary object, with a screen size frame. This type of a physics body does not have any mass or volume and also remains unaffected by force and impulses. Then we associate the object with the physics body of the scene. In the last line we set the friction of the body to 0, for a seamless interaction between objects and the boundary surface. The final file should look something like the following screenshot: Now we have our surface ready to hold the physics world objects. Let us create a new physics world object using the following line of code: // 1SKSpriteNode* circlularObject = [SKSpriteNode spriteNodeWithImageNamed: @"ball.png"];circlularObject.name = ballCategoryName;circlularObject.position = CGPointMake(self.frame.size.width/3, self.frame.size.height/3);[self addChild:circlularObject]; // 2circlularObject.physicsBody = [SKPhysicsBody bodyWithCircleOfRadius:circlularObject.frame.size.width/2];// 3circlularObject.physicsBody.friction = 0.0f;// 4circlularObject.physicsBody.restitution = 1.0f;// 5circlularObject.physicsBody.linearDamping = 0.0f;// 6circlularObject.physicsBody.allowsRotation = NO; Here we have created the sprite and then we have added it to the scene. Then in the later steps we associate the circular physics body with the sprite object. Finally, we alter the properties of that physics body. Now compile and run the application; you should be able to see the circular ball on the screen as shown in screenshot below: The circular ball is added to the screen, but it does nothing. So it's time to add some action in the code. Add the following line of code at the end of the initWithSize method: [circlularObject.physicsBody applyImpulse:CGVectorMake(10.0f, -10.0f)]; This will apply the force on the physics body, which in turn will move the associated ball sprite as well. Now compile and run the project. You should be able to see the ball moving and then collide with the boundary and bounce back, as there is no friction between the boundary and the ball. So now we have the infinite bouncing ball in the game. How it works… There are several properties used while creating physics bodies to define their behavior in the physics world. The following is a detailed description of the properties used in the preceding code: Restitution property defines the bounciness of an object. Setting the restitution to 1.0f, means that the ball collision will be perfectly elastic with any object. This means that the ball will bounce back with a force equal to the impact. Linear Damping property allows the simulation of fluid or air friction. This is accomplished by reducing the linear velocity of the body. In our case, we do not want the ball to slow down while moving and hence we have set the restitution to 0.0f. There's more… You can read about all these properties in detail at Apple's developer documentation: https://developer.apple.com/library/IOs/documentation/SpriteKit/Reference/SKPhysicsBody_Ref/index.html. Summary In this article, you have learned about the SpriteKit game framework, how to create a simple game using SpriteKit framework, physics simulation, and also how to integrate physics engine with games. Resources for Article: Further resources on this subject: Code Sharing Between iOS and Android [article] Linking OpenCV to an iOS project [article] Interface Designing for Games in iOS [article]

0
0
3025

Packt

24 Mar 2015

5 min read

Introducing GameMaker

Packt

24 Mar 2015

5 min read

In this article by Nathan Auckett, author of the book GameMaker Essentials, you will learn what GameMaker is all about, who made it, what it is used for, and more. You will then also be learning how to install GameMaker on your computer that is ready for use. (For more resources related to this topic, see here.) In this article, we will cover the following topics: Understanding GameMaker Installing GameMaker: Studio What is this article about? Understanding GameMaker Before getting started with GameMaker, it is best to know exactly what it is and what it's designed to do. GameMaker is a 2D game creation software by YoYo Games. It was designed to allow anyone to easily develop games without having to learn complex programming languages such as C++ through the use of its drag and drop functionality. The drag and drop functionality allows the user to create games by visually organizing icons on screen, which represent actions and statements that will occur during the game. GameMaker also has a built-in programming language called GameMaker Language, or GML for short. GML allows users to type out code to be run during their game. All drag and drop actions are actually made up of this GML code. GameMaker is primarily designed for 2D games, and most of its features and functions are designed for 2D game creation. However, GameMaker does have the ability to create 3D games and has a number of functions dedicated to this. GameMaker: Studio There are a number of different versions of GameMaker available, most of which are unsupported because they are outdated; however, support can still be found in the GameMaker Community Forums. GameMaker: Studio is the first version of GameMaker after GameMaker HTML5 to allow users to create games and export them for use on multiple devices and operating systems including PC, Mac, Linux, and Android, on both mobile and desktop versions. GameMaker: Studio is designed to allow one code base (GML) to run on any device with minimal changes to the base code. Users are able to export their games to run on any supported device or system such as HTML5 without changing any code to make things work. GameMaker: Studio was also the first version available for download and use through the Steam marketplace. YoYo Games took advantage of the Steam workshop and allowed Steam-based users to post and share their creations through the service. GameMaker: Studio is sold in a number of different versions, which include several enhanced features and capabilities as the price gets higher. The standard version is free to download and use. However, it lacks some advanced features included in higher versions and only allows for exporting to the Windows desktop. The professional version is the second cheapest from the standard version. It includes all features, but only has the Windows desktop and Windows app exports. Other exports can be purchased at an extra cost ranging from $99.99 to $300. The master version is the most expensive of all the options. It comes with every feature and every export, including all future export modules in version 1.x. If you already own exports in the professional version, you can get the prices of those exports taken off the price of the master version. Installing GameMaker: Studio Installing GameMaker is performed much like any other program. In this case, we will be installing GameMaker: Studio as this is the most up-to-date version at this point. You can find the download at the YoYo Games website, https://www.yoyogames.com/. From the site, you can pick the free version or purchase one of the others. All the installations are basically the same. Once the installer is downloaded, we are ready to install GameMaker: Studio. This is just like installing any other program. Just run the file, and then follow the on-screen instructions to accept the license agreement, choose an install location, and install the software. On the first run, you may see a progress bar appear at the top left of your screen. This is just GameMaker running its first time setup. It will also do this during the update process as YoYo Games releases new features. Once it is done, you should see a welcome screen and will be prompted to enter your registration code. The key should be e-mailed to you when you make an account during the purchase/download process. Enter this key and your copy of GameMaker: Studio should be registered. You may be prompted to restart GameMaker at this time. Close GameMaker and re-open it and you should see the welcome screen and be able to choose from a number of options on it: What is this article about? We now have GameMaker: Studio installed and are ready to get started with it. In this article, we will be covering the essential things to know about GameMaker: Studio. This includes everything from drag and drop actions to programming in GameMaker using GameMaker Language (GML). You will learn about how things in GameMaker are structured, and how to organize resources to keep things as clean as possible. Summary In this article, we looked into what GameMaker actually is and learned that there are different versions available. We also looked at the different types of GameMaker: Studio available for download and purchase. We then learned how to install GameMaker: Studio, which is the final step in getting ready to learn the essential skills and starting to make our very own games. Resources for Article: Further resources on this subject: Getting Started – An Introduction to GML [Article] Animating a Game Character [Article] Why should I make cross-platform games? [Article]

0
0
8915

article-image-understanding-core-data-concepts

Packt

23 Mar 2015

10 min read

Understanding Core Data concepts

Packt

23 Mar 2015

10 min read

0
0
5249

Cassandra Architecture

Subscribing to a report

Fun with Sprites – Sky Defense

Geolocating photos on the map

WLAN Encryption Flaws

Cross-browser Tests using Selenium WebDriver

Prerequisites

Handling Image and Video Files

Creating Custom Reports and Notifications for vSphere

Classifying with Real-world Examples

Trending Topics

Category Theory

REST – What You Didn't Know

SpriteKit Framework and Physics Simulation

Introducing GameMaker

Understanding Core Data concepts

Create a Free Account To Continue Reading

Sign in to activate your 7-day free access