Packt
04 Mar 2015
20 min read

Writing Consumers

This article by Nishant Garg, the author of the book Learning Apache Kafka Second Edition, focuses on the details of writing consumers. Consumers are the applications that consume the messages published by Kafka producers and process the data extracted from them. Like producers, consumers can differ in nature: applications doing real-time or near real-time analysis, applications with NoSQL or data warehousing solutions, backend services, consumers for Hadoop, or other subscriber-based solutions. These consumers can also be implemented in different languages such as Java, C, and Python.

In this article, we will focus on the following topics:

- The Kafka Consumer API
- Java-based Kafka consumers
- Java-based Kafka consumers consuming partitioned messages

At the end of the article, we will explore some of the important properties that can be set for a Kafka consumer. So, let's start.

The preceding diagram explains the high-level working of the Kafka consumer when consuming messages. The consumer subscribes to message consumption from a specific topic on the Kafka broker. The consumer then issues a fetch request to the lead broker to consume the message partition, specifying the message offset (the position from which to start reading). The Kafka consumer therefore works in the pull model and always pulls all available messages after its current position in the Kafka log (the Kafka internal data representation).

While subscribing, the consumer connects to any of the live nodes and requests metadata about the leaders for the partitions of a topic. This allows the consumer to communicate directly with the lead broker receiving the messages. Kafka topics are divided into a set of ordered partitions, and each partition is consumed by only one consumer within a consumer group. As a partition is consumed, the consumer advances its offset to the next message to be consumed.
This offset represents the state of what has been consumed and also provides the flexibility of deliberately rewinding to an old offset and re-consuming the partition.

In the next few sections, we will discuss the API provided by Kafka for writing Java-based custom consumers. All the Kafka classes referred to in this article are actually written in Scala.

Kafka consumer APIs

Kafka provides two types of API for Java consumers:

- High-level API
- Low-level API

The high-level consumer API

The high-level consumer API is used when only the data is needed and the handling of message offsets is not required. This API hides broker details from the consumer and allows effortless communication with the Kafka cluster by providing an abstraction over the low-level implementation. The high-level consumer stores the last offset read from a specific partition (the position within the partition where the consumer left off) in ZooKeeper. This offset is stored under the consumer group name provided to Kafka at the beginning of the process.

The consumer group name is unique and global across the Kafka cluster, and starting new consumers with an in-use consumer group name may cause ambiguous behavior in the system. When a new process is started with an existing consumer group name, Kafka triggers a rebalance between the new and existing process threads for the consumer group. After the rebalance, some messages that are intended for a new process may go to an old process, causing unexpected results. To avoid this ambiguous behavior, any existing consumers should be shut down before starting new consumers for an existing consumer group name.
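The offset bookkeeping described above (pulling from the current position, storing the last offset per consumer group, and rewinding on purpose) can be modelled in a few lines. The following is a toy, in-memory sketch written in Python for brevity; it is not Kafka client code, and the names ToyOffsetStore and ToyConsumer are hypothetical stand-ins for ZooKeeper and a consumer process:

```python
# Toy model only: this is NOT the Kafka client API.
from collections import defaultdict


class ToyOffsetStore:
    """Stands in for ZooKeeper: remembers one offset per (group, partition)."""

    def __init__(self):
        self._offsets = defaultdict(int)  # unknown groups start at offset 0

    def load(self, group, partition):
        return self._offsets[(group, partition)]

    def commit(self, group, partition, offset):
        self._offsets[(group, partition)] = offset


class ToyConsumer:
    def __init__(self, group, partition, log, store):
        self.group, self.partition = group, partition
        self.log, self.store = log, store
        # Resume from wherever this group last committed.
        self.offset = store.load(group, partition)

    def poll(self, max_messages=2):
        """Pull up to max_messages starting at the current offset."""
        batch = self.log[self.offset:self.offset + max_messages]
        self.offset += len(batch)
        self.store.commit(self.group, self.partition, self.offset)
        return batch

    def rewind(self, offset):
        """Deliberately move back to re-consume older messages."""
        self.offset = offset


log = ["m0", "m1", "m2", "m3", "m4"]
store = ToyOffsetStore()

first = ToyConsumer("testgroup", 0, log, store)
batch_one = first.poll()

# A new process in the same group resumes at the committed offset.
second = ToyConsumer("testgroup", 0, log, store)
batch_two = second.poll()

# Rewinding makes already-consumed messages available again.
second.rewind(0)
replayed = second.poll(max_messages=1)
```

Because the stored position is keyed by the group name, a second process that reuses the name continues where the first one stopped, which is why reusing an in-use group name changes what each process sees.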
The following are the classes that are imported to write Java-based basic consumers using the high-level consumer API for a Kafka cluster:

ConsumerConnector: Kafka provides the ConsumerConnector interface (interface ConsumerConnector) that is further implemented by the ZookeeperConsumerConnector class (kafka.javaapi.consumer.ZookeeperConsumerConnector). This class is responsible for all the interaction a consumer has with ZooKeeper. The following is the class diagram for the ConsumerConnector class:

KafkaStream: Objects of the kafka.consumer.KafkaStream class are returned by the createMessageStreams call from the ConsumerConnector implementation. A list of KafkaStream objects is returned for each topic, and each stream can create an iterator over its messages. The following is the Scala-based class declaration:

    class KafkaStream[K,V](private val queue: BlockingQueue[FetchedDataChunk],
                           consumerTimeoutMs: Int,
                           private val keyDecoder: Decoder[K],
                           private val valueDecoder: Decoder[V],
                           val clientId: String)

Here, the parameters K and V specify the type for the partition key and message value, respectively. In the createMessageStreams call from the ConsumerConnector class, clients can specify the number of desired streams, where each stream object is used for single-threaded processing. These stream objects may represent the merging of multiple unique partitions.

ConsumerConfig: The kafka.consumer.ConsumerConfig class encapsulates the property values required for establishing the connection with ZooKeeper, such as the ZooKeeper URL, ZooKeeper session timeout, and ZooKeeper sync time. It also contains the property values required by the consumer, such as the group ID.

A high-level API-based working consumer example is discussed after the next section.

The low-level consumer API

The high-level API does not allow consumers to control interactions with brokers.
Also known as "simple consumer API", the low-level consumer API is stateless and provides fine-grained control over the communication between the Kafka broker and the consumer. It allows consumers to set the message offset with every request raised to the broker and maintains the metadata at the consumer's end. This API can be used by both online and offline consumers such as Hadoop. These types of consumers can also perform multiple reads for the same message or manage transactions to ensure the message is consumed only once.

Compared to the high-level consumer API, developers need to put in extra effort to gain low-level control within consumers by keeping track of offsets, figuring out the lead broker for the topic and partition, handling lead broker changes, and so on.

In the low-level consumer API, consumers first query the live broker to find out the details about the lead broker. Information about the live broker can be passed on to the consumers either using a properties file or from the command line. The topicsMetadata() method of the kafka.javaapi.TopicMetadataResponse class is used to find out metadata about the topic of interest from the lead broker.

For message partition reading, the kafka.api.OffsetRequest class defines two constants, EarliestTime and LatestTime, to find the beginning of the data in the logs and the new messages stream. These constants also help consumers to track which messages are already read.

The main class used within the low-level consumer API is the SimpleConsumer (kafka.javaapi.consumer.SimpleConsumer) class. The following is the class diagram for the SimpleConsumer class:

A simple consumer class provides a connection to the lead broker for fetching the messages from the topic and methods to get the topic metadata and the list of offsets.
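To make the role of the EarliestTime and LatestTime constants more concrete, here is a small Python sketch of how a consumer that tracks its own offsets might turn a requested position into a concrete starting offset. It is a toy model, not the SimpleConsumer API: resolve_start_offset is a hypothetical helper, and the numeric sentinel values (-2 and -1) reflect how Kafka 0.8 defines the two constants, so treat them as an assumption for other versions:

```python
# Sentinel values mirroring kafka.api.OffsetRequest in Kafka 0.8 (assumed).
EARLIEST_TIME = -2
LATEST_TIME = -1


def resolve_start_offset(requested, log_start, log_end):
    """Hypothetical helper: map a requested position to a concrete offset.

    log_start is the oldest offset still retained in the partition;
    log_end is the offset the next produced message will receive.
    """
    if requested == EARLIEST_TIME:
        return log_start          # begin with the oldest available data
    if requested == LATEST_TIME:
        return log_end            # only see messages produced from now on
    if log_start <= requested <= log_end:
        return requested          # an explicit offset tracked by the consumer
    raise ValueError(
        "offset %d is outside [%d, %d]" % (requested, log_start, log_end)
    )


from_beginning = resolve_start_offset(EARLIEST_TIME, 10, 50)
only_new = resolve_start_offset(LATEST_TIME, 10, 50)
resumed = resolve_start_offset(25, 10, 50)
```

An explicit offset that has fallen out of the retained range is rejected here; a real consumer would have to decide whether to restart from the earliest or the latest position in that case.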
A few more important classes for building different request objects are FetchRequest (kafka.api.FetchRequest), OffsetRequest (kafka.javaapi.OffsetRequest), OffsetFetchRequest (kafka.javaapi.OffsetFetchRequest), OffsetCommitRequest (kafka.javaapi.OffsetCommitRequest), and TopicMetadataRequest (kafka.javaapi.TopicMetadataRequest).

All the examples in this article are based on the high-level consumer API. For examples based on the low-level consumer API, refer to https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example.

Simple Java consumers

Now we will start writing a single-threaded simple Java consumer developed using the high-level consumer API for consuming the messages from a topic. This SimpleHLConsumer class is used to fetch a message from a specific topic and consume it, assuming that there is a single partition within the topic.

Importing classes

As a first step, we need to import the following classes:

    import kafka.consumer.ConsumerConfig;
    import kafka.consumer.ConsumerIterator;
    import kafka.consumer.KafkaStream;
    import kafka.javaapi.consumer.ConsumerConnector;

Defining properties

As a next step, we need to define properties for making a connection with ZooKeeper and pass these properties to the Kafka consumer using the following code:

    Properties props = new Properties();
    props.put("zookeeper.connect", "localhost:2181");
    props.put("group.id", "testgroup");
    props.put("zookeeper.session.timeout.ms", "500");
    props.put("zookeeper.sync.time.ms", "250");
    props.put("auto.commit.interval.ms", "1000");
    // Keep a reference so the configuration can be passed to the consumer
    ConsumerConfig config = new ConsumerConfig(props);

Now let us see the major properties mentioned in the code:

- zookeeper.connect: This property specifies the ZooKeeper <node:port> connection detail that is used to find the ZooKeeper running instance in the cluster. In the Kafka cluster, ZooKeeper is used to store offsets of messages consumed for a specific topic and partition by this consumer group.
- group.id: This property specifies the name for the consumer group shared by all the consumers within the group. This is also the process name used by ZooKeeper to store offsets.
- zookeeper.session.timeout.ms: This property specifies the ZooKeeper session timeout in milliseconds and represents the amount of time Kafka will wait for ZooKeeper to respond to a request before giving up and continuing to consume messages.
- zookeeper.sync.time.ms: This property specifies the ZooKeeper sync time in milliseconds between the ZooKeeper leader and the followers.
- auto.commit.interval.ms: This property defines the frequency in milliseconds at which consumer offsets get committed to ZooKeeper.

Reading messages from a topic and printing them

As a final step, we need to read the message using the following code:

    Map<String, Integer> topicMap = new HashMap<String, Integer>();
    // 1 represents the single thread
    topicMap.put(topic, new Integer(1));

    Map<String, List<KafkaStream<byte[], byte[]>>> consumerStreamsMap =
        consumer.createMessageStreams(topicMap);

    // Get the list of message streams for each topic, using the default decoder.
    List<KafkaStream<byte[], byte[]>> streamList = consumerStreamsMap.get(topic);

    for (final KafkaStream<byte[], byte[]> stream : streamList) {
      ConsumerIterator<byte[], byte[]> consumerIte = stream.iterator();
      while (consumerIte.hasNext())
        System.out.println("Message from Single Topic :: "
            + new String(consumerIte.next().message()));
    }

So the complete program will look like the following code:

    package kafka.examples.ch5;

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;

    import kafka.consumer.ConsumerConfig;
    import kafka.consumer.ConsumerIterator;
    import kafka.consumer.KafkaStream;
    import kafka.javaapi.consumer.ConsumerConnector;

    public class SimpleHLConsumer {
      private final ConsumerConnector consumer;
      private final String topic;

      public SimpleHLConsumer(String zookeeper, String groupId, String topic) {
        consumer = kafka.consumer.Consumer
            .createJavaConsumerConnector(createConsumerConfig(zookeeper, groupId));
        this.topic = topic;
      }

      private static ConsumerConfig createConsumerConfig(String zookeeper,
          String groupId) {
        Properties props = new Properties();
        props.put("zookeeper.connect", zookeeper);
        props.put("group.id", groupId);
        props.put("zookeeper.session.timeout.ms", "500");
        props.put("zookeeper.sync.time.ms", "250");
        props.put("auto.commit.interval.ms", "1000");

        return new ConsumerConfig(props);
      }

      public void testConsumer() {
        Map<String, Integer> topicMap = new HashMap<String, Integer>();

        // Define single thread for topic
        topicMap.put(topic, new Integer(1));

        Map<String, List<KafkaStream<byte[], byte[]>>> consumerStreamsMap =
            consumer.createMessageStreams(topicMap);

        List<KafkaStream<byte[], byte[]>> streamList = consumerStreamsMap.get(topic);

        for (final KafkaStream<byte[], byte[]> stream : streamList) {
          ConsumerIterator<byte[], byte[]> consumerIte = stream.iterator();
          // hasNext() blocks until a message arrives
          while (consumerIte.hasNext())
            System.out.println("Message from Single Topic :: "
                + new String(consumerIte.next().message()));
        }
        if (consumer != null)
          consumer.shutdown();
      }

      public static void main(String[] args) {
        String zooKeeper = args[0];
        String groupId = args[1];
        String topic = args[2];
        SimpleHLConsumer simpleHLConsumer =
            new SimpleHLConsumer(zooKeeper, groupId, topic);
        simpleHLConsumer.testConsumer();
      }
    }

Before running this, make sure you have created the topic kafkatopic from the command line:

    [root@localhost kafka_2.9.2-0.8.1.1]# bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 3 --topic kafkatopic

Before compiling and running a Java-based Kafka program in the console, make sure you download the slf4j-1.7.7.tar.gz file from http://www.slf4j.org/download.html and copy slf4j-log4j12-1.7.7.jar contained within slf4j-1.7.7.tar.gz to the /opt/kafka_2.9.2-0.8.1.1/libs directory. Also add all the libraries available in /opt/kafka_2.9.2-0.8.1.1/libs to the classpath using the following commands:

    [root@localhost kafka_2.9.2-0.8.1.1]# export KAFKA_LIB=/opt/kafka_2.9.2-0.8.1.1/libs
    [root@localhost kafka_2.9.2-0.8.1.1]# export CLASSPATH=.:$KAFKA_LIB/jopt-simple-3.2.jar:$KAFKA_LIB/kafka_2.9.2-0.8.1.1.jar:$KAFKA_LIB/log4j-1.2.15.jar:$KAFKA_LIB/metrics-core-2.2.0.jar:$KAFKA_LIB/scala-library-2.9.2.jar:$KAFKA_LIB/slf4j-api-1.7.2.jar:$KAFKA_LIB/slf4j-log4j12-1.7.7.jar:$KAFKA_LIB/snappy-java-1.0.5.jar:$KAFKA_LIB/zkclient-0.3.jar:$KAFKA_LIB/zookeeper-3.3.4.jar

Multithreaded Java consumers

The previous example is a very basic example of a consumer that consumes messages from a single broker with no explicit partitioning of messages within the topic. Let's jump to the next level and write another program that consumes messages from multiple partitions connecting to single/multiple topics.
A multithreaded, high-level, consumer-API-based design is usually based on the number of partitions in the topic and follows a one-to-one mapping between threads and the partitions within the topic. For example, if four partitions are defined for a topic, as a best practice, only four threads should be started in the consumer application to read the data; otherwise, some conflicting behavior, such as threads never receiving a message or a thread receiving messages from multiple partitions, may occur. Also, when a thread receives messages from multiple partitions, there is no guarantee that the messages will arrive in order across those partitions. For example, a thread may receive two messages from the first partition and three from the second partition, then three more from the first partition, followed by some more from the first partition, even if the second partition has data available.

Let's move further on.

Importing classes

As a first step, we need to import the following classes:

    import kafka.consumer.ConsumerConfig;
    import kafka.consumer.ConsumerIterator;
    import kafka.consumer.KafkaStream;
    import kafka.javaapi.consumer.ConsumerConnector;

Defining properties

As the next step, we need to define properties for making a connection with ZooKeeper and pass these properties to the Kafka consumer using the following code:

    Properties props = new Properties();
    props.put("zookeeper.connect", "localhost:2181");
    props.put("group.id", "testgroup");
    props.put("zookeeper.session.timeout.ms", "500");
    props.put("zookeeper.sync.time.ms", "250");
    props.put("auto.commit.interval.ms", "1000");
    // Keep a reference so the configuration can be passed to the consumer
    ConsumerConfig config = new ConsumerConfig(props);

The preceding properties have already been discussed in the previous example. For more details on Kafka consumer properties, refer to the last section of this article.
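The thread-to-partition guidance given at the start of this section is easier to see with numbers. The following Python sketch is a toy model, not Kafka's actual assignment algorithm: assign_partitions is a hypothetical helper that simply assumes a round-robin split of partitions across threads, which is enough to show idle threads and merged partitions:

```python
def assign_partitions(partitions, thread_count):
    """Hypothetical round-robin split of partitions across consumer threads."""
    assignment = {t: [] for t in range(thread_count)}
    for index, partition in enumerate(partitions):
        assignment[index % thread_count].append(partition)
    return assignment


partitions = [0, 1, 2, 3]

balanced = assign_partitions(partitions, 4)  # one partition per thread
too_many = assign_partitions(partitions, 6)  # more threads than partitions
too_few = assign_partitions(partitions, 2)   # fewer threads than partitions

# Threads that own no partition never receive a message.
idle_threads = [t for t, owned in too_many.items() if not owned]
```

With six threads and four partitions, two threads own nothing; with two threads, each one reads from two partitions, and ordering is only preserved within each partition.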
Reading the message from threads and printing it

The only difference in this section from the previous section is that we first create a thread pool and get the Kafka streams associated with each thread within the thread pool, as shown in the following code:

    // Define thread count for each topic
    topicMap.put(topic, new Integer(threadCount));

    // Here we have used a single topic but we can also add
    // multiple topics to topicMap
    Map<String, List<KafkaStream<byte[], byte[]>>> consumerStreamsMap =
        consumer.createMessageStreams(topicMap);

    List<KafkaStream<byte[], byte[]>> streamList = consumerStreamsMap.get(topic);

    // Launching the thread pool
    executor = Executors.newFixedThreadPool(threadCount);

The complete program listing for the multithreaded Kafka consumer based on the Kafka high-level consumer API is as follows:

    package kafka.examples.ch5;

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    import kafka.consumer.ConsumerConfig;
    import kafka.consumer.ConsumerIterator;
    import kafka.consumer.KafkaStream;
    import kafka.javaapi.consumer.ConsumerConnector;

    public class MultiThreadHLConsumer {

      private ExecutorService executor;
      private final ConsumerConnector consumer;
      private final String topic;

      public MultiThreadHLConsumer(String zookeeper, String groupId, String topic) {
        consumer = kafka.consumer.Consumer
            .createJavaConsumerConnector(createConsumerConfig(zookeeper, groupId));
        this.topic = topic;
      }

      private static ConsumerConfig createConsumerConfig(String zookeeper,
          String groupId) {
        Properties props = new Properties();
        props.put("zookeeper.connect", zookeeper);
        props.put("group.id", groupId);
        props.put("zookeeper.session.timeout.ms", "500");
        props.put("zookeeper.sync.time.ms", "250");
        props.put("auto.commit.interval.ms", "1000");

        return new ConsumerConfig(props);
      }

      public void shutdown() {
        if (consumer != null)
          consumer.shutdown();
        if (executor != null)
          executor.shutdown();
      }

      public void testMultiThreadConsumer(int threadCount) {
        Map<String, Integer> topicMap = new HashMap<String, Integer>();

        // Define thread count for each topic
        topicMap.put(topic, new Integer(threadCount));

        // Here we have used a single topic but we can also add
        // multiple topics to topicMap
        Map<String, List<KafkaStream<byte[], byte[]>>> consumerStreamsMap =
            consumer.createMessageStreams(topicMap);

        List<KafkaStream<byte[], byte[]>> streamList = consumerStreamsMap.get(topic);

        // Launching the thread pool
        executor = Executors.newFixedThreadPool(threadCount);

        // Submit one consumption task per stream
        int count = 0;
        for (final KafkaStream<byte[], byte[]> stream : streamList) {
          final int threadNumber = count;
          executor.submit(new Runnable() {
            public void run() {
              ConsumerIterator<byte[], byte[]> consumerIte = stream.iterator();
              while (consumerIte.hasNext())
                System.out.println("Thread Number " + threadNumber + ": "
                    + new String(consumerIte.next().message()));
              System.out.println("Shutting down Thread Number: " + threadNumber);
            }
          });
          count++;
        }
        // The consumer is not shut down here: doing so right after submitting
        // the tasks would stop the streams before the threads can consume.
        // main() calls shutdown() once the threads have had time to run.
      }

      public static void main(String[] args) {
        String zooKeeper = args[0];
        String groupId = args[1];
        String topic = args[2];
        int threadCount = Integer.parseInt(args[3]);
        MultiThreadHLConsumer multiThreadHLConsumer =
            new MultiThreadHLConsumer(zooKeeper, groupId, topic);
        multiThreadHLConsumer.testMultiThreadConsumer(threadCount);
        try {
          // Give the worker threads ten seconds to consume messages
          Thread.sleep(10000);
        } catch (InterruptedException ie) {
          // Restore the interrupt flag and fall through to shutdown
          Thread.currentThread().interrupt();
        }
        multiThreadHLConsumer.shutdown();
      }
    }
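The pattern in the listing (one worker per stream, with shutdown issued only after the workers have had a chance to run) can be tried without a Kafka cluster. The following Python sketch is a stand-in under stated assumptions: queue.Queue objects play the role of the KafkaStream objects, the STOP sentinel plays the role of consumer.shutdown(), and drain is a hypothetical worker function:

```python
import queue
from concurrent.futures import ThreadPoolExecutor

STOP = object()  # sentinel that plays the role of consumer.shutdown()


def drain(stream, thread_number):
    """Consume one stream until the stop sentinel arrives."""
    received = []
    while True:
        message = stream.get()  # blocks, much like hasNext() on a stream
        if message is STOP:
            break
        received.append("Thread Number %d: %s" % (thread_number, message))
    return received


# One queue per "stream" and one worker thread per queue.
streams = [queue.Queue() for _ in range(2)]
executor = ThreadPoolExecutor(max_workers=len(streams))
futures = [executor.submit(drain, s, n) for n, s in enumerate(streams)]

# Messages arrive while the workers are already running.
streams[0].put("a")
streams[1].put("b")
streams[0].put("c")

# Shut down only after the work is done: stop the streams, then the pool.
for s in streams:
    s.put(STOP)
executor.shutdown(wait=True)
results = [f.result() for f in futures]
```

Each worker sees its own stream in order, but nothing is implied about the relative order of messages across the two streams.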
Compile the preceding program, and before running it, read the following tip.

Before we run this program, we need to make sure our cluster is running as a multi-broker cluster (comprising either single or multiple nodes).

Once your multi-broker cluster is up, create a topic with four partitions and set the replication factor to 2 before running this program using the following command:

    [root@localhost kafka-0.8]# bin/kafka-topics.sh --zookeeper localhost:2181 --create --topic kafkatopic --partitions 4 --replication-factor 2

The Kafka consumer property list

The following table lists a few important properties that can be configured for high-level, consumer-API-based Kafka consumers. The Scala class kafka.consumer.ConsumerConfig provides implementation-level details for consumer configurations. For a complete list, visit http://kafka.apache.org/documentation.html#consumerconfigs.

| Property name | Description | Default value |
| --- | --- | --- |
| group.id | This property defines a unique identity for the set of consumers within the same consumer group. | |
| consumer.id | This property is specified for the Kafka consumer and generated automatically if not defined. | null |
| zookeeper.connect | This property specifies the ZooKeeper connection string, <hostname:port/chroot/path>. Kafka uses ZooKeeper to store offsets of messages consumed for a specific topic and partition by the consumer group. /chroot/path defines the data location in a global ZooKeeper namespace. | |
| client.id | The client.id value is specified by the Kafka client with each request and is used to identify the client making the requests. | ${group.id} |
| zookeeper.session.timeout.ms | This property defines the time (in milliseconds) for a Kafka consumer to wait for a ZooKeeper heartbeat before it is declared dead and a rebalance is initiated. | 6000 |
| zookeeper.connection.timeout.ms | This value defines the maximum waiting time (in milliseconds) for the client to establish a connection with ZooKeeper. | 6000 |
| zookeeper.sync.time.ms | This property defines the time it takes to sync a ZooKeeper follower with the ZooKeeper leader (in milliseconds). | 2000 |
| auto.commit.enable | This property enables a periodic commit to ZooKeeper of the offsets of messages that are already fetched by the consumer. In the event of consumer failures, these committed offsets are used as a starting position by the new consumers. | true |
| auto.commit.interval.ms | This property defines the frequency (in milliseconds) for the consumed offsets to get committed to ZooKeeper. | 60 * 1000 |
| auto.offset.reset | This property defines what to do if no initial offset is available in ZooKeeper or the offset is out of range. Possible values are largest (reset to the largest offset), smallest (reset to the smallest offset), and anything else (throw an exception). | largest |
| consumer.timeout.ms | If no message is available for consumption after the specified interval, a timeout exception is thrown to the consumer. | -1 |

Summary

In this article, we have learned how to write basic consumers and learned about some advanced levels of Java consumers that consume messages from partitions.

Packt
04 Mar 2015
18 min read

Prototyping Arduino Projects using Python

In this article by Pratik Desai, the author of Python Programming for Arduino, we will cover the following topics:

- Working with pyFirmata methods
- Servomotor – moving the motor to a certain angle
- The Button() widget – interfacing GUI with Arduino and LEDs

Working with pyFirmata methods

The pyFirmata package provides useful methods to bridge the gap between Python and Arduino's Firmata protocol. Although these methods are described with specific examples, you can use them in many different ways. This section also provides a detailed description of a few additional methods.

Setting up the Arduino board

To set up your Arduino board in a Python program using pyFirmata, follow the steps below. The code required for the setup process is split into small snippets, one per step. While writing your own code, use the snippets that are appropriate for your application. You can always refer to the example Python files containing the complete code. Before we go ahead, let's first make sure that your Arduino board is equipped with the latest version of the StandardFirmata program and is connected to your computer.

Depending upon the Arduino board that is being used, start by importing the appropriate pyFirmata class into the Python code. Currently, the inbuilt pyFirmata classes only support the Arduino Uno and Arduino Mega boards:

    from pyfirmata import Arduino

In the case of Arduino Mega, use the following line of code:

    from pyfirmata import ArduinoMega

Before we start executing any methods that are associated with handling pins, the Arduino board must be set up properly. To perform this task, we first have to identify the USB port to which the Arduino board is connected and assign this location to a variable in the form of a string object.
For Mac OS X, the port string should look approximately like this:

    port = '/dev/cu.usbmodemfa1331'

For Windows, use the following string structure:

    port = 'COM3'

In the case of the Linux operating system, use the following line of code:

    port = '/dev/ttyACM0'

The port's location might be different according to your computer configuration. You can identify the correct location of your Arduino USB port by using the Arduino IDE.

Once you have imported the Arduino class and assigned the port to a variable, it's time to connect the Arduino with pyFirmata and assign the resulting board object to another variable:

    board = Arduino(port)

Similarly, for Arduino Mega, use this:

    board = ArduinoMega(port)

The synchronization between the Arduino board and pyFirmata requires some time. Adding sleep time between the preceding assignment and the next set of instructions can help to avoid any issues that are related to serial port buffering. The easiest way to add sleep time is to use the inbuilt Python method, sleep(time):

    from time import sleep
    sleep(1)

The sleep() method takes seconds as the parameter, and a floating-point number can be used to provide the specific sleep time. For example, for 200 milliseconds, it will be sleep(0.2).

At this point, you have successfully synchronized your Arduino Uno or Arduino Mega board to the computer using pyFirmata. What if you want to use a different variant (other than Arduino Uno or Arduino Mega) of the Arduino board?

Any board layout in pyFirmata is defined as a dictionary object. The following is a sample of the dictionary object for the Arduino board:

    arduino = {
        'digital' : tuple(x for x in range(14)),
        'analog' : tuple(x for x in range(6)),
        'pwm' : (3, 5, 6, 9, 10, 11),
        'use_ports' : True,
        'disabled' : (0, 1) # Rx, Tx, Crystal
    }

For your variant of the Arduino board, you have to first create a custom dictionary object. To create this object, you need to know the hardware layout of your board.
For example, an Arduino Nano board has a layout similar to a regular Arduino board, but it has eight instead of six analog ports. Therefore, the preceding dictionary object can be customized as follows:

    nano = {
        'digital' : tuple(x for x in range(14)),
        'analog' : tuple(x for x in range(8)),
        'pwm' : (3, 5, 6, 9, 10, 11),
        'use_ports' : True,
        'disabled' : (0, 1) # Rx, Tx, Crystal
    }

As you have already synchronized the Arduino board earlier, modify the layout of the board using the setup_layout(layout) method:

    board.setup_layout(nano)

This command will modify the default layout of the synchronized Arduino board to the Arduino Nano layout or any other variant for which you have customized the dictionary object.

Configuring Arduino pins

Once your Arduino board is synchronized, it is time to configure the digital and analog pins that are going to be used as part of your program. The Arduino board has digital I/O pins and analog input pins that can be used to perform various operations. As we already know, some of these digital pins are also capable of PWM.

The direct method

Before we start writing or reading any data to these pins, we have to first assign modes to these pins. In an Arduino sketch, we use the pinMode function, for example pinMode(11, INPUT), for this operation. Similarly, in pyFirmata, this assignment is performed by setting the mode attribute of a pin on the board object, as shown in the following code snippet:

    from pyfirmata import Arduino
    from pyfirmata import INPUT, OUTPUT, PWM

    # Setting up Arduino board
    port = '/dev/cu.usbmodemfa1331'
    board = Arduino(port)

    # Assigning modes to digital pins
    board.digital[13].mode = OUTPUT
    board.analog[0].mode = INPUT

The pyFirmata library defines the INPUT and OUTPUT mode constants, which must be imported before you use them. The preceding example configures digital pin 13 as an output and analog pin 0 as an input.
The mode is set on the variable assigned to the configured Arduino board, using the digital[] and analog[] arrays to index the pin.

The pyFirmata library also supports additional modes such as PWM and SERVO. The PWM mode is used to get analog results from digital pins, while the SERVO mode lets a digital pin set the angle of a servo shaft between 0 and 180 degrees. If you are using any of these modes, import the corresponding constants from the pyFirmata library. Once they are imported from the pyFirmata package, the modes for the appropriate pins can be assigned using the following lines of code:

    board.digital[3].mode = PWM
    board.digital[10].mode = SERVO

Assigning pin modes

The direct method of configuring a pin is mostly used for single-line calls. In a project containing a large amount of code and complex logic, it is convenient to assign a pin with its role to a variable. With an assignment like this, you can later use the assigned variable throughout the program for various actions, instead of calling the direct method every time you need to use that pin. In pyFirmata, this assignment can be performed using the get_pin(pin_def) method:

    from pyfirmata import Arduino
    port = '/dev/cu.usbmodemfa1311'
    board = Arduino(port)

    # pin mode assignment
    ledPin = board.get_pin('d:13:o')

The get_pin() method lets you assign pin modes using the pin_def string parameter, 'd:13:o'. The three components of pin_def are the pin type, the pin number, and the pin mode, separated by a colon (:) character. The pin types (analog and digital) are denoted with a and d respectively. The get_pin() method supports three modes: i for input, o for output, and p for PWM. In the previous code sample, 'd:13:o' specifies digital pin 13 as an output. As another example, if you want to set up analog pin 1 as an input, the parameter string will be 'a:1:i'.

Working with pins

As you have configured your Arduino pins, it's time to start performing actions using them.
Two different types of methods are supported while working with pins: reporting methods and I/O operation methods.

Reporting data

When pins are configured in a program as analog input pins, they start sending input values to the serial port. If the program does not use this incoming data, the data gets buffered at the serial port and quickly overflows. The pyFirmata library provides the reporting and iterator methods to deal with this problem. The enable_reporting() method sets the input pin to start reporting. This method needs to be called before performing a read operation on the pin:

    board.analog[3].enable_reporting()

Once the read operation is complete, the pin can be set to disable reporting:

    board.analog[3].disable_reporting()

In the preceding example, we assumed that you have already set up the Arduino board and configured the mode of analog pin 3 as INPUT.

The pyFirmata library also provides the Iterator() class to read and handle data over the serial port. While working with analog pins, we recommend that you start an iterator thread in your main program so that the pin values are kept up to date. If the iterator is not used, the buffered data might overflow your serial port. This class is defined in the util module of the pyFirmata package and needs to be imported before it is used in the code:

    from pyfirmata import Arduino, util
    from time import sleep

    # Setting up the Arduino board
    port = 'COM3'
    board = Arduino(port)
    # Give pyFirmata and Arduino some time to synchronize
    sleep(5)

    # Start Iterator to avoid serial overflow
    it = util.Iterator(board)
    it.start()

Manual operations

As we have configured the Arduino pins with suitable modes and reporting behavior, we can start monitoring them. The pyFirmata library provides the write() and read() methods for the configured pins.

The write() method

The write() method is used to write a value to the pin.
If the pin's mode is set to OUTPUT, the value parameter is a Boolean, that is, 0 or 1:

    board.digital[pin].mode = OUTPUT
    board.digital[pin].write(1)

If you have used the alternative method of assigning the pin's mode, you can use the write() method as follows:

    ledPin = board.get_pin('d:13:o')
    ledPin.write(1)

In the case of a PWM signal, the Arduino accepts a value between 0 and 255 that represents a duty cycle between 0 and 100 percent. The pyFirmata library simplifies this: instead of a value between 0 and 255, you provide a float value between 0 and 1.0. For example, if you want a 50 percent duty cycle (2.5V analog value), you can specify 0.5 with the write() method. The pyFirmata library will take care of the translation and send the appropriate value, that is, about 127 on the 0 to 255 scale, to the Arduino board via the Firmata protocol:

    board.digital[pin].mode = PWM
    board.digital[pin].write(0.5)

Similarly, for the indirect method of assignment, you can use code similar to the following:

    # Pin 11 is one of the PWM-capable pins (3, 5, 6, 9, 10, 11)
    pwmPin = board.get_pin('d:11:p')
    pwmPin.write(0.5)

If you are using the SERVO mode, you need to provide the value in degrees between 0 and 180. At the time of writing, the SERVO mode is only available for direct assignment of the pins; support for indirect assignment is expected in the future:

    board.digital[pin].mode = SERVO
    board.digital[pin].write(90)

The read() method

The read() method returns the current value at the specified Arduino pin. When the Iterator() class is being used, the value received using this method is the latest value updated at the serial port. When you read a digital pin, you can get only one of two states, HIGH or LOW, which translate to 1 or 0 in Python:

    board.digital[pin].read()

The analog pins of Arduino linearly translate input voltages between 0 and +5V to values between 0 and 1023. However, in pyFirmata, voltages between 0 and +5V are linearly translated into float values between 0 and 1.0.
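The scaling described above can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic, assuming a 10-bit reading (0 to 1023) and a 5V reference; the function and constant names are hypothetical and are not taken from pyFirmata, whose own rounding may differ slightly.

```python
ADC_MAX = 1023   # largest raw reading of a 10-bit analog input
V_REF = 5.0      # assumed reference voltage, in volts


def raw_to_unit(raw):
    """Scale a raw 0-1023 reading to the 0.0-1.0 range."""
    return raw / float(ADC_MAX)


def unit_to_volts(value):
    """Convert a 0.0-1.0 reading back to an approximate voltage."""
    return value * V_REF


for raw in (0, 512, 1023):
    unit = raw_to_unit(raw)
    print(raw, round(unit, 4), round(unit_to_volts(unit), 2))
```

Going the other way, multiplying the float returned by read() by the reference voltage gives an approximate voltage at the pin.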
For example, if the voltage at the analog pin is 1V, an Arduino program will measure a value somewhere around 204, but you will receive the float value as 0.2 while using pyFirmata's read() method in Python. Servomotor – moving the motor to certain angle Servomotors are widely used electronic components in applications such as pan-tilt camera control, robotics arm, mobile robot movements, and so on where precise movement of the motor shaft is required. This precise control of the motor shaft is possible because of the position sensing decoder, which is an integral part of the servomotor assembly. A standard servomotor allows the angle of the shaft to be set between 0 and 180 degrees. The pyFirmata provides the SERVO mode that can be implemented on every digital pin. This prototyping exercise provides a template and guidelines to interface a servomotor with Python. Connections Typically, a servomotor has wires that are color-coded red, black and yellow, respectively to connect with the power, ground, and signal of the Arduino board. Connect the power and the ground of the servomotor to the 5V and the ground of the Arduino board. As displayed in the following diagram, connect the yellow signal wire to the digital pin 13: If you want to use any other digital pin, make sure that you change the pin number in the Python program in the next section. Once you have made the appropriate connections, let's move on to the Python program. The Python code The Python file consisting this code is named servoCustomAngle.py and is located in the code bundle of this book, which can be downloaded from https://www.packtpub.com/books/content/support/19610. Open this file in your Python editor. 
Like other examples, the starting section of the program contains the code to import the libraries and set up the Arduino board:

    from pyfirmata import Arduino, SERVO
    from time import sleep

    # Setting up the Arduino board
    port = 'COM5'
    board = Arduino(port)
    # Need to give some time to pyFirmata and Arduino to synchronize
    sleep(5)

Now that you have Python ready to communicate with the Arduino board, let's configure the digital pin that is going to be used to connect the servomotor to the Arduino board. We will complete this task by setting the mode of pin 13 to SERVO:

    # Set mode of the pin 13 as SERVO
    pin = 13
    board.digital[pin].mode = SERVO

The setServoAngle(pin, angle) custom function takes the pin to which the servomotor is connected and the custom angle as input parameters. This function can be used as a part of various large projects that involve servos:

    # Custom function to set the servo motor angle
    def setServoAngle(pin, angle):
        board.digital[pin].write(angle)
        sleep(0.015)

In the main logic of this template, we want to incrementally move the motor shaft in one direction until it reaches the maximum achievable angle (180 degrees) and then move it back to the original position at the same incremental speed. In the while loop, we will ask the user to provide input to continue this routine, which will be captured using the raw_input() function. The user can enter the character y to continue this routine or enter any other character to abort the loop:

    # Testing the function by rotating the motor in both directions
    while True:
        # Sweep from 0 up to and including 180 degrees
        for i in range(0, 181):
            setServoAngle(pin, i)
        # Sweep back down to and including 0 degrees
        for i in range(180, -1, -1):
            setServoAngle(pin, i)

        # Continue or break the testing process
        # raw_input() is Python 2; use input() on Python 3
        i = raw_input("Enter 'y' to continue or press Enter to quit: ")
        if i == 'y':
            pass
        else:
            board.exit()
            break

While working with all these prototyping examples, we used the direct communication method by using digital and analog pins to connect the sensor with Arduino.
Now, let's get familiar with another widely used communication method between Arduino and the sensors. This is called I2C communication. The Button() widget – interfacing GUI with Arduino and LEDs Now that you have had your first hands-on experience in creating a Python graphical interface, let's integrate Arduino with it. Python makes it easy to interface various heterogeneous packages within each other and that is what you are going to do. In the next coding exercise, we will use Tkinter and pyFirmata to make the GUI work with Arduino. In this exercise, we are going to use the Button() widget to control the LEDs interfaced with the Arduino board. Before we jump to the exercises, let's build the circuit that we will need for all upcoming programs. The following is a Fritzing diagram of the circuit where we use two different colored LEDs with pull up resistors. Connect these LEDs to digital pins 10 and 11 on your Arduino Uno board, as displayed in the following diagram: While working with the code provided in this section, you will have to replace the Arduino port that is used to define the board variable according to your operating system. Also, make sure that you provide the correct pin number in the code if you are planning to use any pins other than 10 and 11. For some exercises, you will have to use the PWM pins, so make sure that you have correct pins. You can use the entire code snippet as a Python file and run it. But, this might not be possible in the upcoming exercises due to the length of the program and the complexity involved. For the Button() widget exercise, open the exampleButton.py file. The code contains three main components: pyFirmata and Arduino configurations Tkinter widget definitions for a button The LED blink function that gets executed when you press the button As you can see in the following code snippet, we have first imported libraries and initialized the Arduino board using the pyFirmata methods. 
For this exercise, we are only going to work with one LED and we have initialized only the ledPin variable for it: import Tkinter import pyfirmata from time import sleep port = '/dev/cu.usbmodemfa1331' board = pyfirmata.Arduino(port) sleep(5) ledPin = board.get_pin('d:11:o') As we are using the pyFirmata library for all the exercises in this article, make sure that you have uploaded the latest version of the standard Firmata sketch on your Arduino board. In the second part of the code, we have initialized the root Tkinter widget as top and provided a title string. We have also fixed the size of this window using the minsize() method. In order to get more familiar with the root widget, you can play around with the minimum and maximum size of the window: top = Tkinter.Tk() top.title("Blink LED using button") top.minsize(300,30) The Button() widget is a standard Tkinter widget that is mostly used to obtain the manual, external input stimulus from the user. Like the Label() widget, the Button() widget can be used to display text or images. Unlike the Label() widget, it can be associated with actions or methods when it is pressed. When the button is pressed, Tkinter executes the methods or commands specified by the command option: startButton = Tkinter.Button(top,                              text="Start",                              command=onStartButtonPress) startButton.pack() In this initialization, the function associated with the button is onStartButtonPress and the "Start" string is displayed as the title of the button. Similarly, the top object specifies the parent or the root widget. Once the button is instantiated, you will need to use the pack() method to make it available in the main window. In the preceding lines of code, the onStartButonPress() function includes the scripts that are required to blink the LEDs and change the state of the button. A button state can have the state as NORMAL, ACTIVE, or DISABLED. 
If it is not specified, the default state of any button is NORMAL. The ACTIVE and DISABLED states are useful in applications where repeated pressing of the button needs to be avoided. After turning the LED on using the write(1) method, we will add a time delay of 5 seconds using the sleep(5) function before turning it off with the write(0) method:

    def onStartButtonPress():
        startButton.config(state=Tkinter.DISABLED)
        ledPin.write(1)
        # LED is on for the fixed amount of time specified below
        sleep(5)
        ledPin.write(0)
        startButton.config(state=Tkinter.ACTIVE)

At the end of the program, we will execute the mainloop() method to initiate the Tkinter loop. Until this function is executed, the main window won't appear.

To run the code, make appropriate changes to the Arduino board variable and execute the program. A window with a button and a title bar will appear as the output of the program. Clicking on the Start button will turn on the LED on the Arduino board for the specified time delay. While the LED is on, you will not be able to click on the Start button again. In this particular program, we haven't provided sufficient code to safely disengage the Arduino board; this will be covered in upcoming exercises.

Summary

In this article, we learned about the Python library pyFirmata, which interfaces Arduino with your computer using the Firmata protocol. We built a prototype using pyFirmata and Arduino to control a servomotor and also developed another one with a GUI, based on the Tkinter library, to control LEDs.

Joe Masilotti
04 Mar 2015
8 min read

Test Driving UITableViews with Cedar

One of the first things a developer does when learning iOS development is to display a list of items to the user. In iOS we use UITableViews to show one-dimensional tables of information. In practice they look like a long list of data and should be used in that way. UITableViews get their information from a UITableViewDataSource, which responds to a few methods that report the number of cells and what information the cells contain. This post will follow a step-by-step guide to test driving UITableViews in iOS.

All code samples will use the behavior-driven testing framework Cedar. Cedar can be installed as a Cocoapod by adding the following to your Podfile:

    target 'Specs' do
      pod 'Cedar'
    end

Follow this guide for installation and configuration instructions if you are having trouble or want a crash course on the framework.

Unit-Style Approach

One way to test table views is to follow a unit-style approach on the data source. The goal there is to call single public methods and assert that the correct state was altered or the return value was configured correctly. The target for unit testing a UITableView is its UITableViewDataSource property. The tests for this are fairly straightforward as they call -tableView:cellForRowAtIndexPath: and -tableView:numberOfRowsInSection: directly.

For example, let's say we want our controller to display a table with the current list of iPhones. Our mental assertions are that this table should show a single section with nine items, one for each of the iPhone, iPhone 3G, iPhone 3GS, iPhone 4, iPhone 4s, iPhone 5, iPhone 5s, iPhone 6, and iPhone 6 Plus.

The unit tests will follow a very similar pattern. Since a table defaults to one section we don't need to write a test asserting the number of sections. We can just test that there are nine cells and assume that if the text of the first and last cells is correct, everything is working.
describe(@"ViewController", ^{ __block ViewController *subject; beforeEach(^{ subject = [[ViewController alloc] init]; }); describe(@"-tableView:numberOfRowsInSection:", ^{ it(@"should have nine cells", ^{ [subject tableView:subject.tableView numberOfRowsInSection:0] should equal(9); }); }); describe(@"-tableView:cellForRowAtIndexPath:", ^{ __block UITableViewCell *cell; context(@"the first cell", ^{ beforeEach(^{ NSIndexPath *indexPath = [NSIndexPath indexPathForRow:0 inSection:0]; cell = [subject tableView:subject.tableView cellForRowAtIndexPath:indexPath]; }); it(@"should display 'iPhone'", ^{ cell.textLabel.text should equal(@"iPhone"); }); }); context(@"the last cell", ^{ beforeEach(^{ NSIndexPath *indexPath = [NSIndexPath indexPathForRow:8 inSection:0]; cell = [subject tableView:subject.tableView cellForRowAtIndexPath:indexPath]; }); it(@"should display 'iPhone 6 Plus'", ^{ cell.textLabel.text should equal(@"iPhone 6 Plus"); }); }); }); }); Now the good part about these tests is that they are easy to follow and straight to the point. When we ask how many items there are we expect the right amount. And when we want to ensure the first cell is set up correctly we test just that. Issues Unfortunately there are a few problems with this approach. The biggest issue is that we can get these tests to pass without actually displaying anything on the screen. A simple implementation of these two methods in our controller will make everything green but has no guarantee that a table view is on the screen (or that one even exists!). The first step in remedying this is to write a test asserting that the table view is a subview. Another, albeit minor, issue is we are breaking encapsulation; we are exposing that our controller conforms to the UITableViewDataSource protocol. Let's see what we can do about these two problems. Benefits Don't think that unit-style is bad, it just has different uses. 
If you have an app that uses multiple instances you will see benefits from this approach. This is because all you would need in your controller is to ensure the right type of data source was configured. You could take this one step farther by injecting the array of items to display and unit testing that. Then you have a repeatable unit of code that shows a list of data conforming to your app's specifications, which is quite powerful. Behavior-Driven Approach Let's take a more behavioral approach to our problem. Our goal is to display to the user the list of iPhones. If we care about what the user sees what is the closest way of replicating that? How about what cells are visible to the user? From Apple's documentation, -visibleCells on UITableView: Returns the table cells that are visible in the receiver. This sounds interesting. Let's restructure our tests to run assertions on the cells that the user sees, not some made up world of delegates and data sources. describe(@"when the view loads", ^{ beforeEach(^{ subject.view should_not be_nil; [subject.view layoutIfNeeded]; }); it(@"should display the first iPhone, first", ^{ UITableViewCell *firstCell = subject.tableView.visibleCells.firstObject; firstCell.textLabel.text should equal(@"iPhone"); }); it(@"display the iPhone 6 Plus, last", ^{ UITableViewCell *lastCell = subject.tableView.visibleCells.lastObject; lastCell.textLabel.text should equal(@"iPhone 6 Plus"); }); }); Note that in the beforeEach we assert that the view should exist. This is to kick off the controller's view lifecycle methods, namely -loadView and -viewDidLoad. We then tell its view to layout its subviews if need be. This ensures that anything we add as subviews have their layout constraints configured and applied. To get this to pass we have a few things to take care of. 
Create the backing array of iPhones Create the table view and add it as a subview Become the data source and respond to the calls The first one is easy so let's knock that out first. @interface ViewController () <UITableViewDataSource> @property (nonatomic) UITableView *tableView; @property (nonatomic, strong) NSArray *iPhones; @end @implementation ViewController - (instancetype)init { if (self = [super init]) { self.iPhones = @[ @"iPhone", @"iPhone 3G", @"iPhone 3GS", @"iPhone 4", @"iPhone 4s", @"iPhone 5", @"iPhone 5s", @"iPhone 6", @"iPhone 6 Plus" ]; } return self; } Note the opening up of the -tableView property in the interface extension. This allows us to keep it private in the header and the outside world while still being able to modify it internally. Next let's add the table view and its auto layout constraints. - (void)viewDidLoad { [super viewDidLoad]; self.tableView = [[UITableView alloc] init]; [self.view addSubview:self.tableView]; [self addTableViewConstraints]; } #pragma mark - Private - (void)addTableViewConstraints { self.tableView.translatesAutoresizingMaskIntoConstraints = NO; NSDictionary *views = @{ @"tableView": self.tableView }; [self.view addConstraints:[NSLayoutConstraint constraintsWithVisualFormat:@"V:|[tableView]|" options:kNilOptions metrics:nil views:views]]; [self.view addConstraints:[NSLayoutConstraint constraintsWithVisualFormat:@"H:|[tableView]|" options:kNilOptions metrics:nil views:views]]; } Since we aren't working with Storyboards or xibs/nibs we create the table view manually and add it as a subview. We also will need to add some simple auto layout constraints to have it fill the screen. Check out Apple's Auto Layout by Example guide if you would like a deeper explanation. Finally let's get to the meat of the issue and respond to the data source methods. 
    #pragma mark - <UITableViewDataSource>

    - (NSInteger)tableView:(UITableView *)tableView numberOfRowsInSection:(NSInteger)section {
        return self.iPhones.count;
    }

    - (UITableViewCell *)tableView:(UITableView *)tableView cellForRowAtIndexPath:(NSIndexPath *)indexPath {
        UITableViewCell *cell = [tableView dequeueReusableCellWithIdentifier:kCellIdentifier forIndexPath:indexPath];
        cell.textLabel.text = self.iPhones[indexPath.row];
        return cell;
    }

We also need to become the data source of the table, so do that and register the cell in -viewDidLoad.

    [self.tableView registerClass:[UITableViewCell class] forCellReuseIdentifier:kCellIdentifier];
    self.tableView.dataSource = self;

Finally add the constant to the top of the file.

    NSString * const kCellIdentifier = @"CellIdentifier";

What's interesting with this approach is that not until you have every line correct will the tests pass. This helps ensure that what is happening under spec is closer to the real experience of the app. For example, having a table view on the screen, responding to the delegate calls, but not assigning the delegate won't get you anywhere. In the unit approach you could have done just that but still seen your tests go green.

Benefits of Behavior Testing

When testing behavior you put yourself in a world that more closely represents the state of the app when a user is interacting with it. It also enables you to test collaboration between objects without having to single out every simple piece of the architecture. This means it can be easy to get carried away and start writing full integration tests from controllers. If you keep to only testing one or two layers of abstraction, in this case the table view through the delegate, your code and specs remain easy to read and understand.

A side effect of this approach is that it enabled us to hide some implementation details in the production code. This means we are freer to do a green-to-green refactor without having to change our specs.
For example, we could extract the UITableViewDataSource into its own object and know that it works correctly when all of the existing tests still pass. If we wanted to then reuse that collaborator we could then extract the specs and have it stand on its own. Or if our backing array turned into an NSDictionary and found everything by key nothing in our tests would have to change. There are many styles of testing and even more ways to test Objective-C code and the Cocoa Touch framework. Behavior testing is just one approach that has proved to be the most flexible and easy to understand for me. What other techniques and methods have you implemented to ensure code coverage on your own iOS apps? About the author Joe Masilotti is a test-driven iOS developer living in Brooklyn, NY. He contributes to open-source testing tools on GitHub and talks about development, cooking, and craft beer on Twitter.

Packt
04 Mar 2015
22 min read

Python functions – Avoid repeating code

In this article by Silas Toms, author of the book ArcPy and ArcGIS – Geospatial Analysis with Python we will see how programming languages share a concept that has aided programmers for decades: functions. The idea of a function, loosely speaking, is to create blocks of code that will perform an action on a piece of data, transforming it as required by the programmer and returning the transformed data back to the main body of code. Functions are used because they solve many different needs within programming. Functions reduce the need to write repetitive code, which in turn reduces the time needed to create a script. They can be used to create ranges of numbers (the range() function), or to determine the maximum value of a list (the max function), or to create a SQL statement to select a set of rows from a feature class. They can even be copied and used in another script or included as part of a module that can be imported into scripts. Function reuse has the added bonus of making programming more useful and less of a chore. When a scripter starts writing functions, it is a major step towards making programming part of a GIS workflow. (For more resources related to this topic, see here.) Technical definition of functions Functions, also called subroutines or procedures in other programming languages, are blocks of code that have been designed to either accept input data and transform it, or provide data to the main program when called without any input required. In theory, functions will only transform data that has been provided to the function as a parameter; it should not change any other part of the script that has not been included in the function. To make this possible, the concept of namespaces is invoked. Namespaces make it possible to use a variable name within a function, and allow it to represent a value, while also using the same variable name in another part of the script. 
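As a minimal sketch of the namespace idea just described, the following snippet uses the same variable name inside and outside a function. The names label and describe() are made up for this illustration and do not come from the book's script.

```python
label = "global value"   # a name in the script's (global) namespace


def describe():
    # This assignment creates a separate local name; the global is untouched.
    label = "local value"
    return label


print(describe())  # local value
print(label)       # global value
```

The assignment inside describe() never changes the value seen by the rest of the script, which is what allows a function to be copied into another script without name clashes.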
This becomes especially important when importing modules from other programmers; within that module and its functions, the variables that it contains might have a variable name that is the same as a variable name within the main script. In a high-level programming language such as Python, there is built-in support for functions, including the ability to define function names and the data inputs (also known as parameters). Functions are created using the keyword def plus a function name, along with parentheses that may or may not contain parameters. Parameters can also be defined with default values, so parameters only need to be passed to the function when they differ from the default. The values that are returned from the function are also easily defined. A first function Let's create a function to get a feel for what is possible when writing functions. First, we need to invoke the function by providing the def keyword and providing a name along with the parentheses. The firstFunction() will return a string when called: def firstFunction():    'a simple function returning a string'    return "My First Function" >>>firstFunction() The output is as follows: 'My First Function' Notice that this function has a documentation string or doc string (a simple function returning a string) that describes what the function does; this string can be called later to find out what the function does, using the __doc__ internal function: >>>print firstFunction.__doc__ The output is as follows: 'a simple function returning a string' The function is defined and given a name, and then the parentheses are added followed by a colon. The following lines must then be indented (a good IDE will add the indention automatically). The function does not have any parameters, so the parentheses are empty. The function then uses the keyword return to return a value, in this case a string, from the function. Next, the function is called by adding parentheses to the function name. 
When it is called, it will do what it has been instructed to do: return the string.

Functions with parameters

Now let's create a function that accepts parameters and transforms them as needed. This function will accept a number and multiply it by 3:

    def secondFunction(number):
        'this function multiplies numbers by 3'
        return number * 3

    >>> secondFunction(4)

The output is as follows:

    12

The function has one flaw, however; there is no assurance that the value passed to the function is a number. We need to add a conditional to the function so that it does not throw an exception or return an unexpected result (a string, for instance, would be repeated three times):

    def secondFunction(number):
        'this function multiplies numbers by 3'
        if type(number) == type(1) or type(number) == type(1.0):
            return number * 3

    >>> secondFunction(4.0)

The output is as follows:

    12.0

    >>> secondFunction(4)

The output is as follows:

    12

    >>> secondFunction("String")
    >>>

The function now accepts a parameter, checks what type of data it is, and returns a multiple of the parameter whether it is an integer or a float. If it is a string or some other data type, as shown in the last example, no value is returned.

There is one more adjustment to the simple function that we should discuss: parameter defaults. By including default values in the definition of the function, we avoid having to provide parameters that rarely change. If, for instance, we wanted a different multiplier than 3 in the simple function, we would define it like this:

    def thirdFunction(number, multiplier=3):
        'this function multiplies numbers by a multiplier (3 by default)'
        if type(number) == type(1) or type(number) == type(1.0):
            return number * multiplier

    >>> thirdFunction(4)

The output is as follows:

    12

    >>> thirdFunction(4, 5)

The output is as follows:

    20

The function will work when only the number to be multiplied is supplied, as the multiplier has a default value of 3. However, if we need another multiplier, the value can be adjusted by adding another value when calling the function.
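As a variation on the type check above, offered as a sketch rather than the author's method, the check can be written with isinstance() and can raise a TypeError instead of silently returning nothing. The function name scale() is hypothetical, and the last call shows the multiplier being supplied by name.

```python
def scale(number, multiplier=3):
    """Multiply a number by a multiplier (3 unless told otherwise)."""
    # bool is a subclass of int, so True and False would also pass this check.
    if not isinstance(number, (int, float)):
        raise TypeError("number must be an int or a float")
    return number * multiplier


print(scale(4))                # 12
print(scale(4, 5))             # 20
print(scale(4, multiplier=2))  # 8

try:
    scale("String")
except TypeError as error:
    print(error)               # number must be an int or a float
```

Raising an exception makes a bad input visible at the point of the call, whereas returning nothing can let the problem surface much later in a script.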
Note that the multiplier passed to thirdFunction() doesn't have to be a number, as there is no type checking on it. Also, parameters with default values must follow the parameters with no defaults (or all parameters can have a default value, and the parameters can be supplied to the function in order or by name).

Using functions to replace repetitive code

One of the main uses of functions is to ensure that the same code does not have to be written over and over. The first portion of the script that we can convert into a function is the sequence of three ArcPy tool calls. Doing so will allow the script to be applicable to any of the stops in the Bus Stop feature class and have an adjustable buffer distance:

    bufferDist = 400
    buffDistUnit = "Feet"
    lineName = '71 IB'
    busSignage = 'Ferry Plaza'
    sqlStatement = "NAME = '{0}' AND BUS_SIGNAG = '{1}'"

    def selectBufferIntersect(selectIn, selectOut, bufferOut,
                              intersectIn, intersectOut, sqlStatement,
                              bufferDist, buffDistUnit, lineName, busSignage):
        'a function to perform a bus stop analysis'
        arcpy.Select_analysis(selectIn, selectOut,
                              sqlStatement.format(lineName, busSignage))
        # Combine the distance and the unit, for example "400 Feet"
        arcpy.Buffer_analysis(selectOut, bufferOut,
                              "{0} {1}".format(bufferDist, buffDistUnit),
                              "FULL", "ROUND", "NONE", "")
        arcpy.Intersect_analysis("{0} #;{1} #".format(bufferOut, intersectIn),
                                 intersectOut, "ALL", "", "INPUT")
        return intersectOut

This function demonstrates how the analysis can be adjusted to accept the input and output feature class variables as parameters, along with some new variables. The function adds a variable to replace the SQL statement and variables to adjust the bus stop, and also tweaks the buffer distance statement so that both the distance and the unit can be adjusted. The feature class name variables, defined earlier in the script, have all been replaced with local variable names; while the global variable names could have been retained, doing so would reduce the portability of the function.
The next function will accept the result of the selectBufferIntersect() function and search it using the Search Cursor, passing the results into a dictionary. The dictionary will then be returned from the function for later use:

    def createResultDic(resultFC):
        'search results of analysis and create results dictionary'
        dataDictionary = {}
        with arcpy.da.SearchCursor(resultFC, ["STOPID", "POP10"]) as cursor:
            for row in cursor:
                busStopID = row[0]
                pop10 = row[1]
                if busStopID not in dataDictionary.keys():
                    dataDictionary[busStopID] = [pop10]
                else:
                    dataDictionary[busStopID].append(pop10)
        return dataDictionary

This function only requires one parameter: the feature class returned from the selectBufferIntersect() function. The dictionary that holds the results is first created, then populated by the search cursor, with the STOPID attribute used as a key, and the census block population attribute added to a list assigned to the key.

The dictionary, having been populated with the grouped data, is returned from the function for use in the final function, createCSV(). This function accepts the dictionary and the name of the output CSV file as a string:

    def createCSV(dictionary, csvname):
        'a function takes a dictionary and creates a CSV file'
        with open(csvname, 'wb') as csvfile:
            csvwriter = csv.writer(csvfile, delimiter=',')
            for busStopID in dictionary.keys():
                popList = dictionary[busStopID]
                averagePop = sum(popList)/len(popList)
                data = [busStopID, averagePop]
                csvwriter.writerow(data)

The final function creates the CSV using the csv module. The name of the file, a string, is now a customizable parameter (meaning the file name can be any valid file path that ends with the .csv extension).
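The grouping and averaging logic of these two functions can be tried without ArcGIS. The following is a rough sketch under stated assumptions: it is written for Python 3 (the article's code targets Python 2), the rows are made-up (stop ID, population) pairs standing in for the search cursor results, and the names group_by_stop() and write_averages() are hypothetical.

```python
import csv
import io

# Made-up (stop ID, population) pairs standing in for the search cursor rows.
rows = [(101, 40), (101, 60), (202, 30)]


def group_by_stop(rows):
    """Collect the population values for each stop ID into a list."""
    grouped = {}
    for stop_id, population in rows:
        grouped.setdefault(stop_id, []).append(population)
    return grouped


def write_averages(grouped, handle):
    """Write one 'stop ID, average population' row per stop."""
    writer = csv.writer(handle, delimiter=',')
    for stop_id in sorted(grouped):
        values = grouped[stop_id]
        writer.writerow([stop_id, sum(values) / float(len(values))])


output = io.StringIO()   # in-memory text file, so nothing is written to disk
write_averages(group_by_stop(rows), output)
print(output.getvalue())
```

Passing an open file handle rather than a file name is a design choice made for this sketch; it lets the same function write to a real file or to an in-memory buffer for testing.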
The csvfile variable is passed to the csv module's writer() function and the result is assigned to the variable csvwriter. The dictionary is then accessed and processed, and each result is passed as a list to csvwriter to be written to the CSV file. The csv.writer() function processes each item in the list into the CSV format and saves the final result. Open the CSV file with Excel or a text editor such as Notepad to view the results.
To run the functions, we will call them in the script following the function definitions:
    analysisResult = selectBufferIntersect(Bus_Stops, Inbound71,
                                           Inbound71_400ft_buffer,
                                           CensusBlocks2010, Intersect71Census,
                                           bufferDist, lineName, busSignage)
    dictionary = createResultDic(analysisResult)
    createCSV(dictionary, r'C:\Projects\Output\Averages.csv')
Now, the script has been divided into three functions, which replace the code of the first modified script. The modified script looks like this:
    # -*- coding: utf-8 -*-
    # ---------------------------------------------------------------------------
    # 8662_Chapter4Modified1.py
    # Created on: 2014-04-22 21:59:31.00000
    #   (generated by ArcGIS/ModelBuilder)
    # Description:
    # Adjusted by Silas Toms
    # 2014 05 05
    # ---------------------------------------------------------------------------
    # Import arcpy module
    import arcpy
    import csv
    # Local variables:
    Bus_Stops = r"C:\Projects\PacktDB.gdb\SanFrancisco\Bus_Stops"
    CensusBlocks2010 = r"C:\Projects\PacktDB.gdb\SanFrancisco\CensusBlocks2010"
    Inbound71 = r"C:\Projects\PacktDB.gdb\Chapter3Results\Inbound71"
    Inbound71_400ft_buffer = r"C:\Projects\PacktDB.gdb\Chapter3Results\Inbound71_400ft_buffer"
    Intersect71Census = r"C:\Projects\PacktDB.gdb\Chapter3Results\Intersect71Census"
    bufferDist = 400
    lineName = '71 IB'
    busSignage = 'Ferry Plaza'
    def selectBufferIntersect(selectIn, selectOut, bufferOut, intersectIn,
                              intersectOut, bufferDist, lineName, busSignage):
        arcpy.Select_analysis(selectIn,
                              selectOut,
                              "NAME = '{0}' AND BUS_SIGNAG = '{1}'".format(lineName, busSignage))
        arcpy.Buffer_analysis(selectOut,
                              bufferOut,
                              "{0} Feet".format(bufferDist),
                              "FULL", "ROUND", "NONE", "")
        arcpy.Intersect_analysis("{0} #;{1} #".format(bufferOut, intersectIn),
                                 intersectOut, "ALL", "", "INPUT")
        return intersectOut
    def createResultDic(resultFC):
        dataDictionary = {}
        with arcpy.da.SearchCursor(resultFC,
                                   ["STOPID","POP10"]) as cursor:
            for row in cursor:
                busStopID = row[0]
                pop10 = row[1]
                if busStopID not in dataDictionary.keys():
                    dataDictionary[busStopID] = [pop10]
                else:
                    dataDictionary[busStopID].append(pop10)
        return dataDictionary
    def createCSV(dictionary, csvname):
        with open(csvname, 'wb') as csvfile:
            csvwriter = csv.writer(csvfile, delimiter=',')
            for busStopID in dictionary.keys():
                popList = dictionary[busStopID]
                averagePop = sum(popList)/len(popList)
                data = [busStopID, averagePop]
                csvwriter.writerow(data)
    analysisResult = selectBufferIntersect(Bus_Stops, Inbound71,
                                           Inbound71_400ft_buffer,
                                           CensusBlocks2010, Intersect71Census,
                                           bufferDist, lineName, busSignage)
    dictionary = createResultDic(analysisResult)
    createCSV(dictionary, r'C:\Projects\Output\Averages.csv')
    print "Data Analysis Complete"
Further generalization of the functions
While we have created functions from the original script that can be used to extract more data about bus stops in San Francisco, our new functions are still very specific to the dataset and analysis for which they were created. Such specific functions can still be very useful for a long and laborious analysis for which creating reusable functions is not necessary. The first use of functions is to get rid of the need to repeat code. The next goal is to then make that code reusable.
Let's discuss some ways in which we can convert the functions from one-offs into reusable functions or even modules. First, let's examine the first function:
    def selectBufferIntersect(selectIn, selectOut, bufferOut, intersectIn,
                              intersectOut, bufferDist, lineName, busSignage):
        arcpy.Select_analysis(selectIn,
                              selectOut,
                              "NAME = '{0}' AND BUS_SIGNAG = '{1}'".format(lineName, busSignage))
        arcpy.Buffer_analysis(selectOut,
                              bufferOut,
                              "{0} Feet".format(bufferDist),
                              "FULL", "ROUND", "NONE", "")
        arcpy.Intersect_analysis("{0} #;{1} #".format(bufferOut, intersectIn),
                                 intersectOut, "ALL", "", "INPUT")
        return intersectOut
This function appears to be pretty specific to the bus stop analysis. It's so specific, in fact, that while there are a few ways in which we can tweak it to make it more general (that is, useful in other scripts that might not have the same steps involved), we should not convert it into a separate function. When we create a separate function, we introduce too many variables into the script in an effort to simplify it, which is a counterproductive effort. Instead, let's focus on ways to generalize the ArcPy tools themselves.
The first step will be to split the three ArcPy tools and examine what can be adjusted with each of them. The Select tool should be adjusted to accept a string as the SQL select statement. The SQL statement can then be generated by another function or by parameters accepted at runtime. For instance, if we wanted to make the script accept multiple bus stops for each run of the script (for example, the inbound and outbound stops for each line), we could create a function that would accept a list of the desired stops and a SQL template, and would return a SQL statement to plug into the Select tool.
Here is an example of how it would look:
    def formatSQLIN(dataList, sqlTemplate):
        'a function to generate a SQL statement'
        sql = sqlTemplate #"OBJECTID IN "
        step = "("
        for data in dataList:
            # separate the values with commas so the IN list is valid SQL
            step += str(data) + ","
        sql += step.rstrip(",") + ")"
        return sql
    def formatSQL(dataList, sqlTemplate):
        'a function to generate a SQL statement'
        sql = ''
        for count, data in enumerate(dataList):
            if count != len(dataList)-1:
                sql += sqlTemplate.format(data) + ' OR '
            else:
                sql += sqlTemplate.format(data)
        return sql
    >>> dataVals = [1,2,3,4]
    >>> sqlOID = "OBJECTID = {0}"
    >>> sql = formatSQL(dataVals, sqlOID)
    >>> print sql
The output is as follows:
    OBJECTID = 1 OR OBJECTID = 2 OR OBJECTID = 3 OR OBJECTID = 4
This new function, formatSQL(), is a very useful function. Let's review what it does by comparing the function to the results following it. The function is defined to accept two parameters: a list of values and a SQL template. The first local variable is the empty string sql, which will be added to using string addition. The function is designed to insert the values into the variable sql, creating a SQL statement by taking the SQL template and using string formatting to add them to the template, which in turn is added to the SQL statement string (note that sql += is equivalent to sql = sql +). Also, an operator (OR) is used to make the SQL statement inclusive of all data rows that match the pattern. This function uses the built-in enumerate function to count the iterations of the list; once it has reached the last value in the list, the operator is not added to the SQL statement.
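The same statements can also be built with str.join(), which removes the need to track the last item with enumerate. This is only an alternative sketch, not the article's implementation: the names formatSQLJoin and formatSQLINJoin are hypothetical, and the IN helper assumes numeric values that need no quoting.

```python
def formatSQLJoin(dataList, sqlTemplate):
    'a hypothetical join-based variant of formatSQL()'
    # Format each value into the template, then join the pieces with OR
    return " OR ".join(sqlTemplate.format(data) for data in dataList)


def formatSQLINJoin(dataList, fieldName):
    'a hypothetical helper that builds an IN clause for numeric values'
    values = ", ".join(str(data) for data in dataList)
    return "{0} IN ({1})".format(fieldName, values)


dataVals = [1, 2, 3, 4]
sqlOR = formatSQLJoin(dataVals, "OBJECTID = {0}")
sqlIN = formatSQLINJoin(dataVals, "OBJECTID")
print(sqlOR)
print(sqlIN)
```

Because join() only places the separator between items, a single-item list produces a statement with no trailing operator.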
Note that we could also add one more parameter to formatSQL() to make it possible to use an AND operator instead of OR, while still keeping OR as the default:
    def formatSQL2(dataList, sqlTemplate, operator=" OR "):
        'a function to generate a SQL statement'
        sql = ''
        for count, data in enumerate(dataList):
            if count != len(dataList)-1:
                sql += sqlTemplate.format(data) + operator
            else:
                sql += sqlTemplate.format(data)
        return sql
    >>> sql = formatSQL2(dataVals, sqlOID," AND ")
    >>> print sql
The output is as follows:
    OBJECTID = 1 AND OBJECTID = 2 AND OBJECTID = 3 AND OBJECTID = 4
While it would make no sense to use an AND operator on ObjectIDs, there are other cases where it would make sense, hence leaving OR as the default while allowing for AND. Either way, this function can now be used to generate our bus stop SQL statement for multiple stops (ignoring, for now, the bus signage field):
    >>> sqlTemplate = "NAME = '{0}'"
    >>> lineNames = ['71 IB','71 OB']
    >>> sql = formatSQL2(lineNames, sqlTemplate)
    >>> print sql
The output is as follows:
    NAME = '71 IB' OR NAME = '71 OB'
However, we can't ignore the Bus Signage field for the inbound line, as there are two starting points for the line, so we will need to adjust the function to accept multiple values:
    def formatSQLMultiple(dataList, sqlTemplate, operator=" OR "):
        'a function to generate a SQL statement'
        sql = ''
        for count, data in enumerate(dataList):
            if count != len(dataList)-1:
                sql += sqlTemplate.format(*data) + operator
            else:
                sql += sqlTemplate.format(*data)
        return sql
    >>> sqlTemplate = "(NAME = '{0}' AND BUS_SIGNAG = '{1}')"
    >>> lineNames = [('71 IB', 'Ferry Plaza'),('71 OB','48th Avenue')]
    >>> sql = formatSQLMultiple(lineNames, sqlTemplate)
    >>> print sql
The output is as follows:
    (NAME = '71 IB' AND BUS_SIGNAG = 'Ferry Plaza') OR (NAME = '71 OB' AND BUS_SIGNAG = '48th Avenue')
The slight difference in this function, the asterisk before the data variable, allows the values inside the data variable to be correctly formatted into the SQL template by unpacking the values within the tuple, so that each value is passed to format() as a separate argument. Notice that the SQL template has been created to segregate each conditional by using parentheses. The function(s) are now ready for reuse, and the SQL statement is now ready for insertion into the Select tool:
    sql = formatSQLMultiple(lineNames, sqlTemplate)
    arcpy.Select_analysis(Bus_Stops, Inbound71, sql)
Next up is the Buffer tool. We have already taken steps towards making it generalized by adding a variable for the distance. In this case, we will only add one more variable to it, a unit variable that will make it possible to adjust the buffer unit from feet to meters or any other allowed unit. We will leave the other defaults alone. Here is an adjusted version of the Buffer tool:
    bufferDist = 400
    bufferUnit = "Feet"
    arcpy.Buffer_analysis(Inbound71,
                          Inbound71_400ft_buffer,
                          "{0} {1}".format(bufferDist, bufferUnit),
                          "FULL", "ROUND", "NONE", "")
Now, both the buffer distance and the buffer unit are controlled by variables, which makes them easy to adjust if it is decided that the distance was not sufficient.
The next step towards adjusting the ArcPy tools is to write a function that will allow any number of feature classes to be intersected together using the Intersect tool. This new function will be similar to the formatSQL functions shown previously, as it will use string formatting and addition to allow a list of feature classes to be processed into the correct string format for the Intersect tool to accept them.
However, as this function will be built to be as general as possible, it must be designed to accept any number of feature classes to be intersected:
    def formatIntersect(features):
        'a function to generate an intersect string'
        formatString = ''
        for count, feature in enumerate(features):
            if count != len(features)-1:
                formatString += feature + " #;"
            else:
                formatString += feature + " #"
        # return only after every feature has been added to the string
        return formatString
    >>> shpNames = ["example.shp","example2.shp"]
    >>> iString = formatIntersect(shpNames)
    >>> print iString
The output is as follows:
    example.shp #;example2.shp #
Now that we have written the formatIntersect() function, all that needs to be created is a list of the feature classes to be passed to the function. The string returned by the function can then be passed to the Intersect tool:
    intersected = [Inbound71_400ft_buffer, CensusBlocks2010]
    iString = formatIntersect(intersected)
    # Process: Intersect
    arcpy.Intersect_analysis(iString,
                             Intersect71Census, "ALL", "", "INPUT")
Because we avoided creating a function that only fits this script or analysis, we now have two (or more) useful functions that can be applied in later analyses, and we know how to manipulate the ArcPy tools to accept the data that we want to supply to them.
Summary
In this article, we discussed how to take autogenerated code and make it generalized, while adding functions that can be reused in other scripts and will make the generation of the necessary code components, such as SQL statements, much easier.
Resources for Article:
Further resources on this subject:
  • Enterprise Geodatabase [article]
  • Adding Graphics to the Map [article]
  • Image classification and feature extraction from images [article]

Packt
04 Mar 2015
20 min read

iOS Security Overview

In this article by Allister Banks and Charles S. Edge, the authors of the book Learning iOS Security, we will go through an overview of the basic security measures available in iOS. Out of the box, iOS is one of the most secure operating systems available. There are a number of factors that contribute to the elevated security level. These include the fact that users cannot access the underlying operating system. Apps also keep their data in a silo (sandbox), so instead of accessing the system's internals they can access only the silo. App developers choose whether to store settings such as passwords in the app or in iCloud Keychain, which is a secure location for such data on a device. Finally, Apple has a number of controls in place on devices to help protect users while providing an elegant user experience. However, devices can be made even more secure than they are now. In this article, we're going to get some basic security tasks under our belt in order to establish some basic security best practices. Where we feel more explanation is needed about what we did on devices, we'll explore a part of the technology itself in this article.
This article will cover the following topics:
  • Pairing
  • Backing up your device
  • Initial security checklist
  • Safari and built-in app protection
  • Predictive search and spotlight
(For more resources related to this topic, see here.)
To kick off the overview of iOS security, we'll quickly secure our systems by initially providing a simple checklist of tasks, where we'll configure a few device protections that we feel everyone should use. Then, we'll look at how to take a backup of our devices and finally, at how to use the built-in web browser and the protections around it.
Pairing
When you connect a device to a computer that runs iTunes for the first time, you are prompted to enter a password. Doing so allows you to synchronize the device to a computer. Applications that can communicate over this channel include iTunes, iPhoto, Xcode, and others.
To pair a device to a Mac, simply plug the device in (if you have a passcode, you'll need to enter it in order to pair the device). When the device is plugged in, you'll be prompted on both the device and the computer to establish a trust. Simply tap on Trust on the iOS device, as shown in the following screenshot:
[Screenshot: Trusting a computer]
For the computer to communicate with the iOS device, you'll also need to accept the pairing on your computer (although pairing with the command-line tools from libimobiledevice does not require this step, because you accept the pairing at the command line). When prompted, click on Continue to establish the pairing, as seen in the following screenshot (the prompt is the same in Windows):
[Screenshot: Trusting a device]
When a device is paired, a file is created in /var/db/lockdown, named with the UDID of the device and a property list (plist) extension. A property list is an Apple XML file that stores a variety of attributes. In Windows, iOS data is stored in the MobileSync folder, which you can access by navigating to Users\(username)\AppData\Roaming\Apple Computer\MobileSync. The information in this file sets up a trust between the computer and the device and includes the following attributes:
  • DeviceCertificate: This certificate is unique to each device.
  • EscrowBag: The EscrowBag key bag contains the class keys used to decrypt the device.
  • HostCertificate: This is the certificate for the host that is paired with iOS devices (usually the same in all the files for devices that you've paired with your computer).
  • HostID: This is a generated ID for the host.
  • HostPrivateKey: This is the private key for your Mac (should be the same in all files on a given computer).
  • RootCertificate: This is the certificate used to generate keys (should be the same in all files on a given computer).
  • RootPrivateKey: This is the private key of the computer that runs iTunes for that device.
  • SystemBUID: This refers to the ID of the computer that runs iTunes.
  • WiFiMACAddress: This is the MAC address of the Wi-Fi interface of the device that is paired to the computer. If you do not have an active Wi-Fi interface, the MAC address is still used while pairing.
Why does this matter? It's important to know how a device interfaces with a computer. These files can be moved between computers and contain a variety of information about a device, including private keys. Having keys isn't all that is required for a computer to communicate with a device. When a device is interfacing with a computer over USB, if you have a passcode enabled on the device, you will be required to enter that passcode in order to unlock the device. Once a computer is able to communicate with a device, you need to be careful, as the backups of a device, apps that get synchronized to a device, and other data that gets exchanged with a device can be exposed while at rest on devices.
Backing up your device
What do most people do to maximize the security of iOS devices? Before we do anything, we need to take a backup of our devices. This protects the device from us by providing a restore point. This also secures the data from the possibility of losing it through a silly mistake. There are two commonly used ways to take backups: iCloud and iTunes. As the names imply, the first stores backups on Apple's cloud service and the second on desktop computers. We'll cover how to take a backup on iCloud first.
iCloud backups
An iCloud account comes with free storage to back up your Apple devices. An iOS device takes a backup to Apple servers and can be restored from those same servers when a new device is set up (the restore option is a screen that appears during the activation process of a new device; it also appears as an option in iTunes if you back up to iTunes over USB, which is covered later in this article). Setting up and checking the status of iCloud backups is a straightforward process. From the Settings app, tap on iCloud and then Backup.
As you can see from the Backup screen, you have two options: iCloud Backup, which enables automatic backups of the device to your iCloud account, and Back Up Now, which runs an immediate backup of the device.
[Screenshot: iCloud backups]
Allowing iCloud to take backups on devices is optional. You can disable access to iCloud and iCloud backups. However, doing so is rarely a good idea, as you are limiting the functionality of the device and putting the data on your device at risk if that data isn't backed up another way, such as through iTunes. Many people have reservations about storing data on public clouds, especially data as private as phone data (texts, phone call history, and so on). For more information on Apple's security and privacy around iCloud, refer to http://support.apple.com/en-us/HT202303. If you do not trust Apple or its cloud, then you can also take a backup of your device using iTunes, as described in the next section.
Taking backups using iTunes
Originally, iTunes was used to take backups for iOS devices. You can still use iTunes, and it's likely you will have a second backup there even if you are using iCloud, simply for a quick restore if nothing else. Backups are usually pretty small. The reason is that the operating system is not part of backups, since users can't edit any of those files. Therefore, you can use an ipsw file (the operating system) to restore a device. These files are accessed through Apple Configurator or through iTunes if you have a restore file waiting to be installed. They can be seen in ~/Library/iTunes, under a folder named for the device and its software updates, as can be seen in the following screenshot:
[Screenshot: IPSW files]
Backups are stored in the ~/Library/Application Support/MobileSync/Backup directory. Here, you'll see a number of directories that are associated with the UDID of the devices, and within those, you'll see a number of files that make up the modular incremental backups beyond the initial backup.
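The directory layout described above can be inspected with a short script. What follows is a minimal sketch rather than a forensic tool: the function name list_backups is made up for illustration, and the demonstration runs against a throwaway directory with placeholder names so that no real backup is read.

```python
import os
import tempfile


def list_backups(backup_root):
    'return (directory name, file count) pairs for each backup under backup_root'
    backups = []
    for name in sorted(os.listdir(backup_root)):
        path = os.path.join(backup_root, name)
        if os.path.isdir(path):
            # Count every file below the backup directory
            file_count = sum(len(files) for _, _, files in os.walk(path))
            backups.append((name, file_count))
    return backups


# Demonstration against a temporary directory with made-up names
demo_root = tempfile.mkdtemp()
os.mkdir(os.path.join(demo_root, "EXAMPLE-UDID-1"))
open(os.path.join(demo_root, "EXAMPLE-UDID-1", "example-file"), "w").close()
os.mkdir(os.path.join(demo_root, "EXAMPLE-UDID-2"))
found = list_backups(demo_root)
print(found)

# On a Mac, the location named in the article could be passed instead:
# list_backups(os.path.expanduser("~/Library/Application Support/MobileSync/Backup"))
```

Anyone with read access to that folder can run a listing like this, which is one reason to treat the computer holding the backups as part of the device's security boundary.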
This incremental approach is pretty smart: it allows you to restore a device at different points in time without taking too long to perform each backup. Backups are stored in the \Documents and Settings\USERNAME\Application Data\Apple Computer\MobileSync\Backup directory on Windows XP and in the \Users\USERNAME\AppData\Roaming\Apple Computer\MobileSync\Backup directory for newer operating systems.
To enable an iTunes backup, plug a device into a computer, and then open iTunes. Click on the device for it to show the device details screen. The top section of the screen is for Backups (in the following screenshot, you can set a backup to This computer, which takes a backup on the computer you are on). We recommend that you always choose the Encrypt iPhone backup option, as it forces you to save a password in order to restore the backup. Additionally, you can use the Back Up Now button to kick off the first backup, as shown in the following screenshot:
[Screenshot: iTunes]
Viewing iOS data in iTunes
To show why it's important to encrypt backups, let's look at what can be pulled out of those backups. There are a few tools that can extract backups, provided you have a password. Here, we'll look at iBackup Extractor to view the backup of your browsing history, calendars, call history, contacts, iMessages, notes, photos, and voicemails.
To get started, download iBackup Extractor from http://www.wideanglesoftware.com/ibackupextractor. When you open iBackup Extractor for the first time, simply choose the device backup you wish to extract. As you can see in the following screenshot, you will be prompted for a password in order to unlock the Backup key bag. Enter the password to unlock the system.
[Screenshot: Unlock the backups]
Note that the file tree in the following screenshot gives away some information on the structure of the iOS filesystem, or at least, the data stored in the backups of the iOS device.
For now, simply click on Browser to see a list of files that can be extracted from the backup, as you can see in the next screenshot:
[Screenshot: View Device Contents Using iBackup Extractor]
Note the prevalence of SQL databases in the files. Most apps use these types of databases to store data on devices. Also, check out the other options such as extracting notes (many that were possibly deleted), texts (some that have been deleted from devices), and other types of data from devices.
Now that we've exhausted backups and proven that you should really put a password in place for your backups, let's finally get to some basic security tasks to be performed on these devices!
Initial security checklist
Apple has built iOS to be one of the most secure operating systems in the world. This has been made possible by restricting access to much of the operating system by end users, unless you jailbreak a device. In this article, we won't cover jailbreaking devices much due to the fact that securing the devices then becomes a whole new topic. Instead, we have focused on what you need to do, how you can do those tasks, what the impacts are, and how to manage security settings based on a policy.
The basic steps required to secure an iOS device start with encrypting devices, which is done by assigning a passcode to a device. We will then configure how much inactive time passes before a device requires a PIN and accordingly manage the privacy settings. These settings allow us to get some very basic security features under our belt, and set the stage to explain what some of the features actually do.
Configuring a passcode
The first thing most of us need to do on an iOS device is configure a passcode for the device. Several things happen when a passcode is enabled, as shown in the following steps:
  • The device is encrypted.
  • The device then requires a passcode to wake up.
  • An idle timeout is automatically set that puts the device to sleep after a few minutes of inactivity.
This means that three of the most important things you can do to secure a device are enabled when you set up a passcode. Best of all, Apple recommends setting up a passcode during the initial setup of new devices. You can manage passcode settings using policies (or profiles, as Apple likes to call them in iOS). You can also set a passcode and then use your fingerprint on the Home button instead of that passcode. We have found that if our finger is on the Home button as the phone comes out of our pocket, the device is unlocked by the time we look at it. With iPhone 6 and higher versions, you can now use that same fingerprint to secure payment information.
Check whether a passcode has been configured, and if needed, configure a passcode using the Settings app. The Settings app is by default on the Home screen, and it is where many settings on the device are configured, including the Wi-Fi networks the device has been joined to, app preferences, mail accounts, and other settings.
  • To set a passcode, open the Settings app and tap on Touch ID & Passcode
  • If a passcode has been set, you will see the Turn Passcode Off option (as seen in the following screenshot)
  • If a passcode has not been set, then you can do so at this screen as well
  • Additionally, you can change a passcode that has been set using the Change Passcode button and define a fingerprint or additional fingerprints that can be used with Touch ID
There are two options in the USE TOUCH ID FOR section of the screen. You can choose whether or not you need to enter the passcode in order to unlock a phone; you should require it unless the device is also used by small children or as a kiosk. In these cases, you don't need to encrypt or take a backup of the device anyway. The second option is to force the entering of a passcode while using the App Store and iTunes. Purchases can cost you money if someone else is using your device, so let the default value remain, which requires you to enter a passcode to unlock the options.
[Screenshot: Configure a Passcode]
The passcode settings are very easy to configure, so they should be configured when possible. Scroll down on this screen and you'll see several other features, as shown in the next screenshot. The first option on the screen is Simple Passcode. Most users want to use a simple PIN with an iOS device. Trying to use alphanumeric and long passcodes simply causes most users to try to circumvent the requirement. To add a fingerprint as a passcode, simply tap on Add a Fingerprint…, which you can see in the preceding screenshot, and follow the onscreen instructions.
Additionally, the following can be accessed when the device is locked, and you can choose to turn them off:
  • Today: This shows an overview of upcoming calendar items
  • Notifications View: This shows you the recent push notifications (apps that have updates on the device)
  • Siri: This represents the voice control of the device
  • Passbook: This tool is used to make payments and display tickets for concert venues and meetups
  • Reply with Message: This tool allows you to send a text reply to an incoming call (useful if you're on the treadmill)
Each organization can decide whether it considers these options to be a security risk and direct users how to deal with them, or they can implement a policy around these options.
[Screenshot: Passcode Settings]
There aren't a lot of security options around passcodes and encryption, because by and large, Apple secures the device by giving you fewer options than you'll actually use. Under the hood (for example, through Apple Configurator and Mobile Device Management), there are a lot of other options, but these aren't exposed to end users of devices. For the most part, a simple four-digit passcode will suffice for most environments. When you complicate passcodes, devices become much more difficult to unlock, and users tend to look for ways around passcode enforcement policies.
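To put rough numbers on the trade-off between simple and complex passcodes, here is a small worked sketch. It rests on assumptions that are not from the article: guesses are taken to be uniformly random, the attacker is assumed to get ten attempts, and the function names are made up for illustration.

```python
def passcode_space(alphabet_size, length):
    'number of distinct passcodes of the given length'
    return alphabet_size ** length


def guess_probability(alphabet_size, length, attempts):
    'chance that a random guesser succeeds within the allowed attempts'
    space = passcode_space(alphabet_size, length)
    return min(attempts, space) / float(space)


# A four-digit PIN drawn from the digits 0-9
simple_space = passcode_space(10, 4)
simple_risk = guess_probability(10, 4, 10)
# Six alphanumeric characters (26 + 26 + 10 = 62 symbols)
complex_space = passcode_space(62, 6)

print(simple_space)
print(simple_risk)
print(complex_space)
```

Under these assumptions, ten guesses against a four-digit PIN succeed about one time in a thousand, which is why a limit on attempts matters more for short passcodes than for long ones.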
The passcode is only used on the device, so complicating the passcode will only reduce the likelihood that a passcode would be guessed before swiping open a device, which typically occurs within 10 tries. Finally, to disable a passcode and therefore encryption, simply go to the Touch ID & Passcode option in the Settings app and tap on Turn Passcode Off.
Configuring privacy settings
Once a passcode is set and the device is encrypted, it's time to configure the privacy settings. Third-party apps cannot communicate with one another by default in iOS. Therefore, you must enable communication between them (also between third-party apps and built-in iOS apps that have APIs). This is a fundamental concept when it comes to securing iOS devices. To configure privacy options, open the Settings app and tap on the entry for Privacy. On the Privacy screen, you'll see a list of each app that can be communicated with by other apps, as shown in the following screenshot:
[Screenshot: Privacy Options]
As an example, tap on the Location Services entry, as shown in the next screenshot. Here, you can set which apps can communicate with Location Services and when. If an app is set to While Using, the app can communicate with Location Services only when the app is open. If an app is set to Always, then the app can communicate with Location Services both when the app is open and when it runs in the background.
[Screenshot: Configure Location Services]
On the Privacy screen, tap on Photos. Here, you have fewer options because, unlike the location of a device, you can't access photos when the app is running in the background. Here, you can enable or disable an app's ability to communicate with the photo library on a device, as seen in the next screenshot:
[Screenshot: Configure What Apps Can Access Your Camera Roll]
Each app should be configured in such a way that it can communicate only with the features of iOS or other apps that are absolutely necessary. Other privacy options that you can consider disabling include Siri and Handoff.
Siri has the voice controls of an iOS. Because Siri can be used even when your phone is locked, consider to disable it by opening the Settings app, tapping on General and then on Siri, and you will be able disable the voice controls. To disable Handoff, you should use the General System Preference pane in any OS X computer paired to an iOS device. There, uncheck the Allow Handoff between this Mac and your iCloud devices option. Safari and built-in App protections Web browsers have access to a lot of data. One of the most popular targets on other platforms has been web browsers. The default browser on an iOS device is Safari. Open the Settings app and then tap on Safari. The Safari preferences to secure iOS devices include the following: Passwords & AutoFill: This is a screen that includes contact information, a list of saved passwords and credit cards used in web browsers. This data is stored in an iCloud Keychain if iCloud Keychain has been enabled in your phone. Favorites: This performs the function of bookmark management. This shows bookmarks in iOS. Open Links: This configures how links are managed. Block Pop-ups: This enables a pop-up blocker. Scroll down and you'll see the Privacy & Security options (as seen in the next screenshot). Here, you can do the following: Do Not Track: By this, you can block the tracking of browsing activity by websites. Block Cookies: A cookie is a small piece of data sent from a website to a visitor's browser. Many sites will send cookies to third-party sites, so the management of cookies becomes an obstacle to the privacy of many. By default, Safari only allows cookies from websites that you visit (Allow from Websites I Visit). Set the Cookies option to Always Block in order to disable its ability to accept any cookies; set the option to Always Allow to accept cookies from any source; and set the option to Allow from Current Website Only to only allow cookies from certain websites. 
Fraudulent Website Warning: This warns about phishing sites (sites that only exist to steal personal information). Clear History and Website Data: This clears any cached history, web files, and passwords from the Safari browser. Use Cellular Data: When this option is turned off, it disables web traffic over cellular connections (so web traffic will only work when the phone is connected to a Wi-Fi network). Configure Privacy Settings for Safari There are also a number of advanced options that can be accessed by tapping on the Advanced button, as shown in the following screenshot: Configure the Advanced Safari Options These advanced options include the following: Website Data: This option (as you can see in the next screenshot) shows the amount of data stored from each site that caches files on the device, and allows you to swipe left on these entries to access any files saved for the site. Tap on Remove All Website Data to remove data for all the sites at once. JavaScript: This allows you to disable JavaScript from running on sites the device browses. Web Inspector: This shows the device in the Develop menu on a computer connected to the device. If the Web Inspector option has been disabled, use Advanced Preferences in the Safari Preferences option of Safari. View Website Data On Devices Browser security is an important aspect of any operating system. Predictive search and Spotlight The final aspect of securing the settings on an iOS device that we'll cover in this article is predictive search and Spotlight. When you use the Spotlight feature in iOS, usage data is sent to Apple along with the information from Location Services. That data is then used to generate future searches. Additionally, you can search for anything on a device, including items previously blocked from being accessed. The ability to search for blocked content is what warrants including Spotlight when locking down a device. 
This feature can be disabled by opening the Settings app, tapping on Privacy, then Location Services, and then System Services. Simply slide Spotlight Suggestions to Off to stop the location data from going over that connection. To limit the type of data that Spotlight sends, open the Settings app, tap on General, and then on Spotlight Search. Uncheck each item you don't want indexed in the Spotlight database. The following screenshot shows the mentioned options: Configure What Spotlight Indexes These were some of the basic tactical tasks that secure devices. Summary This article was a whirlwind of quick changes that secure a device. Here, we paired devices, took a backup, set a passcode, and secured app data and Safari. We showed how to manually do some tasks that are otherwise set via policies.

Packt
04 Mar 2015
38 min read

KnockoutJS Templates

In this article by Jorge Ferrando, author of the book KnockoutJS Essentials, we are going to talk about how to design our templates with the native engine, and then about the mechanisms and external libraries we can use to improve the Knockout template engine. When our code begins to grow, it's necessary to split it into several parts to keep it maintainable. When we split JavaScript code, we are talking about modules, classes, functions, libraries, and so on. When we talk about HTML, we call these parts templates. KnockoutJS has a native template engine that we can use to manage our HTML. It is very simple, but it also has a big inconvenience: templates must be loaded in the current HTML page. This is not a problem if our app is small, but it could be a problem if our application begins to need more and more templates. Preparing the project First of all, we are going to add some style to the page. Add a file called style.css into the css folder. Add a reference in the index.html file, just below the bootstrap reference. 
The following is the content of the file: .container-fluid { margin-top: 20px; } .row { margin-bottom: 20px; } .cart-unit { width: 80px; } .btn-xs { font-size:8px; } .list-group-item { overflow: hidden; } .list-group-item h4 { float:left; width: 100px; } .list-group-item .input-group-addon { padding: 0; } .btn-group-vertical > .btn-default { border-color: transparent; } .form-control[disabled], .form-control[readonly] { background-color: transparent !important; } Now remove all the content from the body tag except for the script tags and paste in these lines: <div class="container-fluid"> <div class="row" id="catalogContainer">    <div class="col-xs-12"       data-bind="template:{name:'header'}"></div>    <div class="col-xs-6"       data-bind="template:{name:'catalog'}"></div>    <div id="cartContainer" class="col-xs-6 well hidden"       data-bind="template:{name:'cart'}"></div> </div> <div class="row hidden" id="orderContainer"     data-bind="template:{name:'order'}"> </div> <div data-bind="template: {name:'add-to-catalog-modal'}"></div> <div data-bind="template: {name:'finish-order-modal'}"></div> </div> Let's review this code. We have two row classes. They will be our containers. The first container is named with the id value as catalogContainer and it will contain the catalog view and the cart. The second one is referenced by the id value as orderContainer and we will set our final order there. We also have two more <div> tags at the bottom that will contain the modal dialogs to show the form to add products to our catalog and the other one will contain a modal message to tell the user that our order is finished. Along with this code you can see a template binding inside the data-bind attribute. This is the binding that Knockout uses to bind templates to the element. It contains a name parameter that represents the ID of a template. 
<div class="col-xs-12" data-bind="template:{name:'header'}"></div> In this example, this <div> element will contain the HTML that is inside the <script> tag with the ID header. Creating templates Template elements are commonly declared at the bottom of the body, just above the <script> tags that have references to our external libraries. We are going to define some templates and then we will talk about each one of them: <!-- templates --> <script type="text/html" id="header"></script> <script type="text/html" id="catalog"></script> <script type="text/html" id="add-to-catalog-modal"></script> <script type="text/html" id="cart-widget"></script> <script type="text/html" id="cart-item"></script> <script type="text/html" id="cart"></script> <script type="text/html" id="order"></script> <script type="text/html" id="finish-order-modal"></script> Each template name is descriptive enough by itself, so it's easy to know what we are going to set inside them. Let's see a diagram showing where we dispose each template on the screen:   Notice that the cart-item template will be repeated for each item in the cart collection. Modal templates will appear only when a modal dialog is displayed. Finally, the order template is hidden until we click to confirm the order. In the header template, we will have the title and the menu of the page. The add-to-catalog-modal template will contain the modal that shows the form to add a product to our catalog. The cart-widget template will show a summary of our cart. The cart-item template will contain the template of each item in the cart. The cart template will have the layout of the cart. The order template will show the final list of products we want to buy and a button to confirm our order. 
The header template Let's begin with the HTML markup that should contain the header template: <script type="text/html" id="header"> <h1>    Catalog </h1>   <button class="btn btn-primary btn-sm" data-toggle="modal"     data-target="#addToCatalogModal">    Add New Product </button> <button class="btn btn-primary btn-sm" data-bind="click:     showCartDetails, css:{ disabled: cart().length < 1}">    Show Cart Details </button> <hr/> </script> We define a <h1> tag, and two <button> tags. The first button tag is attached to the modal that has the ID #addToCatalogModal. Since we are using Bootstrap as the CSS framework, we can attach modals by ID using the data-target attribute, and activate the modal using the data-toggle attribute. The second button will show the full cart view and it will be available only if the cart has items. To achieve this, there are a number of different ways. The first one is to use the CSS-disabled class that comes with Twitter Bootstrap. This is the way we have used in the example. CSS binding allows us to activate or deactivate a class in the element depending on the result of the expression that is attached to the class. The other method is to use the enable binding. This binding enables an element if the expression evaluates to true. We can use the opposite binding, which is named disable. There is a complete documentation on the Knockout website http://knockoutjs.com/documentation/enable-binding.html: <button class="btn btn-primary btn-sm" data-bind="click:   showCartDetails, enable: cart().length > 0"> Show Cart Details </button>   <button class="btn btn-primary btn-sm" data-bind="click:   showCartDetails, disable: cart().length < 1"> Show Cart Details </button> The first method uses CSS classes to enable and disable the button. The second method uses the HTML attribute, disabled. We can use a third option, which is to use a computed observable. 
We can create a computed observable variable in our view-model that returns true or false depending on the length of the cart: //in the viewmodel. Remember to expose it var cartHasProducts = ko.computed(function(){ return (cart().length > 0); }); //HTML <button class="btn btn-primary btn-sm" data-bind="click: showCartDetails, enable: cartHasProducts"> Show Cart Details </button> To show the cart, we will use the click binding. Now we should go to our viewmodel.js file and add all the information we need to make this template work: var cart = ko.observableArray([]); var showCartDetails = function () { if (cart().length > 0) { $("#cartContainer").removeClass("hidden"); } }; And you should expose these two objects in the view-model: return { searchTerm: searchTerm, catalog: filteredCatalog, newProduct: newProduct, totalItems:totalItems, addProduct: addProduct, cart: cart, showCartDetails: showCartDetails, }; The catalog template The next step is to define the catalog template just below the header template: <script type="text/html" id="catalog"> <div class="input-group"> <span class="input-group-addon"> <i class="glyphicon glyphicon-search"></i> Search </span> <input type="text" class="form-control" data-bind="textInput: searchTerm"> </div> <table class="table"> <thead> <tr> <th>Name</th> <th>Price</th> <th>Stock</th> <th></th> </tr> </thead> <tbody data-bind="foreach:catalog"> <tr data-bind="style:{color: stock() < 5 ? 'red' : 'black'}"> <td data-bind="text:name"></td> <td data-bind="text:price"></td> <td data-bind="text:stock"></td> <td> <button class="btn btn-primary" data-bind="click:$parent.addToCart"> <i class="glyphicon glyphicon-plus-sign"></i> Add </button> </td> </tr> </tbody> <tfoot> <tr> <td colspan="3"> <strong>Items:</strong><span data-bind="text:catalog().length"></span> </td> 
<td colspan="1"> <span data-bind="template:{name:'cart-widget'}"></span> </td> </tr> </tfoot> </table> </script> Now, each line uses the style binding to alert the user, while they are shopping, that the stock is reaching the minimum limit. The style binding works the same way that CSS binding does with classes. It allows us to add style attributes depending on the value of the expression. In this case, the color of the text in the line must be black if the stock is five or more, and red if it is four or less. We can use other CSS attributes, so feel free to try other behaviors. For example, set the line of the catalog to green if the element is inside the cart. We should remember that if an attribute has dashes, you should wrap it in single quotes. For example, background-color will throw an error, so you should write 'background-color'. When we work with bindings that are activated depending on the values of the viewmodel, it is good practice to use computed observables. Therefore, we can create a computed value in our product model that returns the value of the color that should be displayed: //In the Product.js var _lineColor = ko.computed(function(){ return (_stock() < 5)? 'red' : 'black'; }); return { lineColor:_lineColor }; //In the template <tr data-bind="style:{color: lineColor}"> ... </tr> It would be even better if we create a class in our style.css file that is called stock-alert and we use the CSS binding, which takes an object that maps a class name to the expression that switches it on: //In the style file .stock-alert { color: #f00; } //In the Product.js var _hasStock = ko.computed(function(){ return (_stock() < 5); }); return { hasStock: _hasStock }; //In the template <tr data-bind="css:{'stock-alert': hasStock}"> ... </tr> Now, look inside the <tfoot> tag. <td colspan="1"> <span data-bind="template:{name:'cart-widget'}"></span> </td> As you can see, we can have nested templates. In this case, we have the cart-widget template inside our catalog template. 
This gives us the possibility of having very complex templates, splitting them into very small pieces, and combining them, to keep our code clean and maintainable. Finally, look at the last cell of each row: <td> <button class="btn btn-primary" data-bind="click:$parent.addToCart"> <i class="glyphicon glyphicon-plus-sign"></i> Add </button> </td> Look at how we call the addToCart method using the magic variable $parent. Knockout gives us some magic words to navigate through the different contexts we have in our app. In this case, we are in the catalog context and we want to call a method that lies one level up. We can use the magic variable called $parent. There are other variables we can use when we are inside a Knockout context. There is complete documentation on the Knockout website http://knockoutjs.com/documentation/binding-context.html. In this project, we are not going to use all of them, but we are going to quickly explain these binding context variables, just to understand them better. If we don't know how many levels deep we are, we can navigate to the top of the view-model using the magic word $root. When we have many parents, we can get the magic array $parents and access each parent using indexes, for example, $parents[0], $parents[1]. Imagine that you have a list of categories where each category contains a list of products. These products are a list of IDs and the category has a method to get the name of its products. We can use the $parents array to obtain the reference to the category: <ul data-bind="foreach: {data: categories}"> <li data-bind="text: $data.name"></li> <ul data-bind="foreach: {data: $data.products, as: 'prod'}"> <li data-bind="text: $parents[0].getProductName(prod.ID)"></li> </ul> </ul> Look how helpful the as attribute is inside the foreach binding. It makes code more readable. 
But if you are inside a foreach loop, you can also access each item using the $data magic variable, and you can access the position index that each element has in the collection using the $index magic variable. For example, if we have a list of products, we can do this: <ul data-bind="foreach: cart"> <li><span data-bind="text:$index"> </span> - <span data-bind="text:$data.name"></span></li> </ul> This should display: 0 - Product 1 1 - Product 2 2 - Product 3 ... KnockoutJS magic variables to navigate through contexts Now that we know more about what binding variables are, let's go back to our code. We are now going to write the addToCart method. We are going to define the cart items in our js/models folder. Create a file called CartProduct.js and insert the following code in it: //js/models/CartProduct.js var CartProduct = function (product, units) { "use strict"; var _product = product, _units = ko.observable(units); var subtotal = ko.computed(function(){ return _product.price() * _units(); }); var addUnit = function () { var u = _units(); var _stock = _product.stock(); if (_stock === 0) { return; } _units(u+1); _product.stock(--_stock); }; var removeUnit = function () { var u = _units(); var _stock = _product.stock(); if (u === 0) { return; } _units(u-1); _product.stock(++_stock); }; return { product: _product, units: _units, subtotal: subtotal, addUnit : addUnit, removeUnit: removeUnit, }; }; Each cart product is composed of the product itself and the units of the product we want to buy. We will also have a computed field that contains the subtotal of the line. We should give the object the responsibility for managing its units and the stock of the product. For this reason, we have added the addUnit and removeUnit methods. When called, these methods add or remove one unit of the product and adjust the product's stock accordingly; addUnit does nothing when the stock is zero, and removeUnit does nothing when there are no units left in the line. 
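The unit and stock bookkeeping described above can be tried out without a browser. The following is only a sketch: observable is a hypothetical hand-rolled stand-in for ko.observable (call it with no arguments to read, with one argument to write), makeCartLine is an illustrative reduction of CartProduct, and the T-Shirt product is made-up sample data.

```javascript
// Hypothetical stand-in for ko.observable, used only so this sketch runs
// outside a page that loads Knockout: no argument reads, one argument writes.
function observable(value) {
  return function () {
    if (arguments.length > 0) { value = arguments[0]; }
    return value;
  };
}

// Same bookkeeping as CartProduct, reduced to the unit and stock handling.
function makeCartLine(product, units) {
  var _units = observable(units);
  return {
    product: product,
    units: _units,
    addUnit: function () {
      var stock = product.stock();
      if (stock === 0) { return; }        // nothing left to sell
      _units(_units() + 1);
      product.stock(stock - 1);
    },
    removeUnit: function () {
      if (_units() === 0) { return; }     // nothing to give back
      _units(_units() - 1);
      product.stock(product.stock() + 1);
    }
  };
}

// Made-up sample product with two units in stock.
var tshirt = { name: observable("T-Shirt"), price: observable(10), stock: observable(2) };
var line = makeCartLine(tshirt, 0);

line.addUnit();
line.addUnit();
line.addUnit();                            // ignored: stock is already 0
console.log(line.units(), tshirt.stock()); // 2 0

line.removeUnit();
console.log(line.units(), tshirt.stock()); // 1 1
```

Whatever happens, the units in the cart line plus the remaining stock always add up to the original stock, which is the invariant the two methods are there to protect.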
We should reference this JavaScript file in our index.html file with the other <script> tags. In the viewmodel, we should create a cart array and expose it in the return statement, as we have done earlier: var cart = ko.observableArray([]); It's time to write the addToCart method: var addToCart = function(data) { var item = null; var tmpCart = cart(); var n = tmpCart.length; while(n--) { if (tmpCart[n].product.id() === data.id()) { item = tmpCart[n]; } } if (item) { item.addUnit(); } else { item = new CartProduct(data,0); item.addUnit(); tmpCart.push(item); } cart(tmpCart); }; This method searches for the product in the cart. If it exists, it updates its units, and if not, it creates a new one. Since the cart is an observable array, we need to get the underlying array, manipulate it, and write it back, because we need to access each product object to know if the product is already in the cart. Remember that observable arrays do not observe the objects they contain, just which objects are in the array. The add-to-catalog-modal template This is a very simple template. 
We just wrap the code to add a product to a Bootstrap modal: <script type="text/html" id="add-to-catalog-modal"> <div class="modal fade" id="addToCatalogModal">    <div class="modal-dialog">      <div class="modal-content">        <form class="form-horizontal" role="form"           data-bind="with:newProduct">          <div class="modal-header">            <button type="button" class="close"               data-dismiss="modal">              <span aria-hidden="true">&times;</span>              <span class="sr-only">Close</span>            </button><h3>Add New Product to the Catalog</h3>          </div>          <div class="modal-body">            <div class="form-group">              <div class="col-sm-12">                <input type="text" class="form-control"                  placeholder="Name" data-bind="textInput:name">              </div>            </div>            <div class="form-group">              <div class="col-sm-12">                <input type="text" class="form-control"                   placeholder="Price" data-bind="textInput:price">              </div>            </div>            <div class="form-group">              <div class="col-sm-12">                <input type="text" class="form-control"                   placeholder="Stock" data-bind="textInput:stock">              </div>            </div>          </div>          <div class="modal-footer">            <div class="form-group">              <div class="col-sm-12">                <button type="submit" class="btn btn-default"                  data-bind="{click:$parent.addProduct}">                  <i class="glyphicon glyphicon-plus-sign">                  </i> Add Product                </button>              </div>            </div>          </div>        </form>      </div><!-- /.modal-content -->    </div><!-- /.modal-dialog --> </div><!-- /.modal --> </script> The cart-widget template This template gives the user information quickly about how many items are in the cart and how much all 
of them cost: <script type="text/html" id="cart-widget"> Total Items: <span data-bind="text:totalItems"></span> Price: <span data-bind="text:grandTotal"></span> </script> We should define totalItems and grandTotal in our viewmodel: var totalItems = ko.computed(function(){ var tmpCart = cart(); var total = 0; tmpCart.forEach(function(item){    total += parseInt(item.units(),10); }); return total; }); var grandTotal = ko.computed(function(){ var tmpCart = cart(); var total = 0; tmpCart.forEach(function(item){    total += (item.units() * item.product.price()); }); return total; }); Now you should expose them in the return statement, as we always do. Don't worry about the format now, you will learn how to format currency or any kind of data in the future. Now you must focus on learning how to manage information and how to show it to the user. The cart-item template The cart-item template displays each line in the cart: <script type="text/html" id="cart-item"> <div class="list-group-item" style="overflow: hidden">    <button type="button" class="close pull-right" data-bind="click:$root.removeFromCart"><span>&times;</span></button>    <h4 class="" data-bind="text:product.name"></h4>    <div class="input-group cart-unit">      <input type="text" class="form-control" data-bind="textInput:units" readonly/>        <span class="input-group-addon">          <div class="btn-group-vertical">            <button class="btn btn-default btn-xs"               data-bind="click:addUnit">              <i class="glyphicon glyphicon-chevron-up"></i>            </button>            <button class="btn btn-default btn-xs"               data-bind="click:removeUnit">              <i class="glyphicon glyphicon-chevron-down"></i>            </button>          </div>        </span>    </div> </div> </script> We set an x button in the top-right of each line to easily remove a line from the cart. 
As you can see, we have used the $root magic variable to navigate to the top context because we are going to use this template inside a foreach loop, and it means this template will be in the loop context. If we consider this template as an isolated element, we can't be sure how deep we are in the context navigation. To be sure that we reach the right context to call the removeFromCart method, it's better to use $root instead of $parent in this case. The code for removeFromCart should lie in the viewmodel context and should look like this: var removeFromCart = function (data) { var units = data.units(); var stock = data.product.stock(); data.product.stock(units+stock); cart.remove(data); }; Notice that in the addToCart method, we get the array that is inside the observable. We did that because we need to navigate inside the elements of the array. In this case, we don't need to, because Knockout observable arrays have a method called remove that allows us to remove the object that we pass as a parameter. If the object is in the array, it will be removed. Remember that the data context is always passed as the first parameter in the function we use in the click events. The cart template The cart template should display the layout of the cart: <script type="text/html" id="cart"> <button type="button" class="close pull-right" data-bind="click:hideCartDetails"> <span>&times;</span> </button> <h1>Cart</h1> <div data-bind="template: {name: 'cart-item', foreach:cart}" class="list-group"></div> <div data-bind="template:{name:'cart-widget'}"></div> <button class="btn btn-primary btn-sm" data-bind="click:showOrder"> Confirm Order </button> </script> It's important that you notice the template binding that we have just below <h1>Cart</h1>. We are binding a template with an array using the foreach argument. With this binding, Knockout renders the cart-item template for each element inside the cart collection. 
This considerably reduces the code we write in each template and, in addition, makes them more readable. We have once again used the cart-widget template to show the total items and the total amount. This is one of the good features of templates: we can reuse content over and over. Observe that we have put a button at the top-right of the cart to close it when we don't need to see the details of our cart, and another one to confirm the order when we are done. The code in our viewmodel should be as follows: var hideCartDetails = function () { $("#cartContainer").addClass("hidden"); }; var showOrder = function () { $("#catalogContainer").addClass("hidden"); $("#orderContainer").removeClass("hidden"); }; As you can see, to show and hide elements we use jQuery and CSS classes from the Bootstrap framework. The hidden class just adds the display: none style to the elements. We just need to toggle this class to show or hide elements in our view. Expose these two methods in the return statement of your view-model. We will come back to this when we need to display the order template. This is the result once we have our catalog and our cart: The order template Once we have clicked on the Confirm Order button, the order should be shown to us, so that we can review it and confirm if we agree. 
<script type="text/html" id="order"> <div class="col-xs-12">    <button class="btn btn-sm btn-primary"       data-bind="click:showCatalog">      Back to catalog    </button>    <button class="btn btn-sm btn-primary"       data-bind="click:finishOrder">      Buy & finish    </button> </div> <div class="col-xs-6">    <table class="table">      <thead>      <tr>        <th>Name</th>        <th>Price</th>        <th>Units</th>        <th>Subtotal</th>      </tr>      </thead>      <tbody data-bind="foreach:cart">      <tr>        <td data-bind="text:product.name"></td>        <td data-bind="text:product.price"></td>        <td data-bind="text:units"></td>        <td data-bind="text:subtotal"></td>      </tr>      </tbody>      <tfoot>      <tr>        <td colspan="3"></td>        <td>Total:<span data-bind="text:grandTotal"></span></td>      </tr>      </tfoot>    </table> </div> </script> Here we have a read-only table with all cart lines and two buttons. One is to confirm, which will show the modal dialog saying the order is completed, and the other gives us the option to go back to the catalog and keep on shopping. There is some code we need to add to our viewmodel and expose to the user: var showCatalog = function () { $("#catalogContainer").removeClass("hidden"); $("#orderContainer").addClass("hidden"); }; var finishOrder = function() { cart([]); hideCartDetails(); showCatalog(); $("#finishOrderModal").modal('show'); }; As we have done in previous methods, we add and remove the hidden class from the elements we want to show and hide. The finishOrder method removes all the items of the cart because our order is complete; hides the cart and shows the catalog. It also displays a modal that gives confirmation to the user that the order is done.  
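The numbers in the order table can be checked with plain data before wiring them to observables. This is a minimal sketch: cartLines, subtotal, and grandTotal here are plain-JavaScript stand-ins written for illustration (the article's versions are Knockout computed observables), and the products are made-up sample data.

```javascript
// Plain-data stand-in for the cart: each line has a price and a unit count.
var cartLines = [
  { name: "T-Shirt", price: 10, units: 2 },
  { name: "Trousers", price: 20, units: 1 }
];

// Subtotal of one line, as shown in the Subtotal column.
function subtotal(line) {
  return line.price * line.units;
}

// Sum of all subtotals, as shown in the Total cell of the table footer.
function grandTotal(lines) {
  return lines.reduce(function (sum, line) {
    return sum + subtotal(line);
  }, 0);
}

console.log(cartLines.map(subtotal)); // [ 20, 20 ]
console.log(grandTotal(cartLines));   // 40

// finishOrder empties the cart with cart([]); the total then drops to 0.
cartLines = [];
console.log(grandTotal(cartLines));   // 0
```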
Order details template The finish-order-modal template The last template is the modal that tells the user that the order is complete: <script type="text/html" id="finish-order-modal"> <div class="modal fade" id="finishOrderModal">    <div class="modal-dialog">            <div class="modal-content">        <div class="modal-body">        <h2>Your order has been completed!</h2>        </div>        <div class="modal-footer">          <div class="form-group">            <div class="col-sm-12">              <button type="submit" class="btn btn-success"                 data-dismiss="modal">Continue Shopping              </button>            </div>          </div>        </div>      </div><!-- /.modal-content -->    </div><!-- /.modal-dialog --> </div><!-- /.modal --> </script> The following screenshot displays the output:   Handling templates with if and ifnot bindings You have learned how to show and hide templates with the power of jQuery and Bootstrap. This is quite good because you can use this technique with any framework you want. The problem with this type of code is that since jQuery is a DOM manipulation library, you need to reference elements to manipulate them. This means you need to know over which element you want to apply the action. Knockout gives us some bindings to hide and show elements depending on the values of our view-model. Let's update the show and hide methods and the templates. Add both the control variables to your viewmodel and expose them in the return statement. var visibleCatalog = ko.observable(true); var visibleCart = ko.observable(false); Now update the show and hide methods: var showCartDetails = function () { if (cart().length > 0) {    visibleCart(true); } };   var hideCartDetails = function () { visibleCart(false); };   var showOrder = function () { visibleCatalog(false); };   var showCatalog = function () { visibleCatalog(true); }; We can appreciate how the code becomes more readable and meaningful. 
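To see how the two flags interact, the four methods can be exercised on their own. The sketch below is not the article's view-model: observable is a hypothetical stand-in for ko.observable so that the code runs without Knockout, and makeViewState and cartItems are illustrative names.

```javascript
// Hypothetical stand-in for ko.observable so the sketch runs without Knockout.
function observable(value) {
  return function () {
    if (arguments.length > 0) { value = arguments[0]; }
    return value;
  };
}

// Groups the two control variables with the methods that change them.
function makeViewState(cart) {
  var visibleCatalog = observable(true);
  var visibleCart = observable(false);
  return {
    visibleCatalog: visibleCatalog,
    visibleCart: visibleCart,
    showCartDetails: function () {
      if (cart().length > 0) { visibleCart(true); } // an empty cart stays hidden
    },
    hideCartDetails: function () { visibleCart(false); },
    showOrder: function () { visibleCatalog(false); }, // the order is shown whenever the catalog is not
    showCatalog: function () { visibleCatalog(true); }
  };
}

var cartItems = observable([]);
var view = makeViewState(cartItems);

view.showCartDetails();
console.log(view.visibleCart());    // false

cartItems(["T-Shirt"]);
view.showCartDetails();
console.log(view.visibleCart());    // true

view.showOrder();
console.log(view.visibleCatalog()); // false
```

Because the methods only touch data, they can be tested like this without any DOM, which is not possible with the jQuery version that adds and removes the hidden class.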
Now, update the cart template, the catalog template, and the order template. In index.html, consider this line: <div class="row" id="catalogContainer"> Replace it with the following line: <div class="row" data-bind="if: visibleCatalog"> Then consider the following line: <div id="cartContainer" class="col-xs-6 well hidden"   data-bind="template:{name:'cart'}"></div> Replace it with this one: <div class="col-xs-6" data-bind="if: visibleCart"> <div class="well" data-bind="template:{name:'cart'}"></div> </div> It is important to know that the if binding and the template binding can't share the same data-bind attribute. This is why we go from one element to two nested elements in this template. In other words, this example is not allowed: <div class="col-xs-6" data-bind="if:visibleCart,   template:{name:'cart'}"></div> Finally, consider this line: <div class="row hidden" id="orderContainer"   data-bind="template:{name:'order'}"> Replace it with this one: <div class="row" data-bind="ifnot: visibleCatalog"> <div data-bind="template:{name:'order'}"></div> </div> With the changes we have made, showing or hiding elements now depends on our data and not on our CSS. This is much better because now we can show and hide any element we want using the if and ifnot binding. 
Let's review, roughly speaking, how our files are now: We have our index.html file that has the main container, templates, and libraries: <!DOCTYPE html> <html> <head> <title>KO Shopping Cart</title> <meta name="viewport" content="width=device-width,     initial-scale=1"> <link rel="stylesheet" type="text/css"     href="css/bootstrap.min.css"> <link rel="stylesheet" type="text/css" href="css/style.css"> </head> <body>   <div class="container-fluid"> <div class="row" data-bind="if: visibleCatalog">    <div class="col-xs-12"       data-bind="template:{name:'header'}"></div>    <div class="col-xs-6"       data-bind="template:{name:'catalog'}"></div>    <div class="col-xs-6" data-bind="if: visibleCart">      <div class="well" data-bind="template:{name:'cart'}"></div>    </div> </div> <div class="row" data-bind="ifnot: visibleCatalog">    <div data-bind="template:{name:'order'}"></div> </div> <div data-bind="template: {name:'add-to-catalog-modal'}"></div> <div data-bind="template: {name:'finish-order-modal'}"></div> </div>   <!-- templates --> <script type="text/html" id="header"> ... </script> <script type="text/html" id="catalog"> ... </script> <script type="text/html" id="add-to-catalog-modal"> ... </script> <script type="text/html" id="cart-widget"> ... </script> <script type="text/html" id="cart-item"> ... </script> <script type="text/html" id="cart"> ... </script> <script type="text/html" id="order"> ... </script> <script type="text/html" id="finish-order-modal"> ... 
</script> <!-- libraries --> <script type="text/javascript"   src="js/vendors/jquery.min.js"></script> <script type="text/javascript"   src="js/vendors/bootstrap.min.js"></script> <script type="text/javascript"   src="js/vendors/knockout.debug.js"></script> <script type="text/javascript"   src="js/models/product.js"></script> <script type="text/javascript"   src="js/models/cartProduct.js"></script> <script type="text/javascript" src="js/viewmodel.js"></script> </body> </html> We also have our viewmodel.js file: var vm = (function () { "use strict"; var visibleCatalog = ko.observable(true); var visibleCart = ko.observable(false); var catalog = ko.observableArray([...]); var cart = ko.observableArray([]); var newProduct = {...}; var totalItems = ko.computed(function(){...}); var grandTotal = ko.computed(function(){...}); var searchTerm = ko.observable(""); var filteredCatalog = ko.computed(function () {...}); var addProduct = function (data) {...}; var addToCart = function(data) {...}; var removeFromCart = function (data) {...}; var showCartDetails = function () {...}; var hideCartDetails = function () {...}; var showOrder = function () {...}; var showCatalog = function () {...}; var finishOrder = function() {...}; return {    searchTerm: searchTerm,    catalog: filteredCatalog,    cart: cart,    newProduct: newProduct,    totalItems:totalItems,    grandTotal:grandTotal,    addProduct: addProduct,    addToCart: addToCart,    removeFromCart:removeFromCart,    visibleCatalog: visibleCatalog,    visibleCart: visibleCart,    showCartDetails: showCartDetails,    hideCartDetails: hideCartDetails,    showOrder: showOrder,    showCatalog: showCatalog,    finishOrder: finishOrder }; })(); ko.applyBindings(vm); It is useful to debug to globalize the view-model. It is not good practice in production environments, but it is good when you are debugging your application. 
window.vm = vm; Now you have easy access to your view-model from the browser debugger or from your IDE debugger. In addition to the product model, we have created a new model called CartProduct: var CartProduct = function (product, units) { "use strict"; var _product = product,    _units = ko.observable(units); var subtotal = ko.computed(function(){...}); var addUnit = function () {...}; var removeUnit = function () {...}; return {    product: _product,    units: _units,    subtotal: subtotal,    addUnit : addUnit,    removeUnit: removeUnit }; }; You have learned how to manage templates with Knockout, but maybe you have noticed that having all templates in the index.html file is not the best approach. We are going to talk about two mechanisms. The first one is more home-made and the second one is an external library used by lots of Knockout developers, created by Jim Cowart, called Knockout.js-External-Template-Engine (https://github.com/ifandelse/Knockout.js-External-Template-Engine). Managing templates with jQuery Since we want to load templates from different files, let's move all our templates to a folder called views and make one file per template. Each file is named after the ID of the template it holds. So if the template has the ID cart-item, the file should be called cart-item.html and will contain the full cart-item template: <script type="text/html" id="cart-item"></script>  The views folder with all templates Now in the viewmodel.js file, remove the last line (ko.applyBindings(vm)) and add this code: var templates = [ 'header', 'catalog', 'cart', 'cart-item', 'cart-widget', 'order', 'add-to-catalog-modal', 'finish-order-modal' ];   var busy = templates.length; templates.forEach(function(tpl){ "use strict"; $.get('views/'+ tpl + '.html').then(function(data){    $('body').append(data);    busy--;    if (!busy) {      ko.applyBindings(vm);    } }); }); This code gets all the templates we need and appends them to the body. 
Once all the templates are loaded, we call the applyBindings method. We should do it this way because we are loading templates asynchronously and we need to make sure that we bind our view-model only when all templates are loaded. This is good enough to make our code more maintainable and readable, but is still problematic if we need to handle lots of templates. Furthermore, if we have nested folders, listing all our templates in one array becomes a headache. There should be a better approach. Managing templates with koExternalTemplateEngine We have seen two ways of loading templates; both are good enough to manage a small number of templates, but as the code base grows, we need something that allows us to forget about template management. We just want to call a template and get the content. For this purpose, Jim Cowart's library, koExternalTemplateEngine, is a good fit. This project was abandoned by the author in 2014, but it is still a good library that we can use when we develop simple projects. We just need to download the library into the js/vendors folder and then link it in our index.html file just below the Knockout library. <script type="text/javascript" src="js/vendors/knockout.debug.js"></script> <script type="text/javascript"   src="js/vendors/koExternalTemplateEngine_all.min.js"></script> Now you should configure it in the viewmodel.js file. Remove the templates array and the forEach call, and add these three lines of code: infuser.defaults.templateSuffix = ".html"; infuser.defaults.templateUrl = "views"; ko.applyBindings(vm); Here, infuser is a global variable that we use to configure the template engine. We indicate which suffix our templates have and which folder they are in. We don't need the <script type="text/html" id="template-id"></script> tags any more, so we should remove them from each file. So now everything should be working, and it did not take much code to get there. 
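For projects that stay with the hand-rolled approach from the jQuery section, the counting logic can be pulled into a small reusable function. The following is a sketch under stated assumptions, not the article's code: loadTemplates and fetchTemplate are hypothetical names, and fetchTemplate stands for any function that fetches one template and hands the markup to a callback (in the browser it could wrap $.get('views/' + name + '.html')).

```javascript
// Loads every named template and calls onReady exactly once, after the last
// one has arrived. fetchTemplate(name, done) is a hypothetical loader that
// calls done(markup) when the template is available.
function loadTemplates(names, fetchTemplate, onReady) {
  var pending = names.length;
  var loaded = {};

  // With an empty list there is nothing to wait for, so bind immediately.
  if (pending === 0) {
    onReady(loaded);
    return;
  }

  names.forEach(function (name) {
    fetchTemplate(name, function (markup) {
      loaded[name] = markup;
      pending -= 1;
      if (pending === 0) {
        onReady(loaded); // e.g. append the markup, then ko.applyBindings(vm)
      }
    });
  });
}
```

Keeping the loader separate from jQuery and Knockout makes the "bind only after everything has loaded" rule easy to test with a fake fetchTemplate. Note that this sketch, like the original loop, does not handle failed requests; a template that never arrives would leave the bindings unapplied.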
KnockoutJS has its own template engine, but you can see that adding new ones is not difficult. If you have experience with other template engines such as jQuery Templates, Underscore, or Handlebars, just load them in your index.html file and use them, there is no problem with that. This is why Knockout is beautiful, you can use any tool you like with it. You have learned a lot of things in this article, haven't you? Knockout gives us the CSS binding to activate and deactivate CSS classes according to an expression. We can use the style binding to add CSS rules to elements. The template binding helps us to manage templates that are already loaded in the DOM. We can iterate along collections with the foreach binding. Inside a foreach, Knockout gives us some magic variables such as $parent, $parents, $index, $data, and $root. We can use the binding as along with the foreach binding to get an alias for each element. We can show and hide content using just jQuery and CSS. We can show and hide content using the bindings: if, ifnot, and visible. jQuery helps us to load Knockout templates asynchronously. You can use the koExternalTemplateEngine plugin to manage templates in a more efficient way. The project is abandoned but it is still a good solution. Summary In this article, you have learned how to split an application using templates that share the same view-model. Now that we know the basics, it would be interesting to extend the application. Maybe we can try to create a detailed view of the product, or maybe we can give the user the option to register where to send the order. Resources for Article: Further resources on this subject: Components [article] Web Application Testing [article] Top features of KnockoutJS [article]
Packt
04 Mar 2015
20 min read

AngularJS Performance

In this article by Chandermani, the author of AngularJS by Example, we focus our discussion on the performance aspect of AngularJS. For most scenarios, we can all agree that AngularJS is insanely fast. For standard size views, we rarely see any performance bottlenecks. But many views start small and then grow over time. And sometimes the requirement dictates we build large pages/views with a sizable amount of HTML and data. In such a case, there are things that we need to keep in mind to provide an optimal user experience. Take any framework and the performance discussion on the framework always requires one to understand the internal working of the framework. When it comes to Angular, we need to understand how Angular detects model changes. What are watches? What is a digest cycle? What roles do scope objects play? Without a conceptual understanding of these subjects, any performance guidance is merely a checklist that we follow without understanding the why part. Let's look at some pointers before we begin our discussion on performance of AngularJS: The live binding between the view elements and model data is set up using watches. When a model changes, one or many watches linked to the model are triggered. Angular's view binding infrastructure uses these watches to synchronize the view with the updated model value. Model change detection only happens when a digest cycle is triggered. Angular does not track model changes in real time; instead, on every digest cycle, it runs through every watch to compare the previous and new values of the model to detect changes. A digest cycle is triggered when $scope.$apply is invoked. 
A number of directives and services internally invoke $scope.$apply: directives such as ng-click and ng-mouse* do it on user action; services such as $http and $resource do it when a response is received from the server; $timeout and $interval call $scope.$apply when they lapse. A digest cycle tracks the old value of the watched expression and compares it with the new value to detect if the model has changed. Simply put, the digest cycle is a workflow used to detect model changes. A digest cycle runs multiple times until the model data is stable and no watch is triggered. Once you have a clear understanding of the digest cycle, watches, and scopes, we can look at some performance guidelines that can help us manage views as they start to grow. Performance guidelines When building any Angular app, any performance optimization boils down to: minimizing the number of binding expressions and hence watches; making sure that binding expression evaluation is quick; and optimizing the number of digest cycles that take place. The next few sections provide some useful pointers in this direction. Remember, a lot of these optimizations may only be necessary if the view is large. Keeping the page/view small The sanest advice is to keep the amount of content available on a page small. The user cannot interact with or process too much data on one page, so remember that screen real estate is at a premium and only keep necessary details on a page. The less content there is, the fewer binding expressions there are; hence, fewer watches and less processing are required during the digest cycle. Remember, each watch adds to the overall execution time of the digest cycle. The time required for a single watch can be insignificant but, with hundreds or maybe thousands of them combined, it starts to matter. Angular's data binding infrastructure is insanely fast and relies on a rudimentary dirty check that compares the old and the new values. 
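The dirty check described above can be illustrated with a toy model. This is a simplified sketch written for this article, not Angular's implementation: createScope, watch, and digest are hypothetical names, values are compared by reference only, and the limit of 10 passes is an arbitrary guard against watches that never settle.

```javascript
// Toy scope: stores watchers and re-evaluates all of them on every pass.
function createScope() {
  var watchers = [];

  return {
    // Register an expression (a function) and a listener to run on change.
    watch: function (getValue, listener) {
      watchers.push({ get: getValue, listener: listener, last: undefined, seen: false });
    },

    // Run passes until no watcher reports a change; return the pass count.
    digest: function () {
      var passes = 0;
      var dirty;
      do {
        dirty = false;
        passes += 1;
        if (passes > 10) {
          throw new Error("digest did not stabilize after 10 passes");
        }
        watchers.forEach(function (w) {
          var current = w.get();
          if (!w.seen || current !== w.last) {
            var previous = w.last;
            w.last = current; // remember the value for the next comparison
            w.seen = true;
            w.listener(current, previous);
            dirty = true; // a change may affect other watches, so loop again
          }
        });
      } while (dirty);
      return passes;
    }
  };
}
```

Even in this toy version, every digest evaluates every watched expression at least once, and at least twice whenever something changed. That is why both the number of watches and the cost of each expression matter.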
Check out the Stack Overflow (SO) post (http://stackoverflow.com/questions/9682092/databinding-in-angularjs), where Misko Hevery (creator of Angular) talks about how data binding works in Angular. Data binding also adds to the memory footprint of the application. Each watch has to track the current and previous value of a data-binding expression to compare and verify if data has changed. Keeping a page/view small may not always be possible, and the view may grow. In such a case, we need to make sure that the number of bindings does not grow exponentially (linear growth is OK) with the page size. The next two tips can help minimize the number of bindings in the page and should be seriously considered for large views. Optimizing watches for read-once data In any Angular view, there is always content that, once bound, does not change. Any read-only data on the view can fall into this category. This implies that once the data is bound to the view, we no longer need watches to track model changes, as we don't expect the model to update. Is it possible to remove the watch after one-time binding? Before version 1.3, Angular had nothing built in for this, and the community project bindonce (https://github.com/Pasvaz/bindonce) filled the gap. Angular 1.3 added support for bind-and-forget in the framework itself. Using the syntax {{::title}}, we can achieve one-time binding. If you are on Angular 1.3, use it! Hiding (ng-show) versus conditional rendering (ng-if/ng-switch) content You have learned two ways to conditionally render content in Angular. The ng-show/ng-hide directives show or hide the DOM element based on the expression provided, whereas ng-if/ng-switch create and destroy the DOM based on an expression. For some scenarios, ng-if can be really beneficial as it can reduce the number of binding expressions/watches for the DOM content not rendered. 
Consider the following example: <div ng-if='user.isAdmin'>   <div ng-include="'admin-panel.html'"></div></div> The snippet renders an admin panel if the user is an admin. With ng-if, if the user is not an admin, the ng-include directive template is neither requested nor rendered, saving us all the bindings and watches that are part of the admin-panel.html view. From the preceding discussion, it may seem that we should get rid of all ng-show/ng-hide directives and use ng-if. Well, not really! It again depends; for small pages, ng-show/ng-hide works just fine. Also, remember that there is a cost to creating and destroying the DOM. If the expression to show/hide flips too often, this will mean too many DOM create-and-destroy cycles, which are detrimental to the overall performance of the app. Expressions being watched should not be slow Since watches are evaluated very often, the expression being watched should return results fast. The first way we can make sure of this is by using properties instead of functions in binding expressions. These expressions are as follows: {{user.name}} and ng-show='user.Authorized' The preceding code is generally better than this: {{getUserName()}} and ng-show='isUserAuthorized(user)' Try to minimize function expressions in bindings. If a function expression is required, make sure that the function returns a result quickly. Make sure a function being watched does not: make any remote calls; use $timeout/$interval; perform sorting/filtering; perform DOM manipulation (this can happen inside a directive implementation); or perform any other time-consuming operation. Be sure to avoid such operations inside a bound function. To reiterate, Angular will evaluate a watched expression multiple times during every digest cycle just to know if the return value (a model) has changed and the view needs to be synchronized. 
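The cost difference between a function binding and a property binding can be seen with a simple counter. The sketch below is illustrative only and does not use Angular: runDigests is a hypothetical stand-in that evaluates a bound expression once per pass, and the user fields are made up for the example.

```javascript
// Counts how often the bound function body actually runs.
var evaluations = 0;

function getUserName(user) {
  evaluations += 1;
  return user.firstName + " " + user.lastName;
}

// Hypothetical stand-in for repeated digests: one evaluation per pass.
function runDigests(expression, passes) {
  var value;
  for (var i = 0; i < passes; i += 1) {
    value = expression();
  }
  return value;
}

var user = { firstName: "Ada", lastName: "Lovelace" };

// Function binding: the function body runs on every pass.
runDigests(function () { return getUserName(user); }, 100);
var functionBindingEvaluations = evaluations;

// Property binding: compute once when the model changes, then only read it.
evaluations = 0;
user.name = getUserName(user);
var shown = runDigests(function () { return user.name; }, 100);
var propertyBindingEvaluations = evaluations;
```

Here the function body runs 100 times in the first case and once in the second. For a cheap string concatenation this hardly matters, but for a sort or a filter over a large list the same ratio applies.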
Minimizing the deep model watch When using $scope.$watch to watch for model changes in controllers, be careful while setting the third $watch function parameter to true. The general syntax of watch looks like this: $watch(watchExpression, listener, [objectEquality]); In the standard scenario, Angular does an object comparison based on the reference only. But if objectEquality is true, Angular does a deep comparison between the last value and new value of the watched expression. This can have an adverse memory and performance impact if the object is large. Handling large datasets with ng-repeat The ng-repeat directive undoubtedly is the most useful directive Angular has. But it can cause the most performance-related headaches. The reason is not because of the directive design, but because it is the only directive that allows us to generate HTML on the fly. There is always the possibility of generating enormous HTML just by binding ng-repeat to a big model list. Some tips that can help us when working with ng-repeat are: Page data and use limitTo: Implement a server-side paging mechanism when a number of items returned are large. Also use the limitTo filter to limit the number of items rendered. Its syntax is as follows: <tr ng-repeat="user in users |limitTo:pageSize">…</tr> Look at modules such as ngInfiniteScroll (http://binarymuse.github.io/ngInfiniteScroll/) that provide an alternate mechanism to render large lists. Use the track by expression: The ng-repeat directive for performance tries to make sure it does not unnecessarily create or delete HTML nodes when items are added, updated, deleted, or moved in the list. To achieve this, it adds a $$hashKey property to every model item allowing it to associate the DOM node with the model item. We can override this behavior and provide our own item key using the track by expression such as: <tr ng-repeat="user in users track by user.id">…</tr> This allows us to use our own mechanism to identify an item. 
Using your own track by expression has a distinct advantage over the default hash key approach. Consider an example where you make an initial AJAX call to get users: $scope.getUsers().then(function(users){ $scope.users = users;}) Later, you refresh the data from the server and assign it again in a similar way: $scope.users = users; With user.id as a key, Angular is able to determine which elements were added, deleted, or moved, and it creates or deletes DOM nodes only for those elements. The remaining elements are not touched by ng-repeat (their internal bindings are still evaluated). This saves a lot of CPU cycles for the browser as fewer DOM elements are created and destroyed. Do not bind ng-repeat to a function expression: Using a function's return value for ng-repeat can also be problematic, depending upon how the function is implemented. Consider a repeat with this: <tr ng-repeat="user in getUsers()">…</tr> And consider the controller getUsers function with this: $scope.getUsers = function() {   var orderBy = $filter('orderBy');   return orderBy($scope.users, predicate);} Angular is going to evaluate this expression and hence call this function every time the digest cycle takes place. A lot of CPU cycles are wasted sorting user data again and again. It is better to use scope properties and presort the data before binding. Minimize filters in views, use filter elements in the controller: Filters defined on ng-repeat are also evaluated every time the digest cycle takes place. For large lists, if the same filtering can be implemented in the controller, we can avoid constant filter evaluation. This holds true for any filter function that is used with arrays, including filter and orderBy. Avoiding mouse-movement tracking events The ng-mousemove, ng-mouseenter, ng-mouseleave, and ng-mouseover directives can just kill performance. 
If an expression is attached to any of these event directives, Angular triggers a digest cycle every time the corresponding event occurs, and for events like mouse move, this can be a lot. We have already seen this behavior when working with 7 Minute Workout, when we tried to show a pause overlay on the exercise image when the mouse hovers over it. Avoid them at all costs. If we just want to trigger some style changes on mouse events, CSS is a better tool. Avoiding calling $scope.$apply Angular is smart enough to call $scope.$apply at appropriate times without us explicitly calling it. This can be confirmed from the fact that the only place we have seen and used $scope.$apply is within directives. The ng-click and updateOnBlur directives use $scope.$apply to transition from a DOM event handler execution to an Angular execution context. Even when wrapping a jQuery plugin, we may need to make a similar transition for an event raised by the jQuery plugin. Other than this, there is no reason to use $scope.$apply. Remember, every invocation of $apply results in the execution of a complete digest cycle. The $timeout and $interval services take a Boolean argument invokeApply. If it is set to false, the lapsed $timeout/$interval does not call $scope.$apply or trigger a digest cycle. Therefore, if you are going to perform background operations that do not require $scope and the view to be updated, set the last argument to false. Always use Angular wrappers such as $timeout and $interval over the standard JavaScript functions to avoid manually calling $scope.$apply. These wrapper functions internally call $scope.$apply. Also, understand the difference between $scope.$apply and $scope.$digest. $scope.$apply triggers $rootScope.$digest, which evaluates all application watches, whereas $scope.$digest only performs dirty checks on the current scope and its children. 
If we are sure that the model changes are not going to affect anything other than the child scopes, we can use $scope.$digest instead of $scope.$apply. Lazy-loading, minification, and creating multiple SPAs I hope you are not assuming that the apps that we have built will continue to use the numerous small script files that we have created to separate modules and module artefacts (controllers, directives, filters, and services). Any modern build system has the capability to concatenate and minify these files and replace the original file reference with a unified and minified version. Therefore, like any JavaScript library, use minified script files for production. The problem with the Angular bootstrapping process is that it expects all Angular application scripts to be loaded before the application can bootstrap. We cannot load modules, controllers, or in fact, any of the other Angular constructs on demand. This means we need to provide every artefact required by our app, upfront. For small applications, this is not a problem as the content is concatenated and minified; also, the Angular application code itself is far more compact as compared to the traditional JavaScript of jQuery-based apps. But, as the size of the application starts to grow, it may start to hurt when we need to load everything upfront. There are at least two possible solutions to this problem; the first one is about breaking our application into multiple SPAs. Breaking applications into multiple SPAs This advice may seem counterintuitive as the whole point of SPAs is to get rid of full page loads. By creating multiple SPAs, we break the app into multiple small SPAs, each supporting parts of the overall app functionality. When we say app, it implies a combination of the main (such as index.html) page with ng-app and all the scripts/libraries and partial views that the app loads over time. 
For example, we can break the Personal Trainer application into a Workout Builder app and a Workout Runner app. Both have their own start up page and scripts. Common scripts such as the Angular framework scripts and any third-party libraries can be referenced in both the applications. On similar lines, common controllers, directives, services, and filters too can be referenced in both the apps. The way we have designed Personal Trainer makes it easy to achieve our objective. The segregation into what belongs where has already been done. The advantage of breaking an app into multiple SPAs is that only relevant scripts related to the app are loaded. For a small app, this may be an overkill but for large apps, it can improve the app performance. The challenge with this approach is to identify what parts of an application can be created as independent SPAs; it totally depends upon the usage pattern of the application. For example, assume an application has an admin module and an end consumer/user module. Creating two SPAs, one for admin and the other for the end customer, is a great way to keep user-specific features and admin-specific features separate. A standard user may never transition to the admin section/area, whereas an admin user can still work on both areas; but transitioning from the admin area to a user-specific area will require a full page refresh. If breaking the application into multiple SPAs is not possible, the other option is to perform the lazy loading of a module. Lazy-loading modules Lazy-loading modules or loading module on demand is a viable option for large Angular apps. But unfortunately, Angular itself does not have any in-built support for lazy-loading modules. Furthermore, the additional complexity of lazy loading may be unwarranted as Angular produces far less code as compared to other JavaScript framework implementations. Also once we gzip and minify the code, the amount of code that is transferred over the wire is minimal. 
If we still want to try our hands on lazy loading, there are two libraries that can help: ocLazyLoad (https://github.com/ocombe/ocLazyLoad): This is a library that uses script.js to load modules on the fly angularAMD (http://marcoslin.github.io/angularAMD): This is a library that uses require.js to lazy load modules With lazy loading in place, we can delay the loading of a controller, directive, filter, or service script, until the page that requires them is loaded. The overall concept of lazy loading seems to be great but I'm still not sold on this idea. Before we adopt a lazy-load solution, there are things that we need to evaluate: Loading multiple script files lazily: When scripts are concatenated and minified, we load the complete app at once. Contrast it to lazy loading where we do not concatenate but load them on demand. What we gain in terms of lazy-load module flexibility we lose in terms of performance. We now have to make a number of network requests to load individual files. Given these facts, the ideal approach is to combine lazy loading with concatenation and minification. In this approach, we identify those feature modules that can be concatenated and minified together and served on demand using lazy loading. For example, Personal Trainer scripts can be divided into three categories: The common app modules: This consists of any script that has common code used across the app and can be combined together and loaded upfront The Workout Runner module(s): Scripts that support workout execution can be concatenated and minified together but are loaded only when the Workout Runner pages are loaded. The Workout Builder module(s): On similar lines to the preceding categories, scripts that support workout building can be combined together and served only when the Workout Builder pages are loaded. As we can see, there is a decent amount of effort required to refactor the app in a manner that makes module segregation, concatenation, and lazy loading possible. 
The effect on unit and integration testing: We also need to evaluate the effect of lazy-loading modules on unit and integration testing. The way we test is also affected with lazy loading in place. This implies that, if lazy loading is added as an afterthought, the test setup may require tweaking to make sure existing tests still run. Given these facts, we should evaluate our options and check whether we really need lazy loading or whether we can manage by breaking a monolithic SPA into multiple smaller SPAs. Caching remote data wherever appropriate Caching data is one of the oldest tricks to improve any webpage/application performance. Analyze your GET requests and determine what data can be cached. Once such data is identified, it can be cached in a number of locations. Data cached outside the app can be cached in: Servers: The server can cache repeated GET requests to resources that do not change very often. This whole process is transparent to the client and the implementation depends on the server stack used. Browsers: In this case, the browser caches the response. Browser caching depends upon the server sending HTTP cache headers such as ETag and Cache-Control to guide the browser about how long a particular resource can be cached. Browsers can honor these cache headers and cache data appropriately for future use. If server and browser caching is not available, or if we also want to incorporate some caching in the client app, we do have some choices: Cache data in memory: A simple Angular service can cache the HTTP response in memory. Since an Angular app is an SPA, the data is not lost unless the page refreshes. 
This is how a service function looks when it caches data: var workouts;service.getWorkouts = function () {   if (workouts) return $q.resolve(workouts);   return $http.get("/workouts").then(function (response){       workouts = response.data;       return workouts;   });}; The implementation caches a list of workouts into the workouts variable for future use. The first request makes an HTTP call to retrieve data, but subsequent requests just return the cached data wrapped in a promise. The usage of $q.resolve makes sure that the function always returns a promise. Angular $http cache: Angular's $http service comes with a configuration option cache. When set to true, $http caches the response of the particular GET request into a local cache (again an in-memory cache). Here is how we cache a GET request: $http.get(url, { cache: true}); Angular keeps this cache for the lifetime of the app, and clearing it is not easy. We need to get hold of the cache dedicated to caching HTTP responses and clear the cache key manually. The caching strategy of an application is never complete without a cache invalidation strategy. With a cache, there is always a possibility that the cached copy is out of sync with the actual data store. We cannot affect the server-side caching behavior from the client; consequently, let's focus on how to perform cache invalidation (clearing) for the two client-side caching mechanisms described earlier. If we use the first approach to cache data, we are responsible for clearing the cache ourselves. In the case of the second approach, the default $http service does not support clearing the cache. 
We either need to get hold of the underlying $http cache store and clear the cache key manually (as shown here) or implement our own cache that manages cache data and invalidates cache based on some criteria: var cache = $cacheFactory.get('$http');cache.remove("http://myserver/workouts"); //full url Using Batarang to measure performance Batarang (a Chrome extension), as we have already seen, is an extremely handy tool for Angular applications. Using Batarang to visualize app usage is like looking at an X-Ray of the app. It allows us to: View the scope data, scope hierarchy, and how the scopes are linked to HTML elements Evaluate the performance of the application Check the application dependency graph, helping us understand how components are linked to each other, and with other framework components. If we enable Batarang and then play around with our application, Batarang captures performance metrics for all watched expressions in the app. This data is nicely presented as a graph available on the Performance tab inside Batarang: That is pretty sweet! When building an app, use Batarang to gauge the most expensive watches and take corrective measures, if required. Play around with Batarang and see what other features it has. This is a very handy tool for Angular applications. This brings us to the end of the performance guidelines that we wanted to share in this article. Some of these guidelines are preventive measures that we should take to make sure we get optimal app performance whereas others are there to help when the performance is not up to the mark. Summary In this article, we looked at the ever-so-important topic of performance, where you learned ways to optimize an Angular app performance. Resources for Article: Further resources on this subject: Role of AngularJS [article] The First Step [article] Recursive directives [article]

Packt
04 Mar 2015
21 min read

Working with VMware Infrastructure

In this article by Daniel Langenhan, the author of VMware vRealize Orchestrator Cookbook, we will take a closer look at how Orchestrator interacts with vCenter Server and vRealize Automation (vRA, formerly known as vCloud Automation Center, vCAC). vRA uses Orchestrator to access and automate infrastructure using Orchestrator plugins. We will take a look at how to make Orchestrator workflows available to vRA. We will investigate the following recipes:

  • Unmounting all the CD-ROMs of all VMs in a cluster
  • Provisioning a VM from a template
  • An approval process for VM provisioning

There are quite a lot of plugins for Orchestrator to interact with VMware infrastructure and programs:

  • vCenter Server
  • vCloud Director (vCD)
  • vRealize Automation (vRA, formerly known as vCloud Automation Center, vCAC)
  • Site Recovery Manager (SRM)
  • VMware Auto Deploy
  • Horizon (View and Virtual Desktops)
  • vRealize Configuration Manager (earlier known as vCenter Configuration Manager)
  • vCenter Update Manager
  • vCenter Operation Manager, vCOPS (only example packages)

VMware, as of the writing of this article, is still renaming its products. An overview of all plugins, their names, and download links can be found at http://www.vcoteam.info/links/plug-ins.html.

We will not be able to cover all of these plugins, so we will focus on the one that is most used, vCenter. Sadly, vCloud Director is earmarked by VMware to disappear for everyone but service providers, so there is no real need to show any workflow for it. We will also work with vRA and see how it interacts with Orchestrator.

vSphere automation

The interaction between Orchestrator and vCenter is done using the vCenter API. Here is the explanation of the interaction, which you can refer to in the following figure.
A user starts an Orchestrator workflow (1) either interactively (via the vSphere Web Client, the Orchestrator Web Operator, or the Orchestrator Client) or via the API. The workflow in Orchestrator will then send a job (2) to vCenter and receive a task ID back (type VC:Task). vCenter will then start enacting the job (3). Using the vim3WaitTaskEnd action (4), Orchestrator pauses until the task has been completed. If we do not use the wait task, we can't be certain whether the task has finished or failed. It is extremely important to use the vim3WaitTaskEnd action whenever we send a job to vCenter. When the wait task reports that the job has finished, the workflow will be marked as finished.

The vCenter MoRef

The MoRef (Managed Object Reference) is a unique ID for every object inside vCenter. MoRefs are basically strings; some examples are shown here:

Object | Example MoRef
VM | vm-301
Network | network-312, dvportgroup-242
Datastore | datastore-101
ESXi host | host-44
Data center | datacenter-21
Cluster | domain-c41

The MoRefs are typically stored in the attribute .id or .key of the Orchestrator API object. For example, the MoRef of a vSwitch Network is VC:Network.id. To browse for MoRefs, you can use the Managed Object Browser (MOB), documented at https://pubs.vmware.com/vsphere-55/index.jsp#com.vmware.wssdk.pg.doc/PG_Appx_Using_MOB.20.1.html.

The vim3WaitTaskEnd action

As already said, vim3WaitTaskEnd is one of the most central actions while interacting with vCenter. The action has the following variables:

Category | Name | Type | Usage
IN | vcTask | VC:Task | Carries the reconfiguration task from the script to the wait task
IN | progress | Boolean | Writes the progress of a task, as a percentage, to the logs
IN | pollRate | Number | How often the action checks vCenter for task completion
OUT | ActionResult | Any | Returns the task's result

The wait task checks at regular intervals (pollRate) the status of a task that has been submitted to vCenter.
The task can have the following states:

State | Meaning
Queued | The task is queued and will be executed as soon as possible.
Running | The task is currently running. If progress is set to true, the progress in percentage will be displayed in the logs.
Success | The task has finished successfully.
Error | The task has failed and an error will be thrown.

Other vCenter wait actions

There are actually five waiting tasks that come with the vCenter Server plugin. Here's an overview of the other four:

Task | Description
vim3WaitToolsStarted | Waits until the VMware tools are started on a VM or until a timeout is reached.
vim3WaitForPrincipalIP | Waits until the VMware tools report the primary IP of a VM or until a timeout is reached. This typically indicates that the operating system is ready to receive network traffic. The action returns the primary IP.
vim3WaitDnsNameInTools | Waits until the VMware tools report a given DNS name of a VM or until a timeout is reached. The in-parameter addNumberToName is not used and can be set to Null.
WaitTaskEndOrVMQuestion | Waits until a task is finished or until a VM raises a question. A vCenter question is related to user interaction.

vRealize Automation (vRA)

Automation has changed since the beginning of Orchestrator. Before tools such as vCloud Director or vCloud Automation Center (vCAC)/vRealize Automation (vRA) existed, Orchestrator was the main tool for automating vCenter resources. With version 6.2, vCloud Automation Center (vCAC) was renamed vRealize Automation.

vRA is now set to become the cornerstone of the VMware automation effort. vRealize Orchestrator (vRO) is used by vRA to interact with and automate VMware and non-VMware products and infrastructure elements.

Throughout the various vCAC/vRA iterations, the role of Orchestrator has changed substantially. Orchestrator started off as an extension to vCAC and became a central part of vRA.
  • In vCAC 5.x, Orchestrator was only an extension of the IaaS life cycle. Orchestrator was tied in using the stubs.
  • vCAC 6.0 integrated Orchestrator as an XaaS service (Everything as a Service) using the Advanced Service Designer (ASD).
  • In vCAC 6.1, Orchestrator is used to perform all VMware NSX operations (VMware's new network virtualization and automation), meaning that it became even more of a central part of the IaaS services.
  • With vCAC 6.2, the Advanced Service Designer (ASD) was enhanced to allow more complex form designs, allowing better leverage of Orchestrator workflows.

As you can see in the following figure, vRA connects to the vCenter Server using an infrastructure endpoint that allows vRA to conduct basic infrastructure actions, such as power operations, cloning, and so on. It doesn't allow any complex interactions with the vSphere infrastructure, such as HA configurations.

Using the Advanced Service Endpoints, vRA integrates the Orchestrator (vRO) plugins as additional services. This allows vRA to offer the entire plugin infrastructure as services. The vCenter Server, AD, and PowerShell plugins are typical integrations that are used with vRA.

Using the Advanced Service Designer (ASD), you can create integrations that use Orchestrator workflows. ASD allows you to offer Orchestrator workflows as vRA catalog items, making it possible for tenants to access any IT service that can be configured with Orchestrator via its plugins.

The following diagram shows an example using the Active Directory plugin. The Orchestrator plugin provides access to the AD services. By creating a custom resource using the exposed AD infrastructure, we can create a service blueprint and resource actions, both of which are based on Orchestrator workflows that use the AD plugin.

The other method of integrating Orchestrator into the IaaS life cycle, which was predominantly used in vCAC 5.x, was to use the stubs.
The build process of a VM has several steps; each step can be assigned a customizable workflow (called a stub). You can configure vRA to run an Orchestrator workflow at these stubs in order to perform customized actions. Such actions could change the VM's HA or DRS configuration, or use the guest integration to install or configure a program on a VM.

Installation

How to install and configure vRA is out of the scope of this article, but take a look at http://www.kendrickcoleman.com/index.php/Tech-Blog/how-to-install-vcloud-automation-center-vcac-60-part-1-identity-appliance.html for more information. If you don't have the hardware or the time to install vRA yourself, you can use the VMware Hands-on Labs, which can be accessed after clicking on Try for Free at http://hol.vmware.com.

The vRA Orchestrator plugin

Due to the renaming, the vRA plugin is called vRealize Orchestrator vRA Plug-in 6.2.0; however, the file you download and use is named o11nplugin-vcac-6.2.0-2287231.vmoapp. The plugin currently creates a workflow folder called vCloud Automation Center.

vRA-integrated Orchestrator

The vRA appliance comes with an installed and configured vRO instance; however, the best practice for a production environment is to use a dedicated Orchestrator installation or, even better, an Orchestrator cluster.

Dynamic Types or XaaS

XaaS means Everything (X) as a Service. The introduction of Dynamic Types in Orchestrator Version 5.5.1 does exactly that; it allows you to build your own plugins and interact with infrastructure that has not yet received its own plugin. Take a look at this article by Christophe Decanini; it integrates Twitter with Orchestrator using Dynamic Types at http://www.vcoteam.info/articles/learn-vco/282-dynamic-types-tutorial-implement-your-own-twitter-plug-in-without-any-scripting.html.

Read more…

To read more about Orchestrator integration with vRA, please take a look at the official VMware documentation.
Please note that the official documentation you need is about vRealize Automation, not vCloud Automation Center; as of writing this article, it can be found at https://www.vmware.com/support/pubs/vrealize-automation-pubs.html.

  • The document called Advanced Service Design deals with vRO and Advanced Service Designer
  • The document called Machine Extensibility discusses customization using stubs

Unmounting all the CD-ROMs of all VMs in a cluster

This is an easy recipe to start with, but one that you can put to real use in your existing infrastructure. The workflow will unmount all CD-ROMs from every running VM in a cluster. A mounted CD-ROM may block a VM from being vMotioned.

Getting ready

We need a VM that can mount a CD-ROM either as an ISO from a host or from the client. Before you start the workflow, make sure that the VM is powered on and has an ISO connected to it.

How to do it...

  • Create a new workflow with the following variables:

Name | Type | Section | Use
cluster | VC:ClusterComputeResource | IN | Used to input the cluster
clusterVMs | Array of VC:VirtualMachine | Attribute | Used to capture all VMs in a cluster

  • Add the getAllVMsOfCluster action to the schema and assign the cluster in-parameter and the clusterVMs attribute to it as actionResult.
  • Now, add a Foreach element to the schema and assign the workflow Disconnect all detachable devices from a running virtual machine.
  • Assign the Foreach element clusterVMs as a parameter.
  • Save and run the workflow.

How it works...

This recipe shows how quickly and easily you can design solutions that help you with everyday vCenter problems. The problem is that VMs that have CD-ROMs or floppies mounted may experience problems using vMotion, making it impossible for them to be used with DRS. The reality is that a lot of admins mount CD-ROMs and then forget to disconnect them.
Scheduling this workflow every evening just before the nighttime backups will make sure that a production cluster is able to make full use of DRS and is therefore better load-balanced. You can improve this workflow by integrating an exclusion list.

See also

Refer to the example workflow, 7.01 UnMount CD-ROM from Cluster.

Provisioning a VM from a template

In this recipe, we will build a deployment workflow for Windows and Linux VMs. We will learn how to create workflows and reduce the number of input variables.

Getting ready

We need a Linux or Windows template that we can clone and provision.

How to do it…

We have split this recipe into two sections. In the first section, we will create a configuration element, and in the second, we will create the workflow.

Creating a configuration

We will use a configuration for all reusable variables. Build a configuration element that contains the following items:

Name | Type | Use
productId | String | The Windows product ID (the licensing code)
joinDomain | String | The Windows domain FQDN to join
domainAdmin | Credential | The credentials to join the domain
licenseMode | VC:CustomizationLicenseDataMode | For example, perServer
licenseUsers | Number | The number of licensed concurrent users
inTimezone | Enums:MSTimeZone | Time zone
fullName | String | Full name of the user
orgName | String | Organization name
newAdminPassword | String | New admin password
dnsServerList | Array of String | List of DNS servers
dnsDomain | String | DNS domain
gateway | Array of String | List of gateways

Creating the base workflow

Now we will create the base workflow:

  • Create the workflow as shown in the following figure by adding the given elements:
      Clone, Windows with single NIC and credential
      Clone, Linux with single NIC
      Custom decision
  • Use the Clone, Windows… workflow to create all variables. Link up the ones that you have defined in the configuration as attributes.
The rest are defined as follows:

Name | Type | Section | Use
vmName | String | IN | The new virtual machine's name
vm | VC:VirtualMachine | IN | The virtual machine to clone
folder | VC:VmFolder | IN | The virtual machine folder
datastore | VC:Datastore | IN | The datastore in which you store the virtual machine
pool | VC:ResourcePool | IN | The resource pool in which you create the virtual machine
network | VC:Network | IN | The network to which you attach the virtual network interface
ipAddress | String | IN | The fixed valid IP address
subnetMask | String | IN | The subnet mask
template | Boolean | Attribute | Value No; controls whether the new VM is marked as a template
powerOn | Boolean | Attribute | Value Yes; controls whether the VM is powered on after creation
doSysprep | Boolean | Attribute | Value Yes; controls whether Windows Sysprep is run
dhcp | Boolean | Attribute | Value No; controls whether DHCP is used
newVM | VC:VirtualMachine | OUT | The newly created VM

  • The following sub-workflow in-parameters will be set to special values:

Workflow | In-parameter | Value
Clone, Windows with single NIC and credential | host | Null
Clone, Windows with single NIC and credential | joinWorkgroup | Null
Clone, Windows with single NIC and credential | macAddress | Null
Clone, Windows with single NIC and credential | netBIOS | Null
Clone, Windows with single NIC and credential | primaryWINS | Null
Clone, Windows with single NIC and credential | secondaryWINS | Null
Clone, Windows with single NIC and credential | name | vmName
Clone, Windows with single NIC and credential | clientName | vmName
Clone, Linux with single NIC | host | Null
Clone, Linux with single NIC | macAddress | Null
Clone, Linux with single NIC | name | vmName
Clone, Linux with single NIC | clientName | vmName

  • Define the in-parameter vm as input for the Custom decision and add the following script. The script checks whether the name of the OS contains the word Microsoft:

var guestOS = vm.config.guestFullName;
System.log(guestOS);
if (guestOS.indexOf("Microsoft") >= 0) {
    return true;
} else {
    return false;
}

  • Save and run the workflow.

This workflow will now create a new VM from an existing VM and customize it with a fixed IP.

How it works…

As you can see, creating workflows to automate vCenter deployments is pretty straightforward. Dealing with the various in-parameters of workflows, however, can be quite overwhelming. The best way to deal with this problem is to hide away variables by defining them centrally using a configuration, or to define them locally as attributes.
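The idea of hiding variables behind a central configuration can be sketched outside Orchestrator in plain JavaScript. This is only an analogy under stated assumptions: Orchestrator configuration elements are not JavaScript objects, the function buildCloneParameters and the DEV and PROD objects are hypothetical, and all values are placeholders. Only the key names dnsDomain, dnsServerList, and gateway are taken from the configuration table in this recipe.

```javascript
// Hypothetical stand-ins for configuration elements, one per environment.
// The values are placeholders and do not come from the article.
var configurations = {
  DEV: {
    dnsDomain: "dev.example.com",
    dnsServerList: ["192.0.2.10"],
    gateway: ["192.0.2.1"]
  },
  PROD: {
    dnsDomain: "prod.example.com",
    dnsServerList: ["198.51.100.10"],
    gateway: ["198.51.100.1"]
  }
};

// Merge the centrally defined values with the few inputs a user still supplies.
function buildCloneParameters(environment, inputs) {
  var config = configurations[environment];
  if (!config) {
    throw new Error("Unknown environment: " + environment);
  }
  var parameters = {};
  // Copy the configuration first, then the inputs, so inputs can override.
  Object.keys(config).forEach(function (key) {
    parameters[key] = config[key];
  });
  Object.keys(inputs).forEach(function (key) {
    parameters[key] = inputs[key];
  });
  return parameters;
}

// The caller only provides what differs per request.
var parameters = buildCloneParameters("DEV", { vmName: "web01", ipAddress: "192.0.2.50" });
console.log(parameters.vmName + " uses DNS domain " + parameters.dnsDomain);
```

The point of the sketch is the shape of the call: the requester supplies two values, and everything else comes from the environment-specific configuration.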
Using configurations has the advantage that you can create them once and reuse them as needed. You can even push the concept a bit further by defining multiple configurations for multiple purposes, such as different environments.

While creating a new workflow for automation, a typical approach is as follows:

  • Look for a workflow that you need.
  • Run the workflow normally to check out what it actually does.
  • Either create a new workflow that uses the original or duplicate and edit the one you tried, modifying it until it does what you want.

A fast way to deal with a lot of variables is to drag every element you need into the schema and then use the binding to create the variables as needed.

You may have noticed that this workflow only lets you select vSwitch networks, not distributed vSwitch networks. You can improve this workflow with the following features:

  • Read the existing Sysprep information stored in your vCenter Server
  • Generate different predefined configurations (for example, DEV or Prod)

There's more...

We can improve the workflow by implementing the ability to change the vCPU and the memory of the VM. Follow these steps to implement it:

  • Move the out-parameter newVM to be an attribute.
  • Add the following variables:

Name | Type | Section | Use
vCPU | Number | IN | The number of vCPUs
Memory | Number | IN | The amount of VM memory
vcTask | VC:Task | Attribute | Carries the reconfiguration task from the script to the wait task
progress | Boolean | Attribute | Value No; used by vim3WaitTaskEnd
pollRate | Number | Attribute | Value 5; used by vim3WaitTaskEnd
ActionResult | Any | Attribute | Used by vim3WaitTaskEnd

  • Add the following actions and workflows according to the next figure:
      shutdownVMAndForce
      changeVMvCPU
      vim3WaitTaskEnd
      changeVMRAM
      Start virtual machine
  • Bind newVM to all the appropriate input parameters of the added actions and workflows.
  • Bind the actionResult (VC:Task) of each change action to the vim3WaitTaskEnd action that follows it.
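To make the role of the wait action in these steps more concrete, here is a minimal plain-JavaScript sketch of the polling idea: check a task's state repeatedly until it reports success or error. It is not the code of vim3WaitTaskEnd. The functions waitForTask and makeFakeTask, the task object with its getState method, and the maxPolls limit are all hypothetical and exist only for illustration; the state names follow the states described for the wait task earlier in this article.

```javascript
// Sketch of a polling loop in the spirit of a wait action.
// `task` is a hypothetical object: getState() returns "queued", "running",
// "success", or "error". This is NOT the vCenter plugin's API.
function waitForTask(task, maxPolls, log) {
  for (var poll = 1; poll <= maxPolls; poll++) {
    var state = task.getState();
    if (state === "success") {
      return task.result; // hand the task's result back to the caller
    }
    if (state === "error") {
      throw new Error("Task failed: " + task.errorMessage);
    }
    // "queued" or "running": optionally report, then check again.
    if (log) {
      log("poll " + poll + ": task is " + state);
    }
    // A real wait action would pause between polls; this sketch does not sleep.
  }
  throw new Error("Task did not finish within " + maxPolls + " polls");
}

// Hypothetical fake task that walks through a fixed list of states
// and then keeps reporting the last one.
function makeFakeTask(states, result) {
  var index = 0;
  return {
    result: result,
    errorMessage: "simulated failure",
    getState: function () {
      var state = states[Math.min(index, states.length - 1)];
      index++;
      return state;
    }
  };
}

var cpuChange = makeFakeTask(["queued", "running", "success"], "reconfigured");
console.log(waitForTask(cpuChange, 10, console.log));
```

The sketch also shows why the wait step matters between changeVMvCPU and changeVMRAM: without it, the next step would start while the previous task might still be queued or running, and a failure would go unnoticed.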
See also

Refer to the example workflows, 7.02.1 Provision VM (Base) and 7.02.2 Provision VM (HW custom), as well as the configuration element, 7 VM provisioning.

An approval process for VM provisioning

In this recipe, we will see how to create a workflow that waits for an approver to approve the VM creation before provisioning it. We will learn how to combine mail and external events in a workflow to make it interact with different users.

Getting ready

For this recipe, we first need the provisioning workflow that we created in the Provisioning a VM from a template recipe. You can use the example workflow, 7.02.1 Provision VM (Base). Additionally, we need a functional e-mail system as well as a workflow to send e-mails. You can use the example workflow, 4.02.1 SendMail, as well as its configuration item, 4.2.1 Working with e-mail.

How to do it…

We will split this recipe into three parts. First, we will create a configuration element; then, we will create the workflow; and lastly, we will use a presentation to make the workflow usable.

Creating a configuration element

We will use a configuration for all reusable variables. Build a configuration element that contains the following items:

Name | Type | Use
templates | Array/VC:VirtualMachine | All the VMs that serve as templates
folders | Array/VC:VmFolder | All the VM folders that are targets for VM provisioning
networks | Array/VC:Network | All VM networks that are targets for VM provisioning
resourcePools | Array/VC:ResourcePool | All resource pools that are targets for VM provisioning
datastores | Array/VC:Datastore | All datastores that are targets for VM provisioning
daysToApproval | Number | The number of days the approval remains available
approver | String | The e-mail address of the approver

Please note that you also have to define or use the configuration elements for SendMail, as well as the Provision VM workflows.
You can use the examples contained in the example package.

Creating a workflow

  • Create a new workflow and add the following variables:

Name | Type | Section | Use
mailRequester | String | IN | The e-mail address of the requester
vmName | String | IN | The name of the new virtual machine
vm | VC:VirtualMachine | IN | The virtual machine to be cloned
folder | VC:VmFolder | IN | The virtual machine folder
datastore | VC:Datastore | IN | The datastore in which you store the virtual machine
pool | VC:ResourcePool | IN | The resource pool in which you create the virtual machine
network | VC:Network | IN | The network to which you attach the virtual network interface
ipAddress | String | IN | The fixed valid IP address
subnetMask | String | IN | The subnet mask
isExternalEvent | Boolean | Attribute | A value of true defines this event as external
mailApproverSubject | String | Attribute | The subject line of the mail sent to the approver
mailApproverContent | String | Attribute | The content of the mail sent to the approver
mailRequesterSubject | String | Attribute | The subject line of the mail sent to the requester when the VM is provisioned
mailRequesterContent | String | Attribute | The content of the mail sent to the requester when the VM is provisioned
mailRequesterDeclinedSubject | String | Attribute | The subject line of the mail sent to the requester when the VM is declined
mailRequesterDeclinedContent | String | Attribute | The content of the mail sent to the requester when the VM is declined
eventName | String | Attribute | The name of the external event
endDate | Date | Attribute | The end date for the wait for the external event
approvalSuccess | Boolean | Attribute | Indicates whether the VM has been approved

  • Now add all the attributes we defined in the configuration element and link them to the configuration.
  • Create the workflow as shown in the following figure by adding the given elements:
      Scriptable task
      4.02.1 SendMail (example workflow)
      Wait for custom event
      Decision
      Provision VM (example workflow)
  • Edit the scriptable task and bind the following variables to it:

In | Out
vmName | mailApproverSubject
ipAddress | mailApproverContent
mailRequester | mailRequesterSubject
template | mailRequesterContent
approver | mailRequesterDeclinedSubject
daysToApproval | mailRequesterDeclinedContent
 | eventName
 | endDate

  • Add the following script to the scriptable task:

//construct event name
eventName = "provision-" + vmName;
//add days to today for approval
var today = new Date();
var endDate = new Date(today);
endDate.setDate(today.getDate() + daysToApproval);
//construct external URL for approval
var myURL = new URL();
myURL = System.customEventUrl(eventName, false);
var externalURL = myURL.url;
//mail to approver
mailApproverSubject = "Approval needed: " + vmName;
mailApproverContent = "Dear Approver,\n the user " + mailRequester + " would like to provision a VM from template " + template.name + ".\n To approve please click here: " + externalURL;
//VM provisioned
mailRequesterSubject = "VM ready :" + vmName;
mailRequesterContent = "Dear Requester,\n the VM " + vmName + " has been provisioned and is now available under IP :" + ipAddress;
//declined
mailRequesterDeclinedSubject = "Declined :" + vmName;
mailRequesterDeclinedContent = "Dear Requester,\n the VM " + vmName + " has been declined by " + approver;

  • Bind the out-parameter of Wait for custom event to approvalSuccess.
  • Configure the Decision element with approvalSuccess as true.
  • Bind all the other variables to the workflow elements.

Improving with the presentation

We will now edit the workflow's presentation in order to make it workable for the requester.
To do so, follow the given steps:

  • Click on Presentation and follow the steps to alter the presentation, as seen in the following screenshot:
  • Add the following properties to the in-parameters:

In-parameter | Property | Value
template | Predefined list of elements | #templates
folder | Predefined list of elements | #folders
datastore | Predefined list of elements | #datastores
pool | Predefined list of elements | #resourcePools
network | Predefined list of elements | #networks

  • You can now use the General tab of each in-parameter to change the displayed text.
  • Save and close the workflow.

How it works…

This is a very simplified example of an approval workflow to create VMs. The aim of this recipe is to introduce you to the method and ideas of how to build such a workflow.

This workflow will only give a requester the choices that are configured in the configuration element, making the workflow quite safe for users who have only limited knowledge of the IT environment.

When the requester submits the workflow, an e-mail is sent to the approver. The e-mail contains a link, which, when clicked, triggers the external event and approves the VM. If the VM is approved, it will get provisioned, and when the provisioning has finished, an e-mail is sent to the requester stating that the VM is now available. If the VM is not approved within a certain timeframe, the requester will receive an e-mail stating that the VM was not approved.

To make this workflow fully functional, you can add permissions for a requester group to the workflow and Orchestrator so that the user can use the vCenter to request a VM.

Things you can do to improve the workflow are as follows:

  • Schedule the provisioning to a future date.
  • Use the resources for the e-mail and replace the content.
  • Add an error workflow in case the provisioning fails.
  • Use AD to read out the current user's e-mail and full name to improve the workflow.
  • Create a workflow that lets an approver configure the configuration elements that a requester can choose from.
  • Reduce the selections by creating, for instance, a development and production configuration that contains the correct folders, datastores, networks, and so on.
  • Create a decommissioning workflow that is automatically scheduled so that the VM is destroyed automatically after a given period of time.

See also

Refer to the example workflow, 7.03 Approval, and the configuration element, 7 approval.

Summary

In this article, we discussed one of the important aspects of the interaction of Orchestrator with vCenter Server and vRealize Automation, that is, VM provisioning.

Resources for Article:

Further resources on this subject:

  • Importance of Windows RDS in Horizon View [article]
  • Metrics in vRealize Operations [article]
  • Designing and Building a Horizon View 6.0 Infrastructure [article]

Packt
04 Mar 2015
19 min read

Native MS Security Tools and Configuration

This article, written by Santhosh Sivarajan, the author of Getting Started with Windows Server Security, will introduce another powerful Microsoft tool called Microsoft Security Compliance Manager (SCM). As its name suggests, it is a platform for managing and maintaining your security and compliance policies.

At this point, we have established baseline security based on your business requirements, using Microsoft SCW. These policies can be a pure reflection of your business requirements. However, in an enterprise world, you have to consider compliance, regulations, other industry standards, and best practices to maximize the effectiveness of the security policy. That's where Microsoft SCM can provide more business value. We will talk more about the included SCM baselines later in the article.

The goal of the article is to walk you through the configuration and administration process of Microsoft SCM and explain how it can be used in an enterprise environment to support your security needs. Then we will talk about a method to maintain the desired state of the server using a Microsoft tool called Attack Surface Analyzer (ASA). At the end of the article, you will see an option to add more security restrictions using another Microsoft tool called AppLocker.

Microsoft SCM

Microsoft SCM is a centralized security and compliance policy manager product from Microsoft. It is a standalone application. Microsoft develops these baselines and best practice recommendations based on customer feedback and other agencies' recommendations. These policies are regularly reviewed and updated, so it is important that you use the latest policy baseline. If there is a new policy, you will be able to download and update the baseline from the Microsoft SCM console itself.
Since Microsoft SCM supports multiple input and output formats such as XML, Group Policy Objects (GPO), Desired Configuration Management (DCM), Security Content Automation Protocol (SCAP), and so on, it can be a centralized platform for your network infrastructure and other security and compliance products.

It is also possible to integrate SCM with Microsoft System Center 2012 Process Pack for IT GRC. More details can be found at http://technet.microsoft.com/en-us/library/dd206732.aspx.

Installing Microsoft SCM

We will start with the installation process. As mentioned earlier, it is a standalone product. It uses Microsoft SQL Server 2008 or higher as the database. If you don't have a SQL database already installed on your system, the SCM installation process will automatically install Microsoft SQL Server 2008 Express Edition. You can perform the following steps to install Microsoft SCM:

  • Download Microsoft Security Compliance Manager from http://www.microsoft.com/en-us/download/details.aspx?id=16776.
  • Double-click on Security_Compliance_Manager_Setup.exe to start the installation process.
  • Click on Next on the welcome window. Make sure to select the Always check for SCM and baseline updates option.
  • Accept the License Agreement option and click on Next.
  • Select the installation folder from the Installation Folder window by clicking on the Browse button. Click on Next.
  • On the Microsoft SQL Server 2008 Express window, click on Next to install Microsoft SQL Server 2008 Express Edition. If you have Microsoft SQL Server already installed on your system, you can select the correct server details from this window.
  • Accept the License Agreement option for SQL Server 2008 Express and click on Next.
  • Click on Install on the Ready to Install window to begin the installation.
  • You will see the progress in the Installing the Microsoft Security Compliance Manager window. If it asks you to restart the computer, click on OK.
  • Click on Finish to complete the installation.
This section provides a high-level overview of the product before we start the administration and management process.

The left pane of the SCM console provides the list of all available baselines. This is the baseline library inside SCM. The center pane displays more information based on your policy selection from the baseline library. The right pane, also called the Actions pane, provides commands and options to manage your policies. As you can see in the following screenshot, it provides a few options to export these policies into different formats. So, if you have a different compliance manager tool, you can use these files with your existing tool.

[Figure: SCM - Export options]

In line with other products, Microsoft SCM supports different severity levels: critical, optional, important, and none. As you can see in the following screenshot, on a custom policy, the severity levels can be changed to None, Important, Optional, or Critical based on your requirements:

For each of these settings, you will see additional details and reference articles (CCE, OVAL, and so on) in the Setting Details section.

Administering Microsoft SCM

This section provides you with an overview of Microsoft SCM and some administration procedures to create and manage policies. These tasks can be achieved by performing the following steps:

  • Open Security Compliance Manager. If you see a Download Updates popup window, click on the Download button to start the download and complete the database update process.
  • Security Compliance Manager consists of two main sections: Custom Baselines and Microsoft Baselines. We will go through the details later in this article.

[Figure: SCM - Baselines]

  • Expand Microsoft Baselines. Since we are focusing more on Windows Server 2012, I will start with this section.
  • Select the Windows Server 2012 node. This node contains predefined security policies based on Microsoft and industry best practices. I will use the predefined WS2012 Web Server Security template for this exercise.
You will not be able to make changes to the settings in the default template. If you need to make changes, you can make a copy of the template and make changes there.

  • Select the WS2012 Web Server Security template.
  • From the right pane, select the Duplicate option.
  • In the Duplicate window, enter the name for this new security policy. Click on Save.
  • The new template will be saved under the Custom Baselines node. You can review the policy and make the necessary changes in the newly created policy.

Creating and implementing security policies

At this point, you have installed SCM and are familiar with the basic administration tasks. From this section onwards, you will be working on a real-world scenario where you will be exporting a policy from Active Directory, importing it into SCM, merging it with an SCM baseline, and importing it back into Active Directory.

In this section, our goal is to export this web server policy, merge it with an SCM baseline, and import it back into Active Directory.

Exporting GPO from Active Directory

We will start by exporting the existing web server policy from Active Directory. The following steps can be performed to export (back up) an Active Directory GPO-based policy:

  • Open the Group Policy Management console.
  • Expand Forest | Domains | Domain Name | Group Policy Objects.
  • Right-click on the appropriate GPO and select Back Up.

[Figure: GPO - Back up]

  • In the Back Up Group Policy Object window, enter the Location and Description details for the backup file.
  • Click on the Back Up button to start the backup operation.
  • You will see the progress in the Backup window. Click on OK when it completes the backup operation.

A GPO can also be backed up using the Backup-GPO PowerShell cmdlet. The following is an example:

Backup-GPO -Name "WebServerbaselineV2.0" -Path D:\Backup -Comment "Baseline Backup"

The backup folder name will be the GUID of the GPO itself.

Importing GPO into SCM

An exported GPO-based policy can be imported directly into SCM.
An administrator can perform the following steps to complete this task: Open Microsoft Security Compliance Manager. From the Import section on the right pane, select the GPO Backup (Folder) option.

SCM – Import

In the Browse For Folder window, select the GPO backup folder. Click on OK. In the GPO Name window, confirm or change the baseline name. Click on OK. In the SCM Log window, you will see the status. Click on OK to close the window. You will see the imported policy under Custom Baselines | GPO Import | Policy Name.

Currently, SCM supports importing from GPO backup and SCM CAB files. If you have some other policy or baseline (for example, DISA STIGs) that you would like to import into SCM, you need to import these policies into Active Directory first, and then export them as a GPO backup before you can import them into SCM.

Merging imported GPO with the SCM baseline policy

The third step in this process is to merge the imported policy with the SCM baseline policy. Keep in mind that some configurations and settings will be lost when you merge an existing GPO with the SCM baseline policy. For example, service-related or ACL configurations may not be preserved when you associate and merge with an SCM baseline policy. If you have these types of configuration in your GPO and want to retain them, you may need to split the GPO and use two separate GPOs. Inside SCM, the import process maps these configurations to the SCM library in order to preserve the settings. If a setting doesn't match or map, it will be dropped from the new baseline policy. For this exercise, my assumption is that you don't have custom configurations or settings in the imported policy.

The following steps can be used to associate and merge a GPO-based policy into an SCM-based policy: Select the imported policy in Microsoft Security Compliance Manager.
From the right pane, select the Associate option from the Baseline section.

Selecting the Associate option

From the Associate Product with GPO window, select the appropriate baseline policy. Since we are working with a Windows Server 2012 policy, I will be selecting Windows Server 2012 as the product. If you have a different operating system, select the correct policy from the product list. Click on Associate. Your custom policy must have unique settings in the baseline policy in order to associate a custom policy with the SCM baseline policy; otherwise, the Associate button will be grayed out. Enter a name for this policy in the Baseline Policy window. You will see this policy in the Custom Baselines | Windows Server 2012 section. Select this policy. From the right pane, select the Compare/Merge option from the Baseline section.

Selecting the Compare / Merge option

Now you have associated your policy with an SCM baseline policy. The next step is to compare and merge your policy with a baseline SCM policy. From the Compare Baseline window, select the appropriate baseline policy. Since we are working with a web server baseline, we will be selecting WS2012 Web Server Security 1.0 as the policy. Click on OK. You will see the result in the Compare Baselines window. You can review the differing and matching settings here. Since we are planning to merge these two policies, we will be selecting the Merge Baselines option. You will see the summary report in the Merge Baselines window. Click on OK. In the Specify a name for the merged baseline window, enter a new name for this policy. Click on OK. This merged policy will be stored in the Custom Baselines | Windows Server 2012 section.

Exporting the SCM baseline policy

At this point, you have created a new policy that contains your custom policy and best practices provided by SCM. The next step is to export this policy to a supported format.
Since we are dealing with Active Directory and GPO, we will be exporting it into a GPO-based policy. You can perform the following steps to export an SCM policy to a GPO-based backup policy: Select the policy from Microsoft Security Compliance Manager. From the Export section, select the GPO Backup (Folder) option.

GPO Backup (Folder)

From the Browse for Folder window, select the folder to store this policy in. Click on OK.

Importing a policy into Active Directory

The final step in this process is to import these settings back to Active Directory. This can be achieved by using the Group Policy Management Console (GPMC). The following steps can be used to import an SCM-based policy into Active Directory: Open the Group Policy Management Console. Expand Forest | Domain | Domain Name | Group Policy Objects. Right-click on the appropriate policy. Select the Import Settings option.

The Import Settings option

Click on Next in the Welcome window. It is always a best practice to back up the existing settings. Click on Backup to continue with the backup operation. Once you have completed the backup, click on Next in the Backup GPO window. In the Backup Location window, select the backup location folder. Click on Next. Confirm the GPO name in the Source GPO window. Click on Next. You will see the scanning settings in the Scanning Backup window. Click on Next to continue. Click on Finish in the Completing the Import Settings Wizard window to complete the import operation. Click on OK in the Import window.

Maintaining and monitoring the integrity of a baseline policy

Once you have baseline security in place, whether it is a true business policy or a combination of business and industry practices, you will need to maintain this state to ensure security and integrity. The whole idea is to compare your baseline image with the current image in order to validate the settings. There are many ways to achieve this.
Microsoft has a free tool called Attack Surface Analyzer (ASA) that can be used to compare the two states of the system. The details and capabilities of this tool can be found at http://www.microsoft.com/en-us/download/details.aspx?id=24487.

Microsoft ASA

An administrator can perform the following steps to install, configure, and generate an Attack Surface Report using Microsoft ASA: Download Attack Surface Analyzer from http://www.microsoft.com/en-us/download/details.aspx?id=24487. Complete the installation. It is a standalone, simple MSI installation process. Open the Attack Surface Analyzer tool.

The first step is to create the baseline state. Select the Run New Scan option and enter a name for the CAB file. Click on Run Scan to start the scanning process. You will see the status and progress in the Collecting Data window. When it completes, it will create a CAB file with the result.

The second step in this process is to analyze the baseline state against the existing server so as to identify the differences. You will need to create another report (Product CAB) to compare with the baseline CAB. Select the Run New Scan option again and enter a name for the product CAB file. Click on Run Scan to start the scanning process. Complete the CAB creation process.

The third step in the process is to compare the baseline CAB with the product CAB to get the delta. Select the Generate Standard Attack Surface Report option. In the Select Options section, select the baseline CAB name, select the product CAB name, and enter a name for the attack report. Click on Generate to start the process. You will see the status in the Running Analysis window. The report will be opened automatically in the web browser. This report has three sections: Report Summary, Security Issues, and Attack Surface. The following is an example of a Security Issues report.

Application control and management

At this point, you have a baseline policy for your server platform.
Now we can add more restrictions based on your requirements to provide a more secure environment. In the following section, my plan is to introduce an option to "blacklist" and "whitelist" some of the applications using a built-in native option called AppLocker. The details of AppLocker can be found at http://technet.microsoft.com/en-us/library/hh831409.aspx.

AppLocker

AppLocker policies are part of Application Control Policies in GPOs. There are four types of built-in rules: Executable, Windows Installer, Script, and Packaged app rules. Before you create or enforce a policy, you need to perform an inventory check to identify the current usage of these applications in your environment. AppLocker has an inventory process called Auditing that helps you to achieve this. In this scenario, our goal is to block unauthorized access to the NLTEST application on all servers.

Creating a policy

As the first step, you need to identify the current usage of the application in your environment. The following steps can be performed to create a new AppLocker policy in an Active Directory environment: Open the Group Policy Management Console. Expand Forest | Domain | Domain Name. Right-click on the Group Policy Object node and select New. Enter a name for the GPO in the New GPO window. Leave Source Starter GPO as (none). Click on OK. This will create a new blank GPO in the Group Policy Object node. We will be using this GPO to configure the AppLocker settings. Right-click on the newly created GPO and select Edit. This will open the Group Policy Management Editor window. Expand Policies | Windows Settings | Security Settings | AppLocker. Right-click on Executable Rules and select Create Default Rules. These default rules allow users and built-in administrators to run default programs and administrators to run files and applications. Based on your requirements, you can modify and delete these rules.
The default AppLocker rule allows everyone to run files located only in the Windows folder, and the administrator can run all files.

The default AppLocker rule

Expand Policies | Windows Settings | Security Settings | AppLocker. Right-click on Executable Rules and select Create New Rule. Click on Next in the Create Executable Rules window. In the Permissions window, select Deny. In the User or Group section, click on Select and select the Server Admins group. Here, I have created a security group with all server administrators in that group. In the Conditions window, select the File Hash option. Click on Next. In the File Hash window, select the correct file name using the Browse File option. In this scenario, I will be selecting the NLTEST.exe file. Click on Next. In the Name and Description window, select or enter an appropriate name for this rule. Click on Create.

Auditing a policy

The next step in this process is to audit the previously created policies to ensure that there will not be any adverse effects on your environment. An administrator can perform the following steps to audit an existing policy in an Active Directory environment: Right-click on AppLocker (Policies | Windows Settings | Security Settings) and go to Properties. On the Enforcement tab, select the appropriate rule types as Configured. From the drop-down list, select Audit only. Click on OK.

GPO – AppLocker policy

You can see the application usage and history in the Event log. Open Event Viewer. Navigate to Applications and Services Logs | Microsoft | Windows | AppLocker. Based on your policy configuration, you will see the appropriate event information in the AppLocker section. In an enterprise environment, manually checking the items in an event log is not a viable option. You have a few options available to automate this process.
You can forward the event log to a central server (Event Forwarding) and verify from that single console, or you can use the Get-WinEvent PowerShell cmdlet to collect these events remotely. The following section provides an option to evaluate these logs using the Get-WinEvent PowerShell cmdlet. By default, AppLocker events are located in the Applications and Services Logs | Microsoft | Windows | AppLocker section of the Event Viewer. The following command filters all AppLocker-related events from Server01 and writes them to the output file Server01.txt:

Get-WinEvent -ComputerName "SERVER01.MYINFRALAB.COM" -LogName *AppLocker* | fl | Out-File Server01.txt

Here are some of the events that you will see in the event log:

If you have multiple computers to evaluate, you can create a simple PowerShell script to automatically input the computer names. The following is a sample PowerShell script. The Servers.txt file will be your input file that contains all of the server names:

$OutPut = "C:\Input\Output.txt"
Get-Content "C:\Input\Servers.txt" | ForEach-Object {
    # Write the server name as a header, then append that server's AppLocker events
    $_ | Out-File $OutPut -Append -Encoding ascii
    Get-WinEvent -ComputerName $_ -LogName *AppLocker* | fl | Out-File $OutPut -Append -Encoding ascii
}

Implementing the policy

Once you have verified the audit result, you can enforce the policy using the AppLocker GPO. The following steps can be used to implement the AppLocker GPO in an Active Directory environment: Open the Group Policy Management Console. Expand the Forest | Domain | Domain Name | Group Policy Object node. Right-click on the Server Application Restriction GPO and select Edit. This will open a Group Policy Management Editor MMC window.

Opening the Group Policy Management Editor MMC window

From Group Policy Management Editor, expand Policies | Windows Settings | Security Settings. Right-click on AppLocker and select Properties. In the AppLocker Properties window, change Executable rules to Enforce rules.
Click on OK. Close the Group Policy Management Editor MMC window.

The new policy will apply to the server based on your Active Directory replication interval and GPO refresh cycle. You can use the gpupdate /force command to apply the GPO immediately on a local server. Two different results are shown in the following screenshots. As you can see in the following screenshot, the user Johndoe was denied the execution of the NLTEST.exe application. Since the following user was part of the Server Admins group, the user was allowed to execute the NLTEST.exe application.

Some additional security recommendations to consider when installing and configuring AppLocker are included at http://technet.microsoft.com/en-us/library/ee844118(WS.10).aspx.

AppLocker and PowerShell

AppLocker supports PowerShell, and it has a PowerShell module called AppLocker. An administrator can create, test, and troubleshoot the AppLocker policies using these cmdlets. You need to import the AppLocker module before these cmdlets can be used. The following are the supported cmdlets in the module: Get-AppLockerFileInformation, Get-AppLockerPolicy, New-AppLockerPolicy, Set-AppLockerPolicy, and Test-AppLockerPolicy.

Summary

We started this article with baseline security for your server platform, which was originally created using Microsoft SCW. In this article, you learned how to incorporate this policy with the baseline and best practice recommendations using Microsoft SCM. Then you used AppLocker to enforce more application-based security. We also learned how to monitor the state of the server and compare it with the baseline to identify security vulnerabilities and issues using Microsoft ASA.

Resources for Article:

Further resources on this subject: Active Directory migration [article] Microsoft DAC 2012 [article] Insight into Hyper-V Storage [article]

Packt
03 Mar 2015
14 min read

Introducing Splunk

In this article by Betsy Page Sigman, author of the book Splunk Essentials, we introduce Splunk. Splunk, whose name was inspired by the process of exploring caves, or spelunking, helps analysts, operators, programmers, and many others explore data from their organizations by obtaining, analyzing, and reporting on it. This multinational company, cofounded by Michael Baum, Rob Das, and Erik Swan, has a core product called Splunk Enterprise. This manages searches, inserts, deletes, and filters, and analyzes big data that is generated by machines, as well as other types of data. They also have a free version that has most of the capabilities of Splunk Enterprise and is an excellent learning tool. (For more resources related to this topic, see here.)

Understanding events, event types, and fields in Splunk

An understanding of events and event types is important before going further.

Events

In Splunk, an event is not just one of the many local user meetings that are set up between developers to help each other out (although those can be very useful); it also refers to a record of one activity that is recorded in a log file. Each event usually has:

A timestamp indicating the date and exact time the event was created
Information about what happened on the system that is being tracked

Event types

An event type is a way to allow users to categorize similar events. It is a field defined by the user. You can define an event type in several ways, and the easiest way is by using the Splunk Web interface. One common reason for setting up an event type is to examine why a system has failed. Logins are often problematic for systems, and a search for failed logins can help pinpoint problems. For an interesting example of how to save a search on failed logins as an event type, visit http://docs.splunk.com/Documentation/Splunk/6.1.3/Knowledge/ClassifyAndGroupSimilarEvents#Save_a_search_as_a_new_event_type.

Why are events and event types so important in Splunk?
Because without events, there would be nothing to search, of course. And event types allow us to make meaningful searches easily and quickly according to our needs, as we'll see later.

Sourcetypes

Sourcetypes are also important to understand, as they help define the rules for an event. A sourcetype is one of the default fields that Splunk assigns to data as it comes into the system. It determines what type of data it is so that Splunk can format it appropriately as it indexes it. This also allows the user who wants to search the data to easily categorize it. Some of the common sourcetypes are listed as follows:

access_combined, for NCSA combined format HTTP web server logs
apache_error, for standard Apache web server error logs
cisco_syslog, for the standard syslog produced by Cisco network devices (including PIX firewalls, routers, and ACS), usually via remote syslog to a central log host
websphere_core, a core file export from WebSphere

(Source: http://docs.splunk.com/Documentation/Splunk/latest/Data/Whysourcetypesmatter)

Fields

Each event in Splunk is associated with a number of fields. The core fields of host, source, sourcetype, and timestamp are key to Splunk. These fields are extracted from events at multiple points in the data processing pipeline that Splunk uses, and each of these fields includes a name and a value. The name describes the field (such as userid) and the value says what that field's value is (susansmith, for example). Some of these fields are default fields that are given because of where the event came from or what it is. Splunk uses these fields when it processes data, both at index time and at search time. For indexing, the default fields added include those of host, source, and sourcetype. When searching, Splunk is able to select from a bevy of fields that can either be defined by the user or are very basic, such as an action that results in a purchase (for a website event).
Fields are essential for doing the basic work of Splunk, that is, indexing and searching.

Getting data into Splunk

It's time to spring into action now and input some data into Splunk. Adding data is simple, easy, and quick. In this section, we will use some data and tutorials created by Splunk to learn how to add data:

Firstly, to obtain your data, visit the Splunk tutorial data page at http://docs.splunk.com/Documentation/Splunk/6.1.5/SearchTutorial/GetthetutorialdataintoSplunk. Here, download the file tutorialdata.zip. Note that this will be a fresh dataset that has been collected over the last 7 days. Download it but don't extract the data from it just yet.

You then need to log in to Splunk, using admin as the username along with your password. Once logged in, you will notice that toward the upper-right corner of your screen is the button Add Data, as shown in the following screenshot. Click on this button:

Button to Add Data

Once you have clicked on this button, you'll see a screen similar to the following screenshot:

Add Data to Splunk by Choosing a Data Type or Data Source

Notice here the different types of data that you can select, as well as the different data sources. Since the data we're going to use is a file, under Or Choose a Data Source, click on From files and directories. Once you have clicked on this, you can then click on the radio button next to Skip preview, as indicated in the following screenshot, since you don't need to preview the data now. You then need to click on Continue:

Preview data

You can download the tutorial files at: http://docs.splunk.com/Documentation/Splunk/6.1.5/SearchTutorial/GetthetutorialdataintoSplunk

As shown in the next screenshot, click on Upload and index a file, find the tutorialdata.zip file you just downloaded (it is probably in your Downloads folder), and then click on More settings, filling it in as shown in the following screenshot.
(Note that you will need to select Segment in path under Host and type 1 under Segment Number.) Click on Save when you are done:

Can specify source, additional settings, and source type

Following this, you should see a screen similar to the following screenshot. Click on Start Searching; we will look at the data now:

You should see this if your data has been successfully indexed into Splunk.

You will now see a screen similar to the following screenshot. Notice that the number of events you have will be different, as will the time of the earliest event. At this point, click on Data Summary:

The Search screen

You should see the Data Summary screen like in the following screenshot. However, note that the Hosts shown here will not be the same as the ones you get. Take a quick look at what is on the Sources tab and the Sourcetypes tab. Then find the most recent data (in this case 127.0.0.1) and click on it.

Data Summary, where you can see Hosts, Sources, and Sourcetypes

After clicking on the most recent data, which in this case is bps-T341s, look at the events contained there. Later, when we use streaming data, we can see how the events at the top of this list change rapidly. Here, you will see a listing of events, similar to those shown in the following screenshot:

Events lists for the host value

You can click on the Splunk logo in the upper-left corner of the web page to return to the home page. Under Administrator at the top-right of the page, click on Logout.

Searching Twitter data

We will start here by doing a simple search of our Twitter index, which is automatically created by the app once you have enabled Twitter input (as explained previously). In our earlier searches, we used the default index (which the tutorial data was downloaded to), so we didn't have to specify the index we wanted to use. Here, we will use just the Twitter index, so we need to specify that in the search.
A simple search

Imagine that we wanted to search for tweets containing the word coffee. We could use the code presented here and place it in the search bar:

index=twitter text=*coffee*

The preceding code searches only your Twitter index and finds all the places where the word coffee is mentioned. (Note that the text field is not case sensitive, so tweets with either "coffee" or "Coffee" will be included in the search results.) The asterisks are included before and after the text "coffee" because otherwise we would only get events where just "coffee" was tweeted, a rather rare occurrence, we expect. In fact, when we search our indexed Twitter data without the asterisks around coffee, we get no results.

Examining the Twitter event

Before going further, it is useful to stop and closely examine the events that are collected as part of the search. The sample tweet shown in the following screenshot shows the large number of fields that are part of each tweet. The > was clicked to expand the event:

A Twitter event

There are several items to look closely at here:

_time: Splunk assigns a timestamp for every event. This is done in UTC (Coordinated Universal Time) time format.
contributors: The value for this field is null, as are the values of many Twitter fields.
retweeted_status: Notice the {+} here; in the following event list, you will see there are a number of fields associated with this, which can be seen when the + is selected and the list is expanded. This is the case wherever you see a {+} in a list of fields:

Various retweet fields

In addition to those shown previously, there are many other fields associated with a tweet. The 140 character (maximum) text field that most people consider to be the tweet is actually a small part of the actual data collected.

The implied AND

If you want to search on more than one term, there is no need to add AND as it is already implied.
If, for example, you want to search for all tweets that include both the text "coffee" and the text "morning", then use:

index=twitter text=*coffee* text=*morning*

If you don't specify text= for the second term and just put *morning*, Splunk assumes that you want to search for *morning* in any field. Therefore, you could get that word in another field in an event. This isn't very likely in this case, although coffee could conceivably be part of a user's name, such as "coffeelover". But if you were searching for other text strings, such as a computer term like log or error, such terms could be found in a number of fields. So specifying the field you are interested in would be very important.

The need to specify OR

Unlike AND, you must always specify the word OR. For example, to obtain all events that mention either coffee or morning, enter:

index=twitter text=*coffee* OR text=*morning*

Finding other words used

Sometimes you might want to find out what other words are used in tweets about coffee. You can do that with the following search:

index=twitter text=*coffee* | makemv text | mvexpand text | top 30 text

This search first searches for the word "coffee" in a text field, then creates a multivalued field from the tweet, and then expands it so that each word is treated as a separate piece of text. Then it takes the top 30 words that it finds.

You might be asking yourself how you would use this kind of information. This type of analysis would be of interest to a marketer, who might want to use words that appear to be associated with coffee in composing the script for an advertisement. The following screenshot shows the results that appear (1 of 2 pages). From this search, we can see that the words love, good, and cold might be words worth considering:

Search of top 30 text fields found with *coffee*

When you do a search like this, you will notice that there are a lot of filler words (a, to, for, and so on) that appear. You can do two things to remedy this.
You can increase the limit for top words so that you can see more of the words that come up, or you can rerun the search with the filler words excluded. "Coffee" (with a capital C) is listed (on the unshown second page) separately here from "coffee". The reason for this is that while the search is not case sensitive (thus both "coffee" and "Coffee" are picked up when you search on "coffee"), the process of putting the text fields through the makemv and the mvexpand processes ends up distinguishing on the basis of case. We could rerun the search, excluding some of the filler words, using the code shown here:

index=twitter text=*coffee* | makemv text | mvexpand text | search NOT text="RT" AND NOT text="a" AND NOT text="to" AND NOT text="the" | top 30 text

Using a lookup table

Sometimes it is useful to use a lookup file to avoid having to use repetitive code. It would help us to have a list of all the small words that might be found often in a tweet just by the nature of each word's frequent use in language, so that we might eliminate them from our quest to find words that would be relevant for use in the creation of advertising. If we had a file of such small words, we could use a command indicating not to use any of these more common, irrelevant words when listing the top 30 words associated with our search topic of interest. Thus, for our search for words associated with the text "coffee", we would be interested in words like "dark", "flavorful", and "strong", but not words like "a", "the", and "then". We can do this using a lookup command. There are three types of lookup commands, which are described as follows:

lookup: Matches a value of one field with a value of another, based on a .csv file with the two fields. Consider a lookup table named lutable that contains fields for machine_name and owner. Consider what happens when the following code snippet is used after a preceding search (indicated by . . . |):

. . . | lookup lutable owner

Splunk will use the lookup table to match the owner's name with its machine_name and add the machine_name to each event.

inputlookup: All fields in the .csv file are returned as results. If the following code snippet is used, both machine_name and owner would be searched:

. . . | inputlookup lutable

outputlookup: This command outputs search results to a lookup table. The following code saves the results from the preceding search directly into a table it creates:

. . . | outputlookup newtable.csv

The command we will use here is inputlookup, because we want to reference a .csv file we can create that will include words that we want to filter out as we seek to find possible advertising words associated with coffee. Let's call the .csv file filtered_words.csv, and give it just a single text field, containing words like "is", "the", and "then". Let's rewrite the search to look like the following code:

index=twitter text=*coffee* | makemv text | mvexpand text | search NOT [| inputlookup filtered_words.csv | fields text] | top 30 text

Using the preceding code, Splunk will search our Twitter index for *coffee*, and then expand the text field so that individual words are separated out. Then it will look for words that do NOT match any of the words in our filtered_words.csv file, and finally output the top 30 most frequently found words among those. As you can see, the lookup table can be very useful. To learn more about Splunk lookup tables, go to http://docs.splunk.com/Documentation/Splunk/6.1.5/SearchReference/Lookup.

Summary

In this article, we have learned more about how to use Splunk to create reports and dashboards. Splunk Enterprise Software, or Splunk, is an extremely powerful tool for searching, exploring, and visualizing data of all types. Splunk is becoming increasingly popular, as more and more businesses, both large and small, discover its ease and usefulness.
Analysts, managers, students, and others can quickly learn how to use the data from their systems, networks, web traffic, and social media to make attractive and informative reports. This is a straightforward, practical, and quick introduction to Splunk that should have you making reports and gaining insights from your data in no time. Resources for Article: Further resources on this subject: Lookups [article] Working with Apps in Splunk [article] Loading data, creating an app, and adding dashboards and reports in Splunk [article]
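As a recap of the Using a lookup table section, the same filter-then-count idea can be sketched outside Splunk. This is only an illustration in Python, not how Splunk implements its commands: the function name top_words, the sample tweets, and the choice to compare filter words case-insensitively are all assumptions made for this sketch.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def top_words(tweets: Iterable[str], term: str, filtered: Set[str],
              limit: int = 30) -> List[Tuple[str, int]]:
    """Count words in tweets that mention `term`, skipping filtered words.

    Hypothetical helper that mirrors the shape of the search in the article:
    match the term, split the text into words, drop filler words, take the top N.
    """
    counts: Counter = Counter()
    for text in tweets:
        # Roughly text=*coffee*: a case-insensitive substring match
        if term.lower() not in text.lower():
            continue
        # Roughly makemv + mvexpand: treat each whitespace-separated word separately
        for word in text.split():
            # Roughly search NOT [...]: skip words listed in the filter set
            # (assumption: the filter set holds lowercase words)
            if word.lower() in filtered:
                continue
            counts[word] += 1  # counting stays case sensitive, as noted in the article
    return counts.most_common(limit)


# Made-up sample data, standing in for the Twitter index and filtered_words.csv
sample_tweets = [
    "I love coffee in the morning",
    "Coffee is good",
    "tea time",
    "RT cold coffee is the best",
]
filler_words = {"i", "in", "the", "is", "rt"}
print(top_words(sample_tweets, "coffee", filler_words, limit=5))
```

Keeping the filler words in a separate set plays the same role as the .csv lookup file: the list can grow without the counting logic changing.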
Packt
03 Mar 2015
28 min read

Elasticsearch Administration

In this article by Rafał Kuć and Marek Rogoziński, authors of the book Mastering Elasticsearch, Second Edition, we will talk more about the Elasticsearch configuration and new features introduced in Elasticsearch 1.0 and higher. (For more resources related to this topic, see here.) By the end of this article, you will have learned about:

Configuring the discovery and recovery modules
Using the Cat API that allows a human-readable insight into the cluster status
The backup and restore functionality
Federated search

Discovery and recovery modules

When starting your Elasticsearch node, one of the first things that Elasticsearch does is look for a master node that has the same cluster name and is visible in the network. If a master node is found, the starting node joins the already formed cluster. If no master is found, then the node itself is selected as a master (of course, if the configuration allows such behavior). The process of forming a cluster and finding nodes is called discovery. The module responsible for discovery has two main purposes: electing a master and discovering new nodes within a cluster.

After the cluster is formed, a process called recovery is started. During the recovery process, Elasticsearch reads the metadata and the indices from the gateway, and prepares the shards that are stored there to be used. After the recovery of the primary shards is done, Elasticsearch should be ready for work and should continue with the recovery of all the replicas (if they are present).

In this section, we will take a deeper look at these two modules and discuss the configuration possibilities Elasticsearch gives us and what the consequences of changing them are. Note that the information provided in the Discovery and recovery modules section is an extension of what we already wrote in Elasticsearch Server Second Edition, published by Packt Publishing.
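The startup behavior described above (join a visible master with the same cluster name, otherwise become the master if the configuration allows it) can be sketched as a toy model. This is a minimal Python illustration and not Elasticsearch code: the Node class, its master_eligible flag, and the discover function are hypothetical names invented for this sketch, and real master election involves pinging and voting that are not modeled here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for an Elasticsearch node in this toy model."""
    name: str
    cluster_name: str
    is_master: bool = False
    master_eligible: bool = True  # assumption: models "if the configuration allows"


def discover(starting: Node, visible: List[Node]) -> Optional[Node]:
    """Return the master the starting node ends up with, or None if there is none."""
    # Look for a visible master that declares the same cluster name
    for other in visible:
        if other.is_master and other.cluster_name == starting.cluster_name:
            return other  # join the already formed cluster
    # No matching master was found: the node selects itself, if allowed to
    if starting.master_eligible:
        starting.is_master = True
        return starting
    return None


prod_master = Node("node-1", "prod", is_master=True)
dev_master = Node("node-2", "dev", is_master=True)
newcomer = Node("node-3", "prod")
print(discover(newcomer, [dev_master, prod_master]).name)
```

The sketch also shows why relying only on the cluster name matters: a node that declares a different name simply never matches an existing master and forms its own cluster.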
Discovery configuration

As we have already mentioned multiple times, Elasticsearch was designed to work in a distributed environment. This is the main difference when comparing Elasticsearch to other open source search and analytics solutions available. With such assumptions, Elasticsearch is very easy to set up in a distributed environment, and we are not forced to set up additional software to make it work like this. By default, Elasticsearch assumes that the cluster is automatically formed by the nodes that declare the same cluster.name setting and can communicate with each other using multicast requests. This allows us to have several independent clusters in the same network. There are a few implementations of the discovery module that we can use, so let's see what the options are.

Zen discovery

Zen discovery is the default mechanism responsible for discovery in Elasticsearch. The default Zen discovery configuration uses multicast to find other nodes. This is a very convenient solution: just start a new Elasticsearch node and everything works; the node will join the cluster if it has the same cluster name and is visible to the other nodes in that cluster. This discovery method is well suited to development, because you don't need to care about the configuration; however, it is not advised for production environments. Relying only on the cluster name is handy but can also lead to problems and mistakes, such as nodes joining a cluster accidentally. Sometimes multicast is not available, or you don't want to use it for the reasons just mentioned. In bigger clusters, multicast discovery may also generate too much unnecessary traffic, which is another valid reason not to use it in production. For these cases, Zen discovery allows us to use the unicast mode.
When using the unicast Zen discovery, a node that is not a part of the cluster will send a ping request to all the addresses specified in the configuration. By doing this, it informs all the specified nodes that it is ready to be a part of the cluster and can either join an existing cluster or form a new one. Of course, after the node joins the cluster, it gets the cluster topology information, but the initial connection is only made to the specified list of hosts. Remember that even when using unicast Zen discovery, the Elasticsearch node still needs to have the same cluster name as the other nodes.

If you want to know more about the differences between multicast and unicast ping methods, refer to these URLs: http://en.wikipedia.org/wiki/Multicast and http://en.wikipedia.org/wiki/Unicast.

If you still want to learn about the configuration properties of multicast Zen discovery, let's look at them.

Multicast Zen discovery configuration

The multicast part of the Zen discovery module exposes the following settings:

  • discovery.zen.ping.multicast.address (the default: all available interfaces): This is the interface used for the communication, given as an address or an interface name.
  • discovery.zen.ping.multicast.port (the default: 54328): This port is used for communication.
  • discovery.zen.ping.multicast.group (the default: 224.2.2.4): This is the multicast address to send messages to.
  • discovery.zen.ping.multicast.buffer_size (the default: 2048): This is the size of the buffer used for multicast messages.
  • discovery.zen.ping.multicast.ttl (the default: 3): This is the time for which a multicast message lives. Every time a packet crosses a router, the TTL is decreased, which limits the area in which the transmission can be received. Note that routers can have threshold values assigned that are compared to the TTL, so the TTL value does not always match exactly the number of routers a packet can cross.
  • discovery.zen.ping.multicast.enabled (the default: true): Setting this property to false turns off multicast. You should disable multicast if you are planning to use the unicast discovery method.

The unicast Zen discovery configuration

The unicast part of Zen discovery provides the following configuration options:

  • discovery.zen.ping.unicast.hosts: This is the initial list of nodes in the cluster. The list can be defined as a comma-separated list or as an array of hosts. Every host can be given as a name (or an IP address), optionally with a port or port range added. For example, the value of this property can look like this: ["master1", "master2:8181", "master3[9300-9400]"]. The hosts list for the unicast discovery doesn't need to be a complete list of Elasticsearch nodes in your cluster, because once the node is connected to one of the mentioned nodes, it will be informed about all the others that form the cluster.
  • discovery.zen.ping.unicast.concurrent_connects (the default: 10): This is the maximum number of concurrent connections unicast discovery will use. If you have a lot of nodes that the initial connection should be made to, it is advised that you increase the default value.

Master node

One of the main purposes of discovery, apart from connecting to other nodes, is to choose a master node: a node that will take care of and manage all the other nodes. This process is called master election and is a part of the discovery module. No matter how many master eligible nodes there are, each cluster will only have a single master node active at a given time. If there is more than one master eligible node present in the cluster, one of the others can be elected as the master when the original master fails and is removed from the cluster.

Configuring master and data nodes

By default, Elasticsearch allows every node to be a master node and a data node.
However, in certain situations, you may want to have worker nodes, which only hold the data or process the queries, and master nodes that are only used to manage the cluster. One such situation is handling a massive amount of data, where data nodes should be as performant as possible and there shouldn't be any delay in the master nodes' responses.

Configuring data-only nodes

To set the node to only hold data, we need to instruct Elasticsearch that we don't want such a node to be a master node. In order to do this, we add the following properties to the elasticsearch.yml configuration file:

    node.master: false
    node.data: true

Configuring master-only nodes

To set the node not to hold data and only to be a master node, we need to instruct Elasticsearch that we don't want such a node to hold data. In order to do that, we add the following properties to the elasticsearch.yml configuration file:

    node.master: true
    node.data: false

Configuring the query processing-only nodes

For large enough deployments, it is also wise to have nodes that are only responsible for aggregating query results from other nodes. Such nodes should be configured as nonmaster and nondata, so they should have the following properties in the elasticsearch.yml configuration file:

    node.master: false
    node.data: false

Please note that the node.master and the node.data properties are set to true by default, but we tend to include them for configuration clarity.

The master election configuration

We already wrote about the master election configuration in Elasticsearch Server Second Edition, but this topic is very important, so we decided to refresh our knowledge about it.

Imagine that you have a cluster that is built of 10 nodes. Everything is working fine until, one day, your network fails and three of your nodes are disconnected from the cluster, but they still see each other.
Because of the Zen discovery and the master election process, the nodes that got disconnected elect a new master, and you end up with two clusters with the same name and two master nodes. Such a situation is called a split-brain, and you must avoid it as much as possible. When a split-brain happens, you end up with two (or more) clusters that won't join each other until the network (or any other) problems are fixed. If you index your data during this time, you may end up with data loss and unrecoverable situations when the nodes get joined together after the network split.

In order to prevent split-brain situations, or at least minimize the possibility of their occurrence, Elasticsearch provides the discovery.zen.minimum_master_nodes property. This property defines the minimum number of master eligible nodes that should be connected to each other in order to form a cluster. So now, let's get back to our cluster; if we set the discovery.zen.minimum_master_nodes property to 50 percent of the total nodes available plus one (which is six, in our case), we would end up with a single cluster. Why is that? Before the network failure, we would have 10 nodes, which is more than six nodes, and these nodes would form a cluster. After the disconnection of the three nodes, the remaining seven nodes are still at least six, so the first cluster stays up and running. However, because only three nodes disconnected and three is less than six, these three nodes wouldn't be allowed to elect a new master and they would wait for reconnection with the original cluster.

Zen discovery fault detection and configuration

Elasticsearch runs two detection processes while it is working. The first process is to send ping requests from the current master node to all the other nodes in the cluster to check whether they are operational. The second process is a reverse of that: each of the nodes sends ping requests to the master in order to verify that it is still up and running and performing its duties.
However, if we have a slow network or our nodes are in different hosting locations, the default configuration may not be sufficient. Because of this, the Elasticsearch discovery module exposes three properties that we can change:

  • discovery.zen.fd.ping_interval: This defaults to 1s and specifies how often the node will send ping requests to the target node.
  • discovery.zen.fd.ping_timeout: This defaults to 30s and specifies how long the node will wait for a response to the sent ping request. If your nodes are 100 percent utilized or your network is slow, you may consider increasing that property value.
  • discovery.zen.fd.ping_retries: This defaults to 3 and specifies the number of ping request retries before the target node will be considered not operational. You can increase this value if your network has a high number of lost packets (or you can fix your network).

There is one more thing that we would like to mention. The master node is the only node that can change the state of the cluster. To keep cluster state updates in a proper sequence, the Elasticsearch master node processes cluster state update requests one at a time, makes the changes locally, and sends the request to all the other nodes so that they can synchronize their state. The master node waits a given time for the nodes to respond, and when that time passes or all the nodes have responded with their acknowledgment information, it proceeds with the next cluster state update request. To change how long the master node waits for all the other nodes to respond, modify the default of 30 seconds by setting the discovery.zen.publish_timeout property. Increasing the value may be needed for huge clusters working in an overloaded network.

The Amazon EC2 discovery

Amazon, in addition to selling goods, has a few popular services such as selling storage or computing power in a pay-as-you-go model.
One of them, Amazon Elastic Compute Cloud (EC2), provides server instances and, of course, they can be used to install and run Elasticsearch clusters (among many other things, as these are normal Linux machines). This is convenient: you pay for the instances that are needed in order to handle the current traffic or to speed up calculations, and you shut down unnecessary instances when the traffic is lower. Elasticsearch works well on EC2, but due to the nature of the environment, some features may work slightly differently. One of these features is discovery, because Amazon EC2 doesn't support multicast discovery. Of course, we can switch to unicast discovery, but sometimes we want to be able to automatically discover nodes and, with unicast, we need to at least provide the initial list of hosts. However, there is an alternative: we can use the Amazon EC2 plugin, a plugin that combines the multicast and unicast discovery methods using the Amazon EC2 API.

Make sure that during the setup of EC2 instances, you set up communication between them (on ports 9200 and 9300 by default). This is crucial in order to have Elasticsearch nodes communicate with each other and is thus required for the cluster to function. Of course, this communication depends on the network.bind_host and network.publish_host (or network.host) settings.

The EC2 plugin installation

The installation of the plugin is as simple as with most of the plugins.
In order to install it, we should run the following command:

    bin/plugin install elasticsearch/elasticsearch-cloud-aws/2.4.0

The EC2 plugin's generic configuration

This plugin provides several configuration settings that we need to provide in order for the EC2 discovery to work:

  • cloud.aws.access_key: This is the Amazon access key, one of the credential values you can find in the Amazon configuration panel.
  • cloud.aws.secret_key: This is the Amazon secret key. Similar to the previously mentioned access_key setting, it can be found in the EC2 configuration panel.

The last thing is to inform Elasticsearch that we want to use a new discovery type by setting the discovery.type property to the ec2 value and turning off multicast.

Optional EC2 discovery configuration options

The previously mentioned settings are sufficient to run the EC2 discovery, but in order to control the EC2 discovery plugin behavior, Elasticsearch exposes additional settings:

  • cloud.aws.region: This region will be used to connect with Amazon EC2 web services. You can choose the region where your instances reside, for example, eu-west-1 for Ireland. Possible values include eu-west, sa-east, us-east, us-west-1, us-west-2, ap-southeast-1, and ap-southeast-2.
  • cloud.aws.ec2.endpoint: If you are using EC2 API services, instead of defining a region, you can provide an address of the AWS endpoint, for example, ec2.eu-west-1.amazonaws.com.
  • cloud.aws.protocol: This is the protocol that should be used by the plugin to connect to Amazon Web Services endpoints. By default, Elasticsearch will use the HTTPS protocol (which means setting the value of the property to https). We can also change this behavior and set the property to http for the plugin to use HTTP without encryption. We are also allowed to override the cloud.aws.protocol setting for each service by using the cloud.aws.ec2.protocol and cloud.aws.s3.protocol properties (the possible values are the same: https and http).
  • cloud.aws.proxy_host: Elasticsearch allows us to define a proxy that will be used to connect to AWS endpoints. The cloud.aws.proxy_host property should be set to the address of the proxy that should be used.
  • cloud.aws.proxy_port: The second property related to the AWS endpoints proxy allows us to specify the port on which the proxy is listening. The cloud.aws.proxy_port property should be set to that port.
  • discovery.ec2.ping_timeout (the default: 3s): This is the time to wait for the response to the ping message sent to the other node. After this time, the nonresponsive node will be considered dead and removed from the cluster. Increasing this value makes sense when dealing with network issues or when we have a lot of EC2 nodes.

The EC2 nodes scanning configuration

The last group of settings we want to mention allows us to configure a very important thing when building a cluster that works inside the EC2 environment: the ability to filter the available Elasticsearch nodes in our Amazon Elastic Compute Cloud network. The Elasticsearch EC2 plugin exposes the following properties that can help us configure its behavior:

  • discovery.ec2.host_type: This allows us to choose the host type that will be used to communicate with other nodes in the cluster. The values we can use are private_ip (the default one; the private IP address will be used for communication), public_ip (the public IP address will be used for communication), private_dns (the private hostname will be used for communication), and public_dns (the public hostname will be used for communication).
  • discovery.ec2.groups: This is a comma-separated list of security groups. Only nodes that fall within these groups can be discovered and included in the cluster.
  • discovery.ec2.availability_zones: This is an array or comma-separated list of availability zones. Only nodes within the specified availability zones will be discovered and included in the cluster.
  • discovery.ec2.any_group (this defaults to true): Setting this property to false will force the EC2 discovery plugin to discover only those nodes that reside in an Amazon instance that falls into all of the defined security groups. The default value requires only a single group to be matched.
  • discovery.ec2.tag: This is a prefix for a group of EC2-related settings. When you launch your Amazon EC2 instances, you can define tags, which can describe the purpose of the instance, such as the customer name or environment type. Then, you use these defined settings to limit discovery nodes. Let's say you define a tag named environment with a value of qa. In the configuration, you can now specify discovery.ec2.tag.environment: qa, and only nodes running on instances with this tag will be considered for discovery.
  • cloud.node.auto_attributes: When this is set to true, Elasticsearch will add EC2-related node attributes (such as the availability zone or group) to the node properties and will allow us to use them, adjusting the Elasticsearch shard allocation and configuring the shard placement.

Other discovery implementations

The Zen discovery and EC2 discovery are not the only discovery types that are available. There are two more discovery types that are developed and maintained by the Elasticsearch team, and these are:

  • Azure discovery: https://github.com/elasticsearch/elasticsearch-cloud-azure
  • Google Compute Engine discovery: https://github.com/elasticsearch/elasticsearch-cloud-gce

In addition to these, there are a few discovery implementations provided by the community, such as the ZooKeeper discovery for older versions of Elasticsearch (https://github.com/sonian/elasticsearch-zookeeper).

The gateway and recovery configuration

The gateway module allows us to store all the data that is needed for Elasticsearch to work properly.
This means that not only is the data in Apache Lucene indices stored, but also all the metadata (for example, index allocation settings), along with the mappings configuration for each index. Whenever the cluster state is changed, for example, when the allocation properties are changed, the cluster state will be persisted by using the gateway module. When the cluster is started up, its state will be loaded using the gateway module and applied.

One should remember that when configuring different nodes and different gateway types, indices will use the gateway type configuration present on the given node. If an index state should not be stored using the gateway module, one should explicitly set the index gateway type to none.

The gateway recovery process

To be explicit, the recovery process is what Elasticsearch uses to load the data stored through the gateway module so that Elasticsearch can work. Whenever a full cluster restart occurs, the gateway process kicks in to load all the relevant information we've mentioned: the metadata, the mappings, and of course, all the indices. When the recovery process starts, the primary shards are initialized first, and then, depending on the replica state, the replicas are initialized using the gateway data, or the data is copied from the primary shards if the replicas are out of sync.

Elasticsearch allows us to configure when the cluster data should be recovered using the gateway module. We can tell Elasticsearch to wait for a certain number of master eligible or data nodes to be present in the cluster before starting the recovery process. However, one should remember that until the cluster is recovered, no operations on it are allowed. This is done in order to avoid modification conflicts.

Configuration properties

Before we continue with the configuration, we would like to say one more thing.
As you know, Elasticsearch nodes can play different roles: they can be data nodes (the ones that hold data), they can have a master role, or they can be used only for request handling, which means not holding data and not being master eligible. Remembering all this, let's now look at the gateway configuration properties that we are allowed to modify:

  • gateway.recover_after_nodes: This is an integer number that specifies how many nodes should be present in the cluster for the recovery to happen. For example, when set to 5, at least 5 nodes (it doesn't matter whether they are data or master eligible nodes) must be present for the recovery process to start.
  • gateway.recover_after_data_nodes: This is an integer number that allows us to set how many data nodes should be present in the cluster for the recovery process to start.
  • gateway.recover_after_master_nodes: This is another gateway configuration option that allows us to set how many master eligible nodes should be present in the cluster for the recovery to start.
  • gateway.recover_after_time: This allows us to set how much time to wait before the recovery process starts after the conditions defined by the preceding properties are met. If we set this property to 5m, we tell Elasticsearch to start the recovery process 5 minutes after all the defined conditions are met. The default value for this property is 5m, starting from Elasticsearch 1.3.0.

Let's imagine that we have six nodes in our cluster, out of which four are data eligible. We also have an index that is built of three shards, which are spread across the cluster. The last two nodes are master eligible and they don't hold the data. What we would like to configure is the recovery process to be delayed for 3 minutes after the four data nodes are present.
Our gateway configuration could look like this:

    gateway.recover_after_data_nodes: 4
    gateway.recover_after_time: 3m

Expectations on nodes

In addition to the already mentioned properties, we can also specify properties that will force the recovery process of Elasticsearch. These properties are:

  • gateway.expected_nodes: This is the number of nodes expected to be present in the cluster for the recovery to start immediately. If you don't need the recovery to be delayed, it is advised that you set this property to the number of nodes (or at least most of them) from which the cluster will be formed, because that will guarantee that the latest cluster state will be recovered.
  • gateway.expected_data_nodes: This is the number of data eligible nodes expected to be present in the cluster for the recovery process to start immediately.
  • gateway.expected_master_nodes: This is the number of master eligible nodes expected to be present in the cluster for the recovery process to start immediately.

Now, let's get back to our previous example. We know that when all six nodes are connected and are in the cluster, we want the recovery to start. So, in addition to the preceding configuration, we would add the following property:

    gateway.expected_nodes: 6

So the whole configuration would look like this:

    gateway.recover_after_data_nodes: 4
    gateway.recover_after_time: 3m
    gateway.expected_nodes: 6

The preceding configuration says that the recovery process will be delayed for 3 minutes once four data nodes join the cluster and will begin immediately after six nodes are in the cluster (it doesn't matter whether they are data nodes or master eligible nodes).

The local gateway

With the release of Elasticsearch 0.20 (and some of the releases from 0.19 versions), all the gateway types, apart from the default local gateway type, were deprecated. It is advised that you do not use them, because they will be removed in future versions of Elasticsearch.
They have not been removed yet, but if you want to avoid a full reindexing of your data, you should only use the local gateway type, and this is why we won't discuss all the other types.

The local gateway type uses the local storage available on a node to store the metadata, mappings, and indices. In order to use this gateway type and the local storage available on the node, there needs to be enough disk space to hold the data with no memory caching.

The persistence to the local gateway is different from the other gateways that are currently present (but deprecated). The writes to this gateway are done in a synchronous manner in order to ensure that no data will be lost during the write process.

In order to set the type of gateway that should be used, one should use the gateway.type property, which is set to local by default.

There is one additional thing regarding the local gateway of Elasticsearch that we didn't talk about: dangling indices. When a node joins a cluster, all the shards and indices that are present on the node, but are not present in the cluster, will be included in the cluster state. Such indices are called dangling indices, and we are allowed to choose how Elasticsearch should treat them.

Elasticsearch exposes the gateway.local.auto_import_dangling property, which can take the value of yes (the default value, which results in importing all dangling indices into the cluster), close (results in importing the dangling indices into the cluster state but keeps them closed by default), and no (results in removing the dangling indices). When setting the gateway.local.auto_import_dangling property to no, we can also set the gateway.local.dangling_timeout property (defaults to 2h) to specify how long Elasticsearch will wait before deleting the dangling indices. The dangling indices feature can be nice when we restart old Elasticsearch nodes, and we don't want old indices to be included in the cluster.
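
The gateway settings from the six-node example earlier in this section can be read as a small decision rule. The following Python sketch only illustrates that rule as the text describes it; it is not Elasticsearch's actual implementation. The GatewaySettings class and the recovery_delay function are hypothetical names introduced here, and the sketch deliberately ignores the other recover_after_* and expected_* properties.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GatewaySettings:
    """Hypothetical container mirroring the three properties used in the example."""
    recover_after_data_nodes: int = 4    # gateway.recover_after_data_nodes: 4
    recover_after_time_s: float = 180.0  # gateway.recover_after_time: 3m
    expected_nodes: int = 6              # gateway.expected_nodes: 6


def recovery_delay(settings: GatewaySettings, total_nodes: int,
                   data_nodes: int) -> Optional[float]:
    """Return the seconds to wait before recovery starts, or None if it cannot start yet."""
    # Too few data nodes have joined: recovery is not allowed to start at all.
    if data_nodes < settings.recover_after_data_nodes:
        return None
    # Every expected node is present: recovery starts immediately.
    if total_nodes >= settings.expected_nodes:
        return 0.0
    # The minimum is met but the cluster is incomplete: wait for recover_after_time.
    return settings.recover_after_time_s


settings = GatewaySettings()
print(recovery_delay(settings, total_nodes=3, data_nodes=3))  # None
print(recovery_delay(settings, total_nodes=5, data_nodes=4))  # 180.0
print(recovery_delay(settings, total_nodes=6, data_nodes=4))  # 0.0
```

With these values, three data nodes are not enough to start recovery, four data nodes in a five-node cluster start it after the 3 minute delay, and the full set of six nodes starts it immediately.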
Low-level recovery configuration

We discussed that we can use the gateway to configure the behavior of the Elasticsearch recovery process, but in addition to that, Elasticsearch allows us to configure the recovery process itself. We decided that the section dedicated to gateway and recovery is a good place to mention the properties we can use.

Cluster-level recovery configuration

The recovery configuration is specified mostly on the cluster level and allows us to set general rules for the recovery module to work with. These settings are:

  • indices.recovery.concurrent_streams: This defaults to 3 and specifies the number of concurrent streams that are allowed to be opened in order to recover a shard from its source. The higher the value of this property, the more pressure will be put on the networking layer; however, the recovery may be faster, depending on your network usage and throughput.
  • indices.recovery.max_bytes_per_sec: By default, this is set to 20MB and specifies the maximum amount of data that can be transferred per second during shard recovery. In order to disable data transfer limiting, one should set this property to 0. Similar to the number of concurrent streams, this property allows us to control the network usage of the recovery process. Setting this property to higher values may result in higher network utilization and a faster recovery process.
  • indices.recovery.compress: This is set to true by default and allows us to define whether Elasticsearch should compress the data that is transferred during the recovery process. Setting this to false may lower the pressure on the CPU, but it will also result in more data being transferred over the network.
  • indices.recovery.file_chunk_size: This is the chunk size used to copy the shard data from the source shard. By default, it is set to 512KB and is compressed if the indices.recovery.compress property is set to true.
  • indices.recovery.translog_ops: This defaults to 1000 and specifies how many transaction log lines should be transferred between shards in a single request during the recovery process.
  • indices.recovery.translog_size: This is the chunk size used to copy the shard transaction log data from the source shard. By default, it is set to 512KB and is compressed if the indices.recovery.compress property is set to true.

In the versions prior to Elasticsearch 0.90.0, there was the indices.recovery.max_size_per_sec property that could be used, but it was deprecated, and it is suggested that you use the indices.recovery.max_bytes_per_sec property instead. However, if you are using an Elasticsearch version older than 0.90.0, it may be worth remembering this.

All the previously mentioned settings can be updated using the Cluster Update API, or they can be set in the elasticsearch.yml file.

Index-level recovery settings

In addition to the values mentioned previously, there is a single property that can be set on a per-index basis. The property can be set both in the elasticsearch.yml file and using the indices Update Settings API, and it is called index.recovery.initial_shards. In general, Elasticsearch will only recover a particular shard when there is a quorum of shards present and if that quorum can be allocated. A quorum is 50 percent of the shards for the given index plus one. By using the index.recovery.initial_shards property, we can change what Elasticsearch will take as a quorum. This property can be set to one of the following values:

  • quorum: 50 percent of the shards for a given index, plus one, need to be present and be allocable. This is the default value.
  • quorum-1: 50 percent of the shards for a given index need to be present and be allocable.
  • full: All of the shards for the given index need to be present and be allocable.
  • full-1: 100 percent minus one of the shards for the given index need to be present and be allocable.
  • integer value: Any integer such as 1, 2, or 5 specifies the number of shards that need to be present and be allocable. For example, setting this value to 2 means that at least two shards need to be present and allocable.

It is good to know about this property, but the default value will be sufficient for most deployments.

Summary

In this article, we focused more on the Elasticsearch configuration and new features that were introduced in Elasticsearch 1.0. We configured discovery and recovery, and we used the human-friendly Cat API. In addition to that, we used the backup and restore functionality, which allowed easy backup and recovery of our indices. Finally, we looked at what federated search is and how to search and index data to multiple clusters, while still using all the functionalities of Elasticsearch and being connected to a single node.

If you want to dig deeper, buy the book Mastering Elasticsearch, Second Edition, which walks through using Elasticsearch in a simple step-by-step fashion.

Packt
03 Mar 2015
11 min read

MapReduce functions

In this article by John Zablocki, author of the book Couchbase Essentials, you will be introduced to MapReduce and how you'll use it to create secondary indexes for your documents.

At its simplest, MapReduce is a programming pattern used to process large amounts of data that is typically distributed across several nodes in parallel. In the NoSQL world, MapReduce implementations may be found on many platforms from MongoDB to Hadoop, and of course, Couchbase.

Even if you're new to the NoSQL landscape, it's quite possible that you've already worked with a form of MapReduce. The inspiration for MapReduce in distributed NoSQL systems was drawn from the functional programming concepts of map and reduce. While purely functional programming languages haven't quite reached mainstream status, languages such as Python, C#, and JavaScript all support map and reduce operations.

Map functions

Consider the following Python snippet:

    numbers = [1, 2, 3, 4, 5]
    doubled = list(map(lambda n: n * 2, numbers))
    # doubled == [2, 4, 6, 8, 10]

These two lines of code demonstrate a very simple use of a map() function. In the first line, the numbers variable is created as a list of integers. The second line applies a function to the list to create a new mapped list (the list() call is needed in Python 3, where map() returns an iterator). In this case, the function passed to map() is a Python lambda, which is just an inline, unnamed function. The body of the lambda multiplies each number by two.

This map() function can be made slightly more complex by doubling only odd numbers, as shown in this code:

    numbers = [1, 2, 3, 4, 5]

    def double_odd(num):
        if num % 2 == 0:
            return num
        else:
            return num * 2

    doubled = list(map(double_odd, numbers))
    # doubled == [2, 2, 6, 4, 10]

Map functions are implemented differently in each language or platform that supports them, but all follow the same pattern. An iterable collection of objects is passed to a map function.
Each item of the collection is then iterated over, with the map function being applied in each iteration. The final result is a new collection where each of the original items is transformed by the map.

Reduce functions

Like maps, reduce functions also work by applying a provided function to an iterable data structure. The key difference between the two is that the reduce function works to produce a single value from the input iterable. Using Python's reduce() function (built in to Python 2, imported from functools in Python 3), we can see how to produce a sum of integers, as follows:

    from functools import reduce

    numbers = [1, 2, 3, 4, 5]
    total = reduce(lambda x, y: x + y, numbers)
    # total == 15

You probably noticed that unlike our map operation, the reduce lambda has two parameters (x and y in this case). The argument passed to x will be the accumulated value of all applications of the function so far, and y will receive the next value to be added to the accumulation. Parenthetically, the order of operations can be seen as ((((1 + 2) + 3) + 4) + 5). Alternatively, the steps are shown in the following list:

    x = 1, y = 2
    x = 3, y = 3
    x = 6, y = 4
    x = 10, y = 5
    x = 15

As this list demonstrates, the value of x is the cumulative sum of previous x and y values. As such, reduce functions are sometimes termed accumulate or fold functions. Regardless of their name, reduce functions serve the common purpose of combining pieces of a recursive data structure to produce a single value.

Couchbase MapReduce

Creating an index (or view) in Couchbase requires creating a map function written in JavaScript. When the view is created for the first time, the map function is applied to each document in the bucket containing the view. When you update a view, only new or modified documents are indexed. This behavior is known as incremental MapReduce.

You can think of a basic map function in Couchbase as being similar to a SQL CREATE INDEX statement. Effectively, you are defining a column or a set of columns, to be indexed by the server.
Of course, these are not columns, but rather properties of the documents to be indexed.

Basic mapping

To illustrate the process of creating a view, first imagine that we have a set of JSON documents as shown here:

    var books = [
        { "id": 1, "title": "The Bourne Identity", "author": "Robert Ludlow" },
        { "id": 2, "title": "The Godfather", "author": "Mario Puzzo" },
        { "id": 3, "title": "Wiseguy", "author": "Nicholas Pileggi" }
    ];

Each document contains title and author properties. In Couchbase, to query these documents by either title or author, we'd first need to write a map function. Without considering how map functions are written in Couchbase, we're able to understand the process with vanilla JavaScript:

    books.map(function(book) {
        return book.author;
    });

In the preceding snippet, we're making use of the built-in JavaScript array's map() function. Similar to the Python snippets we saw earlier, JavaScript's map() function takes a function as a parameter and returns a new array with mapped objects. In this case, we'll have an array with each book's author, as follows:

    ["Robert Ludlow", "Mario Puzzo", "Nicholas Pileggi"]

At this point, we have a mapped collection that will be the basis for our author index. However, we haven't provided a means for the index to be able to refer back to its original document. If we were using a relational database, we'd have effectively created an index on the Author column with no way to get back to the row that contained it. With a slight modification to our map function, we are able to provide the key (the id property) of the document as well in our index:

    books.map(function(book) {
        return [book.author, book.id];
    });

In this slightly modified version, we're including the ID with the output of each author. In this way, the index has its document's key stored with its author:

    [["Robert Ludlow", 1], ["Mario Puzzo", 2], ["Nicholas Pileggi", 3]]

We'll soon see how this structure more closely resembles the values stored in a Couchbase index.

Basic reducing

Not every Couchbase index requires a reduce component. In fact, we'll see that Couchbase already comes with built-in reduce functions that will provide you with most of the reduce behavior you need. However, before relying on only those functions, it's important to understand why you'd use a reduce function in the first place. Returning to the preceding example of the map, let's imagine we have a few more documents in our set, as follows:

    var books = [
        { "id": 1, "title": "The Bourne Identity", "author": "Robert Ludlow" },
        { "id": 2, "title": "The Bourne Ultimatum", "author": "Robert Ludlow" },
        { "id": 3, "title": "The Godfather", "author": "Mario Puzzo" },
        { "id": 4, "title": "The Bourne Supremacy", "author": "Robert Ludlow" },
        { "id": 5, "title": "The Family", "author": "Mario Puzzo" },
        { "id": 6, "title": "Wiseguy", "author": "Nicholas Pileggi" }
    ];

We'll still create our index using the same map function because it provides a way of accessing a book by its author. Now imagine that we want to know how many books an author has written, or (assuming we had more data) the average number of pages written by an author. These questions are not possible to answer with a map function alone. Each application of the map function knows nothing about the previous application. In other words, there is no way for you to compare or accumulate information about one author's book to another book by the same author.

Fortunately, there is a solution to this problem. As you've probably guessed, it's the use of a reduce function. As a somewhat contrived example, consider this JavaScript:

    var mapped = books.map(function (book) {
        return [book.id, book.author];
    });

    // Accumulate a count per author; the accumulator is returned on every step.
    var counts = mapped.reduce(function (acc, cur) {
        var key = cur[1]; // the author
        if (!acc[key]) acc[key] = 0;
        ++acc[key];
        return acc;
    }, {});

This code doesn't quite accurately reflect the way you would count books with Couchbase, but it illustrates the basic idea. You look for each occurrence of a key (author) and increment a counter when it is found. With Couchbase MapReduce, the mapped structure is supplied to the reduce() function in a better format. You won't need to keep track of items in a dictionary.

Couchbase views

At this point, you should have a general sense of what MapReduce is, where it came from, and how it will affect the creation of a Couchbase Server view. So without further ado, let's see how to write our first Couchbase view. There were two sample buckets to choose from; the one we'll use is beer-sample. If you didn't install it, don't worry. You can add it by opening the Couchbase Console and navigating to the Settings tab, where you'll find the option to install the bucket.

First, you need to understand the document structures with which you're working. The following JSON object is a beer document (abbreviated):

    {
        "name": "Sundog",
        "type": "beer",
        "brewery_id": "new_holland_brewing_company",
        "description": "Sundog is an amber ale...",
        "style": "American-Style Amber/Red Ale",
        "category": "North American Ale"
    }

As you can see, the beer documents have several properties. We're going to create an index to let us query these documents by name. In SQL, the query would look like this:

    SELECT Id FROM Beers WHERE Name = ?

You might be wondering why the SQL example includes only the Id column in its projection. For now, just know that to query a document using a view with Couchbase, the property by which you're querying must be included in an index. To create that index, we'll write a map function. The simplest example of a map function to query beer documents by name is as follows:

    function(doc) {
        emit(doc.name);
    }

The body of this map function has only one line.
It calls the built-in Couchbase emit() function. This function is used to signal that a value should be indexed. The output of this map function will be an array of names.

The beer-sample bucket includes brewery data as well. These documents look like the following code (abbreviated):

    {
        "name": "Thomas Hooker Brewing",
        "city": "Bloomfield",
        "state": "Connecticut",
        "website": "http://www.hookerbeer.com/",
        "type": "brewery"
    }

If we reexamine our map function, we'll see an obvious problem: both the brewery and beer documents have a name property. When this map function is applied to the documents in the bucket, it will create an index that mixes brewery and beer documents. The problem is that Couchbase documents exist in a single container, the bucket. There is no namespace for a set of related documents. The solution has typically involved including a type or docType property on each document. The value of this property is used to distinguish one kind of document from another. In the case of the beer-sample database, beer documents have type = "beer" and brewery documents have type = "brewery". Therefore, we are easily able to modify our map function to create an index only on beer documents:

    function(doc) {
        if (doc.type == "beer") {
            emit(doc.name);
        }
    }

The emit() function actually takes two arguments. The first, as we've seen, is the value to be indexed. The second argument is an optional value and is used by the reduce function.

Imagine that we want to count the number of beer types in a particular category. In SQL, we would write the following query:

    SELECT Category, COUNT(*) FROM Beers GROUP BY Category

To achieve the same functionality with Couchbase Server, we'll need to use both map and reduce functions. First, let's write the map.
It will create an index on the category property:

    function(doc) {
        if (doc.type == "beer") {
            emit(doc.category, 1);
        }
    }

The only real difference between our category index and our name index is that we're including an argument for the value parameter of the emit() function. What we'll do with those values is simply count them. This counting will be done in our reduce function:

    function(keys, values) {
        return values.length;
    }

In this example, the values parameter will be given to the reduce function as a list of all values associated with a particular key. In our case, for each beer category, there will be a list of ones (that is, [1, 1, 1, 1, 1, 1]). Couchbase also provides a built-in _count function. It can be used in place of the entire reduce function in the preceding example.

Now that we've seen the basic requirements when creating an actual Couchbase view, it's time to add a view to our bucket. The easiest way to do so is to use the Couchbase Console.

Summary

In this article, you learned the purpose of secondary indexes in a key/value store. We dug deep into MapReduce, both in terms of its history in functional languages and as a tool for NoSQL and big data systems.

Resources for Article:

Further resources on this subject:
    Map Reduce? [article]
    Introduction to Mapreduce [article]
    Working with Apps Splunk [article]
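The category count view described in this article can be imitated outside Couchbase to see how map, emit, and reduce fit together. The following is a minimal Python sketch, not Couchbase code: run_view, map_doc, and reduce_count are hypothetical helpers written for illustration, and apart from Sundog and Thomas Hooker Brewing the documents are made-up stand-ins. Grouping rows by key before reducing is an assumption about how the view engine hands values to the reduce function, based on the description above.

```python
from collections import defaultdict

# Abbreviated stand-ins for beer-sample documents. The two "Hypothetical"
# beers are invented for this sketch and do not come from the real bucket.
docs = [
    {"type": "beer", "name": "Sundog", "category": "North American Ale"},
    {"type": "beer", "name": "Hypothetical Lager", "category": "North American Lager"},
    {"type": "beer", "name": "Hypothetical Amber", "category": "North American Ale"},
    {"type": "brewery", "name": "Thomas Hooker Brewing"},
]


def map_doc(doc, emit):
    """Mirror of the JavaScript map: index beers by category with value 1."""
    if doc.get("type") == "beer":
        emit(doc["category"], 1)


def reduce_count(key, values):
    """Mirror of the JavaScript reduce: count the values emitted for one key."""
    return len(values)


def run_view(documents, map_fn, reduce_fn):
    """Apply map_fn to every document, group emitted rows by key, then reduce."""
    grouped = defaultdict(list)
    for doc in documents:
        map_fn(doc, lambda key, value: grouped[key].append(value))
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


counts = run_view(docs, map_doc, reduce_count)
print(counts)  # {'North American Ale': 2, 'North American Lager': 1}
```

The brewery document is skipped by the type check, which is the same filtering the JavaScript map function performs.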

Packt
03 Mar 2015
13 min read

Performance Considerations

In this article by Dayong Du, the author of Apache Hive Essentials, we will look at the different performance considerations when using Hive.

Although Hive is built to deal with big data, we still cannot ignore the importance of performance. Most of the time, a Hive query can rely on the query optimizer to find the best execution strategy, and on the default settings and best practices shipped in vendor packages. However, as experienced users, we should learn more about the theory and practice of performance tuning in Hive, especially when working in a performance-sensitive project or environment. We will start with the utilities available in Hive for finding potential issues that cause poor performance. Then, we introduce best practices for performance in the areas of queries and jobs.

(For more resources related to this topic, see here.)

Performance utilities

Hive provides the EXPLAIN and ANALYZE statements that can be used as utilities to check and identify the performance of queries.

The EXPLAIN statement

Hive provides an EXPLAIN command to return a query execution plan without running the query. We can use an EXPLAIN command for queries if we have a doubt or a concern about performance. The EXPLAIN command also helps to see the difference between two or more queries written for the same purpose. The syntax for EXPLAIN is as follows:

    EXPLAIN [EXTENDED|DEPENDENCY|AUTHORIZATION] hive_query

The following keywords can be used:

    EXTENDED: This provides additional information for the operators in the plan, such as file pathname and abstract syntax tree.
    DEPENDENCY: This provides a JSON format output that contains a list of tables and partitions that the query depends on. It is available since Hive 0.10.0.
    AUTHORIZATION: This lists all entities that need to be authorized, including input and output, to run the Hive query, and authorization failures, if any. It is available since Hive 0.14.0.

A typical query plan contains the following three sections.
We will also have a look at an example later:

    Abstract syntax tree (AST): Hive uses a parser generator called ANTLR (see http://www.antlr.org/) to automatically generate a tree of syntax for HQL. We can ignore this most of the time.
    Stage dependencies: This lists all dependencies and the number of stages used to run the query.
    Stage plans: It contains important information, such as operators and sort orders, for running the job.

The following is what a typical query plan looks like. From the following example, we can see that the AST section is not shown since the EXTENDED keyword is not used with EXPLAIN. In the STAGE DEPENDENCIES section, both Stage-0 and Stage-1 are independent root stages. In the STAGE PLANS section, Stage-1 has one map and one reduce, referred to by Map Operator Tree and Reduce Operator Tree. Inside each Map/Reduce Operator Tree section, all operators corresponding to Hive query keywords as well as expressions and aggregations are listed. The Stage-0 stage does not have map and reduce. It is just a Fetch operation.

    jdbc:hive2://> EXPLAIN SELECT sex_age.sex, count(*)
    . . . . . . .> FROM employee_partitioned
    . . . . . . .> WHERE year=2014 GROUP BY sex_age.sex LIMIT 2;
    +-----------------------------------------------------------------------------+
    | Explain                                                                     |
    +-----------------------------------------------------------------------------+
    | STAGE DEPENDENCIES:
    |   Stage-1 is a root stage
    |   Stage-0 is a root stage
    |
    | STAGE PLANS:
    |   Stage: Stage-1
    |     Map Reduce
    |       Map Operator Tree:
    |           TableScan
    |             alias: employee_partitioned
    |             Statistics: Num rows: 0 Data size: 227 Basic stats:PARTIAL
    |               Column stats: NONE
    |             Select Operator
    |               expressions: sex_age (type: struct<sex:string,age:int>)
    |               outputColumnNames: sex_age
    |               Statistics: Num rows: 0 Data size: 227 Basic stats:PARTIAL
    |                 Column stats: NONE
    |               Group By Operator
    |                 aggregations: count()
    |                 keys: sex_age.sex (type: string)
    |                 mode: hash
    |                 outputColumnNames: _col0, _col1
    |                 Statistics: Num rows: 0 Data size: 227 Basic stats:PARTIAL
    |                   Column stats: NONE
    |                 Reduce Output Operator
    |                   key expressions: _col0 (type: string)
    |                   sort order: +
    |                   Map-reduce partition columns: _col0 (type: string)
    |                   Statistics: Num rows: 0 Data size: 227 Basic stats:PARTIAL
    |                     Column stats: NONE
    |                   value expressions: _col1 (type: bigint)
    |       Reduce Operator Tree:
    |         Group By Operator
    |           aggregations: count(VALUE._col0)
    |           keys: KEY._col0 (type: string)
    |           mode: mergepartial
    |           outputColumnNames: _col0, _col1
    |           Statistics: Num rows: 0 Data size: 0 Basic stats: NONE
    |             Column stats: NONE
    |           Select Operator
    |             expressions: _col0 (type: string), _col1 (type: bigint)
    |             outputColumnNames: _col0, _col1
    |             Statistics: Num rows: 0 Data size: 0 Basic stats: NONE
    |               Column stats: NONE
    |             Limit
    |               Number of rows: 2
    |               Statistics: Num rows: 0 Data size: 0 Basic stats: NONE
    |                 Column stats: NONE
    |               File Output Operator
    |                 compressed: false
    |                 Statistics: Num rows: 0 Data size: 0 Basic stats: NONE
    |                   Column stats: NONE
    |                 table:
    |                     input format: org.apache.hadoop.mapred.TextInputFormat
    |                     output format:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
    |                     serde:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
    |
    |   Stage: Stage-0
    |     Fetch Operator
    |       limit: 2
    +-----------------------------------------------------------------------------+
    53 rows selected (0.26 seconds)

The ANALYZE statement

Hive statistics are a collection of data that describe more details, such as the number of rows, number of files, and raw data size, on the objects in the Hive database. Statistics are metadata about Hive data. Hive supports statistics at the table, partition, and column level. These statistics serve as an input to the Hive Cost-Based Optimizer (CBO), which is an optimizer that picks the query plan with the lowest cost in terms of the system resources required to complete the query.

Since Hive 0.10.0, statistics are gathered through the ANALYZE statement on tables, partitions, and columns, as given in the following examples:

    jdbc:hive2://> ANALYZE TABLE employee COMPUTE STATISTICS;
    No rows affected (27.979 seconds)
    jdbc:hive2://> ANALYZE TABLE employee_partitioned
    . . . . . . .> PARTITION(year=2014, month=12) COMPUTE STATISTICS;
    No rows affected (45.054 seconds)
    jdbc:hive2://> ANALYZE TABLE employee_id COMPUTE STATISTICS
    . . . . . . .> FOR COLUMNS employee_id;
    No rows affected (41.074 seconds)

Once the statistics are built, we can check them with the DESCRIBE EXTENDED/FORMATTED statement. From the table/partition output, we can find the statistics information inside the parameters, such as parameters:{numFiles=1, COLUMN_STATS_ACCURATE=true, transient_lastDdlTime=1417726247, numRows=4, totalSize=227, rawDataSize=223}. The following is an example:

    jdbc:hive2://> DESCRIBE EXTENDED employee_partitioned
    . . . . . . .> PARTITION(year=2014, month=12);
    jdbc:hive2://> DESCRIBE EXTENDED employee;
    …parameters:{numFiles=1, COLUMN_STATS_ACCURATE=true, transient_lastDdlTime=1417726247, numRows=4, totalSize=227, rawDataSize=223})
    jdbc:hive2://> DESCRIBE FORMATTED employee.name;
    +--------+---------+---+---+---------+--------------+-----------+-----------+
    |col_name|data_type|min|max|num_nulls|distinct_count|avg_col_len|max_col_len|
    +--------+---------+---+---+---------+--------------+-----------+-----------+
    | name   | string  |   |   | 0       | 5            | 5.6       | 7         |
    +--------+---------+---+---+---------+--------------+-----------+-----------+
    +---------+----------+-----------------+
    |num_trues|num_falses| comment         |
    +---------+----------+-----------------+
    |         |          |from deserializer|
    +---------+----------+-----------------+
    3 rows selected (0.116 seconds)

Hive statistics are persisted in the metastore to avoid computing them every time. For newly created tables and/or partitions, statistics are computed automatically when the following setting is enabled:

    jdbc:hive2://> SET hive.stats.autogather=true;

Hive logs

Logs provide useful information to find out how a Hive query/job runs. By checking the Hive logs, we can identify runtime problems and issues that may cause bad performance. There are two types of logs available in Hive: system log and job log.

The system log contains the Hive running status and issues. It is configured in {HIVE_HOME}/conf/hive-log4j.properties. The following three lines for the Hive log can be found there:

    hive.root.logger=WARN,DRFA
    hive.log.dir=/tmp/${user.name}
    hive.log.file=hive.log

To change the logging level, we can either modify the preceding lines in hive-log4j.properties (applies to all users) or set it from the Hive CLI (only applies to the current user and current session) as follows:

    hive --hiveconf hive.root.logger=DEBUG,console

The job log contains Hive query information and is saved at the same place, /tmp/${user.name}, by default as one file for each Hive user session.
We can override it in hive-site.xml with the hive.querylog.location property. If a Hive query generates MapReduce jobs, those logs can also be viewed through the Hadoop JobTracker Web UI.

Job and query optimization

Job and query optimization covers the experience and skills needed to improve performance in the areas of job-running mode, JVM reuse, parallel job execution, and query optimization in JOIN.

Local mode

Hadoop can run in standalone, pseudo-distributed, and fully distributed mode. Most of the time, we need to configure Hadoop to run in fully distributed mode. When the data to process is small, starting distributed data processing is pure overhead, since launching a job in fully distributed mode takes more time than processing the data. Since Hive 0.7.0, Hive supports automatic conversion of a job to run in local mode with the following settings:

    jdbc:hive2://> SET hive.exec.mode.local.auto=true; --default false
    jdbc:hive2://> SET hive.exec.mode.local.auto.inputbytes.max=50000000;
    jdbc:hive2://> SET hive.exec.mode.local.auto.input.files.max=5; --default 4

A job must satisfy the following conditions to run in the local mode:

    The total input size of the job is lower than hive.exec.mode.local.auto.inputbytes.max
    The total number of map tasks is less than hive.exec.mode.local.auto.input.files.max
    The total number of reduce tasks required is 1 or 0

JVM reuse

By default, Hadoop launches a new JVM for each map or reduce task and runs the tasks in parallel. When a map or reduce task is lightweight and runs only for a few seconds, the JVM startup process could be a significant overhead. The MapReduce framework (version 1 only, not YARN) has an option to reuse a JVM by sharing it to run mappers/reducers serially instead of in parallel. JVM reuse applies to map or reduce tasks in the same job. Tasks from different jobs will always run in a separate JVM.
To enable the reuse, we can set the maximum number of tasks of a single job that may share a JVM using the mapred.job.reuse.jvm.num.tasks property. Its default value is 1:

    jdbc:hive2://> SET mapred.job.reuse.jvm.num.tasks=5;

We can also set the value to -1 to indicate that all the tasks for a job will run in the same JVM.

Parallel execution

Hive queries are commonly translated into a number of stages that are executed in sequence by default. These stages are not always dependent on each other. Independent stages can instead run in parallel to reduce the overall job running time. We can enable this feature with the following settings:

    jdbc:hive2://> SET hive.exec.parallel=true; -- default false
    jdbc:hive2://> SET hive.exec.parallel.thread.number=16; -- default 8, it defines the max number for running in parallel

Parallel execution will increase the cluster utilization. If the utilization of a cluster is already very high, parallel execution will not help much in terms of overall performance.

Join optimization

Here, we'll briefly review the key settings for join improvement.

Common join

The common join is also called the reduce side join. It is the basic join in Hive and works most of the time. For common joins, we need to make sure the big table is on the right-most side or is specified by a hint, as follows: /*+ STREAMTABLE(stream_table_name) */.

Map join

Map join is used when one of the join tables is small enough to fit in memory, so it is very fast but limited. Since Hive 0.7.0, Hive can convert to a map join automatically with the following settings:

    jdbc:hive2://> SET hive.auto.convert.join=true; --default false
    jdbc:hive2://> SET hive.mapjoin.smalltable.filesize=600000000; --default 25M
    jdbc:hive2://> SET hive.auto.convert.join.noconditionaltask=true; --default false. Set to true so that map join hint is not needed
    jdbc:hive2://> SET hive.auto.convert.join.noconditionaltask.size=10000000; --The default value controls the size of table to fit in memory

Once autoconvert is enabled, Hive automatically compares the file size of the smaller table with the value specified by hive.mapjoin.smalltable.filesize. If the file is bigger than this value, Hive uses a common join. If the file size is smaller than this threshold, Hive tries to convert the common join into a map join. Once autoconvert join is enabled, there is no need to provide map join hints in the query.

Bucket map join

Bucket map join is a special type of map join applied on bucket tables. To enable bucket map join, we need to enable the following settings:

    jdbc:hive2://> SET hive.auto.convert.join=true; --default false
    jdbc:hive2://> SET hive.optimize.bucketmapjoin=true; --default false

In a bucket map join, all the join tables must be bucket tables and must join on the bucket columns. In addition, the number of buckets in the bigger tables must be a multiple of the number of buckets in the small tables.

Sort merge bucket (SMB) join

SMB is the join performed on bucket tables that have the same sorted, bucket, and join condition columns. It reads data from both bucket tables and performs common joins (map and reduce triggered) on the bucket tables. We need to enable the following properties to use SMB:

    jdbc:hive2://> SET hive.input.format=
    . . . . . . .> org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
    jdbc:hive2://> SET hive.auto.convert.sortmerge.join=true;
    jdbc:hive2://> SET hive.optimize.bucketmapjoin=true;
    jdbc:hive2://> SET hive.optimize.bucketmapjoin.sortedmerge=true;
    jdbc:hive2://> SET hive.auto.convert.sortmerge.join.noconditionaltask=true;

Sort merge bucket map (SMBM) join

SMBM join is a special bucket join that triggers a map-side join only. It can avoid caching all rows in the memory like map join does.
To perform SMBM joins, the join tables must have the same bucket, sort, and join condition columns. To enable such joins, we need to enable the following settings:

    jdbc:hive2://> SET hive.auto.convert.join=true;
    jdbc:hive2://> SET hive.auto.convert.sortmerge.join=true;
    jdbc:hive2://> SET hive.optimize.bucketmapjoin=true;
    jdbc:hive2://> SET hive.optimize.bucketmapjoin.sortedmerge=true;
    jdbc:hive2://> SET hive.auto.convert.sortmerge.join.noconditionaltask=true;
    jdbc:hive2://> SET hive.auto.convert.sortmerge.join.bigtable.selection.policy=org.apache.hadoop.hive.ql.optimizer.TableSizeBasedBigTableSelectorForAutoSMJ;

Skew join

When working with data that has a highly uneven distribution, data skew can happen in such a way that a small number of compute nodes must handle the bulk of the computation. The following settings inform Hive to optimize properly if data skew happens:

    jdbc:hive2://> SET hive.optimize.skewjoin=true; --If there is data skew in join, set it to true. Default is false.
    jdbc:hive2://> SET hive.skewjoin.key=100000; --This is the default value. If the number of rows for a key is bigger than this, the new keys will be sent to the other unused reducers.

Skewed data can occur in GROUP BY operations too. To optimize it, we need the following setting to enable skew data optimization in the GROUP BY result:

    SET hive.groupby.skewindata=true;

Once configured, Hive will first trigger an additional MapReduce job whose map output is randomly distributed to the reducers to avoid data skew.

For more information about Hive join optimization, please refer to the Apache Hive wiki available at https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization and https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization.

Summary

In this article, we first covered how to identify performance bottlenecks using the EXPLAIN and ANALYZE statements. Then, we discussed job and query optimization in Hive.
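As a quick recap of the local mode rules discussed in this article, the three conditions can be written as a single check. The following Python sketch is only an illustration of those stated conditions and is not how Hive itself is implemented: can_run_in_local_mode is a hypothetical helper, and its default thresholds simply copy the example SET values used earlier (50000000 bytes and 5 files).

```python
def can_run_in_local_mode(input_bytes, map_tasks, reduce_tasks,
                          max_input_bytes=50000000, max_input_files=5):
    """Return True when all three local-mode conditions from the article hold.

    Hypothetical helper for illustration only; it is not part of Hive.
    max_input_bytes stands in for hive.exec.mode.local.auto.inputbytes.max and
    max_input_files for hive.exec.mode.local.auto.input.files.max.
    """
    small_enough = input_bytes < max_input_bytes   # total input size is below the limit
    few_map_tasks = map_tasks < max_input_files    # number of map tasks is below the limit
    few_reducers = reduce_tasks in (0, 1)          # at most one reduce task is required
    return small_enough and few_map_tasks and few_reducers


# A small job: 10 MB of input, 2 map tasks, 1 reducer.
print(can_run_in_local_mode(10000000, 2, 1))   # True
# Same input, but three reducers are required.
print(can_run_in_local_mode(10000000, 2, 3))   # False
```

A job that fails any one of the three checks stays in distributed mode even when hive.exec.mode.local.auto is set to true.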
Resources for Article:

Further resources on this subject:
    Apache Maven and m2eclipse [Article]
    Apache Karaf – Provisioning and Clusters [Article]
    Introduction to Apache ZooKeeper [Article]
Packt
03 Mar 2015
14 min read

SciPy for Signal Processing

In this article by Sergio J. Rojas G. and Erik A Christensen, authors of the book Learning SciPy for Numerical and Scientific Computing - Second Edition, we will focus on the usage of some of the most commonly used routines in three SciPy modules: scipy.signal, scipy.ndimage, and scipy.fftpack, which are used for signal processing, multidimensional image processing, and computing Fourier transforms, respectively.

We define a signal as data that measures either a time-varying or spatially varying phenomenon. Sound or electrocardiograms are excellent examples of time-varying quantities, while images embody the quintessential spatially varying cases. Moving images are, naturally, treated with the techniques of both types of signals.

The field of signal processing treats four aspects of this kind of data: its acquisition, quality improvement, compression, and feature extraction. SciPy has many routines to handle tasks effectively in any of the four fields. All these are included in two low-level modules (scipy.signal being the main module, with an emphasis on time-varying data, and scipy.ndimage, for images). Many of the routines in these two modules are based on the Discrete Fourier Transform of the data. SciPy has an extensive package of applications and definitions of these background algorithms, scipy.fftpack, which we will cover first.

(For more resources related to this topic, see here.)

Discrete Fourier Transforms

The Discrete Fourier Transform (DFT from now on) transforms any signal from its time/space domain into a related signal in the frequency domain. This allows us not only to analyze the different frequencies of the data, but also to perform faster filtering operations, when used properly. It is possible to turn a signal in the frequency domain back to its time/spatial domain, thanks to the Inverse Fourier Transform. We will not go into detail of the mathematics behind these operators, since we assume familiarity at some level with this theory.
We will focus on syntax and applications instead.

The basic routines in the scipy.fftpack module compute the DFT and its inverse, for discrete signals in any dimension: fft and ifft (one dimension), fft2 and ifft2 (two dimensions), and fftn and ifftn (any number of dimensions). All of these routines assume that the data is complex valued. If we know beforehand that a particular dataset is actually real valued, and should offer real-valued frequencies, we use rfft and irfft instead, for a faster algorithm. All these routines are designed so that composition with their inverses always yields the identity. The syntax is the same in all cases, as follows:

    fft(x[, n, axis, overwrite_x])

The first parameter, x, is always the signal in any array-like form. Note that fft performs one-dimensional transforms. This means in particular that if x happens to be two-dimensional, for example, fft will output another two-dimensional array where each row is the transform of each row of the original. We can change it to columns instead, with the optional parameter, axis. The rest of the parameters are also optional; n indicates the length of the transform, and overwrite_x allows the original data to be overwritten to save memory and resources. We usually play with the integer n when we need to pad the signal with zeros, or truncate it. For higher dimensions, n is substituted by shape (a tuple), and axis by axes (another tuple).

To better understand the output, it is often useful to shift the zero frequencies to the center of the output arrays with fftshift. The inverse of this operation, ifftshift, is also included in the module.
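The parameters described above can be tried on a tiny array before moving to images. The following is a minimal sketch, assuming NumPy and SciPy are installed; the signal values and variable names are arbitrary choices made for illustration and carry no special meaning.

```python
import numpy as np
from scipy.fftpack import fft, ifft, fftshift, ifftshift

# A short real-valued test signal (values chosen only for illustration).
x = np.array([1.0, 2.0, 0.0, -1.0])

# Composing fft with its inverse recovers the signal, up to rounding error.
roundtrip = ifft(fft(x))
recovered = np.allclose(roundtrip, x)

# n=8 pads the signal with zeros before transforming; n=2 truncates it.
padded = fft(x, n=8)
truncated = fft(x, n=2)

# For a two-dimensional input, fft transforms each row by default;
# axis=0 transforms each column instead.
grid = np.arange(12.0).reshape(3, 4)
by_rows = fft(grid)
by_cols = fft(grid, axis=0)

# fftshift moves the zero-frequency term to the center; ifftshift undoes it.
shifted = fftshift(padded)
restored = ifftshift(shifted)

print(recovered, padded.shape, truncated.shape)
```

Here the zero-frequency term padded[0] equals the sum of the signal, and after fftshift it sits at index 4 of the eight-element output.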
The following code shows some of these routines in action, when applied to a checkerboard image:

    >>> import numpy
    >>> from scipy.fftpack import fft, fft2, fftshift
    >>> import matplotlib.pyplot as plt
    >>> B=numpy.ones((4,4)); W=numpy.zeros((4,4))
    >>> signal = numpy.bmat("B,W;W,B")
    >>> onedimfft = fft(signal,n=16)
    >>> twodimfft = fft2(signal,shape=(16,16))
    >>> plt.figure()
    >>> plt.gray()
    >>> plt.subplot(121,aspect='equal')
    >>> plt.pcolormesh(onedimfft.real)
    >>> plt.colorbar(orientation='horizontal')
    >>> plt.subplot(122,aspect='equal')
    >>> plt.pcolormesh(fftshift(twodimfft.real))
    >>> plt.colorbar(orientation='horizontal')
    >>> plt.show()

Note how the first four rows of the one-dimensional transform are equal (and so are the last four), while the two-dimensional transform (once shifted) presents a peak at the origin, and nice symmetries in the frequency domain. In the figure produced by the preceding code, the left-hand side image is the fft and the right-hand side image is the fft2 of a 2 x 2 checkerboard signal.

The scipy.fftpack module also offers the Discrete Cosine Transform with its inverse (dct, idct) as well as many differential and pseudo-differential operators defined in terms of all these transforms: diff (for derivative/integral), hilbert and ihilbert (for the Hilbert transform), tilbert and itilbert (for the h-Tilbert transform of periodic sequences), and so on.

Signal construction

To aid in the construction of signals with predetermined properties, the scipy.signal module has a nice collection of the most frequent one-dimensional waveforms in the literature: chirp and sweep_poly (for the frequency-swept cosine generator), gausspulse (a Gaussian modulated sinusoid), and sawtooth and square (for the waveforms with those names). They all take as their main parameter a one-dimensional ndarray representing the times at which the signal is to be evaluated. Other parameters control the design of the signal, according to frequency or time constraints.
Let's take a look at the following code snippet, which illustrates the use of the one-dimensional waveforms that we just discussed:

>>> import numpy
>>> from scipy.signal import chirp, sawtooth, square, gausspulse
>>> import matplotlib.pyplot as plt
>>> t=numpy.linspace(-1,1,1000)
>>> plt.subplot(221); plt.ylim([-2,2])
>>> plt.plot(t,chirp(t,f0=100,t1=0.5,f1=200))   # plot a chirp
>>> plt.subplot(222); plt.ylim([-2,2])
>>> plt.plot(t,gausspulse(t,fc=10,bw=0.5))      # Gauss pulse
>>> plt.subplot(223); plt.ylim([-2,2])
>>> t*=3*numpy.pi
>>> plt.plot(t,sawtooth(t))                     # sawtooth
>>> plt.subplot(224); plt.ylim([-2,2])
>>> plt.plot(t,square(t))                       # Square wave
>>> plt.show()

Generated by this code, the following diagram shows waveforms for chirp (upper-left), gausspulse (upper-right), sawtooth (lower-left), and square (lower-right):

The usual method of creating signals is to import them from a file. This is possible by using purely NumPy routines, for example fromfile:

fromfile(file, dtype=float, count=-1, sep='')

The file argument may point to either a file or a string, the count argument determines the number of items to read, and sep indicates what constitutes a separator in the original file/string. For images, we have the versatile routine imread, in either the scipy.ndimage or scipy.misc module:

imread(fname, flatten=False)

The fname argument is a string containing the location of an image. The routine infers the type of file and reads the data into an array accordingly. If the flatten argument is set to True, the image is converted to gray scale. Note that, in order to work, the Python Imaging Library (PIL) needs to be installed.

It is also possible to load .wav files for analysis, with the read and write routines from the wavfile submodule in the scipy.io module.
For instance, given any audio file with this format, say audio.wav, the command rate,data = scipy.io.wavfile.read("audio.wav") assigns an integer value to the rate variable, indicating the sample rate of the file (in samples per second), and a NumPy ndarray to the data variable, containing the numerical values of the audio samples. If we wish to write some one-dimensional ndarray data into an audio file of this kind, with the sample rate given by the rate variable, we may do so by issuing the following command:

>>> scipy.io.wavfile.write("filename.wav",rate,data)

Filters

A filter is an operation on signals that either removes features or extracts some component. SciPy has a very complete set of known filters, as well as the tools to allow construction of new ones. The complete list of filters in SciPy is long, and we encourage the reader to explore the help documents of the scipy.signal and scipy.ndimage modules for the complete picture. In these pages we introduce some of the most used filters in audio and image processing.

We start by creating a signal worth filtering:

>>> from numpy import sin, cos, pi, linspace
>>> f=lambda t: (cos(pi*t) + 0.2*sin(5*pi*t+0.1) + 0.2*sin(30*pi*t)
...              + 0.1*sin(32*pi*t+0.1) + 0.1*sin(47*pi*t+0.8))
>>> t=linspace(0,4,400); signal=f(t)

We first test the classical smoothing filter of Wiener and Kolmogorov, wiener. We present in a plot the original signal (in black) and the corresponding filtered data, with a Wiener window of size 55 samples (in blue).
Next, we compare the result of applying the median filter, medfilt, with a kernel of the same size as before (in red):

>>> from scipy.signal import wiener, medfilt
>>> import matplotlib.pyplot as plt
>>> plt.plot(t,signal,'k')
>>> plt.plot(t,wiener(signal,mysize=55),'b',linewidth=3)
>>> plt.plot(t,medfilt(signal,kernel_size=55),'r',linewidth=3)
>>> plt.show()

This gives us the following graph showing the comparison of smoothing filters (wiener is the one that has its starting point just below 0.5 and medfilt has its starting point just above 0.5):

Most of the filters in the scipy.signal module can be adapted to work on arrays of any dimension. But in the particular case of images, we prefer to use the implementations in the scipy.ndimage module, since they are coded with these objects in mind. For instance, to perform a median filter on an image for smoothing, we use scipy.ndimage.median_filter. Let's see an example. We will start by loading Lena into an array and corrupting the image with Gaussian noise (zero mean and standard deviation of 16):

>>> from scipy.stats import norm     # Gaussian distribution
>>> import matplotlib.pyplot as plt
>>> import scipy.misc
>>> import scipy.ndimage
>>> plt.gray()
>>> lena=scipy.misc.lena().astype(float)
>>> plt.subplot(221);
>>> plt.imshow(lena)
>>> lena+=norm(loc=0,scale=16).rvs(lena.shape)
>>> plt.subplot(222);
>>> plt.imshow(lena)
>>> denoised_lena = scipy.ndimage.median_filter(lena,3)
>>> plt.subplot(224);
>>> plt.imshow(denoised_lena)

Filters for images come in two flavors: statistical and morphological. For example, among the filters of statistical nature, we have the Sobel algorithm, oriented to the detection of edges (singularities along curves). Its syntax is as follows:

sobel(image, axis=-1, output=None, mode='reflect', cval=0.0)

The optional parameter, axis, indicates the dimension in which the computations are performed. By default, this is always the last axis (-1).
The mode parameter, which is one of the strings 'reflect', 'constant', 'nearest', 'mirror', or 'wrap', indicates how to handle the border of the image, in case there is insufficient data to perform the computations there. If the mode is 'constant', we may indicate the value to use in the border with the cval parameter. Let's look at the following code snippet, which illustrates the use of the sobel filter:

>>> from scipy.ndimage.filters import sobel
>>> import numpy
>>> lena=scipy.misc.lena()
>>> sblX=sobel(lena,axis=0); sblY=sobel(lena,axis=1)
>>> sbl=numpy.hypot(sblX,sblY)
>>> plt.subplot(223);
>>> plt.imshow(sbl)
>>> plt.show()

The following screenshot illustrates Lena (upper-left) and noisy Lena (upper-right) with the preceding two filters in action: edge map with sobel (lower-left) and median filter (lower-right):

Morphology

We also have the possibility of creating and applying filters to images based on mathematical morphology, both to binary and gray-scale images. The four basic morphological operations are opening (binary_opening), closing (binary_closing), dilation (binary_dilation), and erosion (binary_erosion). Note that the syntax for each of these filters is very simple, since we only need two ingredients: the signal to filter and the structuring element to perform the morphological operation. Let's take a look at the general syntax for these morphological operations:

binary_operation(signal, structuring_element)

We may use combinations of these four basic morphological operations to create more complex filters for removal of holes, hit-or-miss transforms (to find the location of specific patterns in binary images), denoising, edge detection, and many more. The SciPy module also provides some of these common filters directly, using the preceding syntax.
For instance, to locate the letter e in a text, we could use the following command:

>>> binary_hit_or_miss(text, letterE)

For comparative purposes, let's use this command in the following code snippet:

>>> import numpy
>>> import scipy.ndimage
>>> import matplotlib.pyplot as plt
>>> from scipy.ndimage.morphology import binary_hit_or_miss
>>> text = scipy.ndimage.imread('CHAP_05_input_textImage.png')
>>> letterE = text[37:53,275:291]
>>> HitorMiss = binary_hit_or_miss(text, structure1=letterE, origin1=1)
>>> eLocation = numpy.where(HitorMiss==True)
>>> x=eLocation[1]; y=eLocation[0]
>>> plt.imshow(text, cmap=plt.cm.gray, interpolation='nearest')
>>> plt.autoscale(False)
>>> plt.plot(x,y,'wo',markersize=10)
>>> plt.axis('off')
>>> plt.show()

The output for the preceding lines of code is generated as follows:

For gray-scale images, we may use a structuring element (structuring_element) or a footprint. The syntax is, therefore, a little different:

grey_operation(signal, [structuring_element, footprint, size, ...])

If we desire to use a completely flat and rectangular structuring element (all ones), then it is enough to indicate the size as a tuple. For instance, to perform gray-scale dilation with a flat element of size (15,15) on our classical image of Lena, we issue the following command:

>>> grey_dilation(lena, size=(15,15))

The last kind of morphological operations coded in the scipy.ndimage module perform distance and feature transforms. Distance transforms create a map that assigns to each pixel the distance to the nearest object. Feature transforms provide the index of the closest background element instead. These operations are used to decompose images into different labels. We may even choose different metrics, such as Euclidean distance, chessboard distance, and taxicab distance.
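To make the effect of the metric concrete, here is a minimal sketch on a hypothetical 5 x 5 binary image with a single background pixel (value 0) in the centre. The image is an assumption made for illustration; the routines are the scipy.ndimage transforms, which measure, for every nonzero pixel, the distance to the nearest zero pixel.

```python
import numpy as np
from scipy import ndimage

# Hypothetical test image: all ones except one background pixel in the centre.
image = np.ones((5, 5), dtype=int)
image[2, 2] = 0

# Distance from every nonzero pixel to the nearest zero pixel, per metric.
euclid = ndimage.distance_transform_edt(image)
taxicab = ndimage.distance_transform_cdt(image, metric='taxicab')
chess = ndimage.distance_transform_cdt(image, metric='chessboard')

# Feature transform: for each pixel, the index of the closest background
# element. The result has shape (2, 5, 5): row indices, then column indices.
indices = ndimage.distance_transform_edt(
    image, return_distances=False, return_indices=True)

# The corner pixel (0, 0) is two steps away from the centre along each axis.
print(euclid[0, 0], taxicab[0, 0], chess[0, 0])
print(indices[:, 0, 0])
```

For the corner pixel the three metrics give sqrt(8), 4, and 2 respectively, and the feature transform points every pixel back to the centre, (2, 2), since that is the only background element.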
The syntax for the distance transform using a brute force algorithm, distance_transform_bf, is as follows:

distance_transform_bf(signal, metric='euclidean', sampling=None, return_distances=True, return_indices=False, distances=None, indices=None)

We indicate the metric with one of the strings 'euclidean', 'taxicab', or 'chessboard'. If we desire to provide the feature transform instead, we switch return_distances to False and return_indices to True.

Similar routines are available with more sophisticated algorithms: distance_transform_cdt (using chamfering, for taxicab and chessboard distances) and, for Euclidean distance, distance_transform_edt. All of these use the same syntax.

Summary

In this article, we explored signal processing in any number of dimensions, including the treatment of signals in frequency space by means of their Discrete Fourier Transforms, using the fftpack, signal, and ndimage modules.

Packt
03 Mar 2015
17 min read

Basics of Programming in Julia

In this article by Ivo Balbaert, author of the book Getting Started with Julia Programming, we will explore how Julia interacts with the outside world, reading from standard input and writing to standard output, files, networks, and databases. Julia provides asynchronous networking I/O using the libuv library. We will see how to handle data in Julia. We will also discover the parallel processing model of Julia.

In this article, the following topics are covered:

  • Working with files (including the CSV files)
  • Using DataFrames

Working with files

To work with files, we need the IOStream type. IOStream is a type with the supertype IO and has the following characteristics:

  • The fields are given by names(IOStream):
    4-element Array{Symbol,1}:  :handle  :ios  :name  :mark
  • The types are given by IOStream.types:
    (Ptr{None}, Array{Uint8,1}, String, Int64)

The file handle is a pointer of the type Ptr, which is a reference to the file object.

Opening and reading a line-oriented file with the name example.dat is very easy:

    # code in Chapter 8\io.jl
    fname = "example.dat"
    f1 = open(fname)

fname is a string that contains the path to the file, escaping special characters with \ when necessary; for example, in Windows, when the file is in the test folder on the D: drive, this would become d:\\test\\example.dat. The f1 variable is now an IOStream(<file example.dat>) object.

To read all lines one after the other into an array, use data = readlines(f1), which returns 3-element Array{Union(ASCIIString,UTF8String),1}:

    "this is line 1.\r\n"
    "this is line 2.\r\n"
    "this is line 3."

For processing line by line, now only a simple loop is needed:

    for line in data
      println(line) # or process line
    end
    close(f1)

Always close the IOStream object to clean and save resources. If you want to read the file into one string, use readall.
Use this only for relatively small files because of the memory consumption; this can also be a potential problem when using readlines.

There is a convenient shorthand with the do syntax for opening a file, applying a function process, and closing it automatically. This goes as follows (file is the IOStream object in this code):

    open(fname) do file
        process(file)
    end

The do block creates an anonymous function and passes it to open. Thus, the previous code example is equivalent to open(process, fname). Use the same syntax for processing a file fname line by line without the memory overhead of the previous methods, for example:

    open(fname) do file
        for line in eachline(file)
            print(line) # or process line
        end
    end

Writing a file requires first opening it with a "w" flag, then writing strings to it with write, print, or println, and then closing the file handle, which flushes the IOStream object to the disk:

    fname = "example2.dat"
    f2 = open(fname, "w")
    write(f2, "I write myself to a file\n") # returns 25 (bytes written)
    println(f2, "even with println!")
    close(f2)

Opening a file with the "w" option will clear the file if it exists. To append to an existing file, use "a".

To process all the files in the current folder (or a given folder as an argument to readdir()), use this for loop:

    for file in readdir()
      # process file
    end

Reading and writing CSV files

A CSV file is a comma-separated file. The data fields in each line are separated by commas "," or another delimiter such as semicolons ";". These files are the de-facto standard for exchanging small and medium amounts of tabular data. Such files are structured so that one line contains data about one data object, so we need a way to read and process the file line by line. As an example, we will use the data file Chapter 8\winequality.csv, which contains 1,599 sample measurements with 12 data columns per sample, such as pH and alcohol, separated by semicolons.
In the following screenshot, you can see the top 20 rows:

In general, the readdlm function is used to read in the data from the CSV files:

    # code in Chapter 8\csv_files.jl:
    fname = "winequality.csv"
    data = readdlm(fname, ';')

The second argument is the delimiter character (here, it is ;). The resulting data is a 1600x12 Array{Any,2}, with element type Any because no common type could be found:

    "fixed acidity"   "volatile acidity"   "alcohol"   "quality"
     7.4               0.7                  9.4         5.0
     7.8               0.88                 9.8         5.0
     7.8               0.76                 9.8         5.0
     …

If the data file is comma separated, reading it is even simpler with the following command:

    data2 = readcsv(fname)

The problem with what we have done until now is that the headers (the column titles) were read as part of the data. Fortunately, we can pass the argument header=true to let Julia put the first line in a separate array. The data array then naturally gets the correct datatype, Float64. We can also specify the type explicitly, such as this:

    data3 = readdlm(fname, ';', Float64, '\n', header=true)

The third argument here is the type of data, which is a numeric type, String, or Any. The next argument is the line separator character, and the fifth indicates whether or not there is a header line with the field (column) names. If so, then data3 is a tuple with the data as the first element and the header as the second; in our case, (1599x12 Array{Float64,2}, 1x12 Array{String,2}). (readdlm accepts other optional arguments; see its help.) In this case, the actual data is given by data3[1] and the header by data3[2].

Let's continue working with the variable data. The data forms a matrix, and we can get the rows and columns of data using the normal array-matrix syntax.
For example, the third row is given by row3 = data[3, :], with data 7.8  0.88  0.0  2.6  0.098  25.0  67.0  0.9968  3.2  0.68  9.8  5.0, representing the measurements for all the characteristics of a certain wine. The measurements of a certain characteristic for all wines are given by a data column; for example, col3 = data[:, 3] represents the measurements of citric acid and returns a column vector, 1600-element Array{Any,1}: "citric acid"  0.0  0.0  0.04  0.56  0.0  0.0  …  0.08  0.08  0.1  0.13  0.12  0.47.

If we need columns 2-4 (volatile acidity to residual sugar) for all wines, extract the data with x = data[:, 2:4]. If we need these measurements only for the wines on rows 70-75, get these with y = data[70:75, 2:4], returning a 6 x 3 Array{Any,2} output as follows:

    0.32   0.57  2.0
    0.705  0.05  1.9
    …
    0.675  0.26  2.1

To get a matrix with the data from columns 3, 6, and 11, execute the following command:

    z = [data[:,3] data[:,6] data[:,11]]

It would be useful to create a type Wine in the code. For example, if the data is to be passed around functions, it will improve the code quality to encapsulate all the data in a single data type, like this:

    type Wine
        fixed_acidity::Array{Float64}
        volatile_acidity::Array{Float64}
        citric_acid::Array{Float64}
        # other fields
        quality::Array{Float64}
    end

Then, we can create objects of this type to work with them, like in any other object-oriented language, for example, wine1 = Wine(data[1, :]...), where the elements of the row are splatted with the ... operator into the Wine constructor.

To write to a CSV file, the simplest way is to use the writecsv function for a comma separator, or the writedlm function if you want to specify another separator. For example, to write an array data to a file partial.dat, you need to execute the following command:

    writedlm("partial.dat", data, ';')

If more control is necessary, you can easily combine the more basic functions from the previous section.
For example, the following code snippet writes 10 tuples of three numbers each to a file:

    # code in Chapter 8\tuple_csv.jl
    fname = "savetuple.csv"
    csvfile = open(fname,"w")
    # writing headers:
    write(csvfile, "ColName A, ColName B, ColName C\n")
    for i = 1:10
      tup = tuple(rand(Float64,3)...)
      write(csvfile, join(tup,","), "\n")
    end
    close(csvfile)

Using DataFrames

If you measure n variables (each of a different type) of a single object of observation, then you get a table with n columns for each object row. If there are m observations, then we have m rows of data. For example, given the student grades as data, you might want to compute the average grade for each socioeconomic group, where grade and socioeconomic group are both columns in the table, and there is one row per student.

The DataFrame is the most natural representation to work with such an (m x n) table of data. DataFrames are similar to pandas DataFrames in Python or data.frame in R. A DataFrame is a more specialized tool than a normal array for working with tabular and statistical data, and it is defined in the DataFrames package, a popular Julia library for statistical work. Install it in your environment by typing Pkg.add("DataFrames") in the REPL. Then, import it into your current workspace with using DataFrames. Do the same for the packages DataArrays and RDatasets (which contains a collection of example datasets mostly used in the R literature).

A common case in statistical data is that data values can be missing (the information is not known). The DataArrays package provides us with the unique value NA, which represents a missing value and has the type NAtype. The result of computations that contain NA values mostly cannot be determined; for example, 42 + NA returns NA. (Julia v0.4 also has a new Nullable{T} type, which allows you to specify the type of a missing value.)
A DataArray{T} array is a data structure that can be n-dimensional, behaves like a standard Julia array, and can contain values of the type T, but it can also contain the missing (Not Available) values NA and can work efficiently with them. To construct them, use the @data macro:

    # code in Chapter 8\dataarrays.jl
    using DataArrays
    using DataFrames
    dv = @data([7, 3, NA, 5, 42])

This returns 5-element DataArray{Int64,1}: 7  3  NA  5  42.

The sum of these numbers is given by sum(dv) and returns NA. One can also assign NA values to the array with dv[5] = NA; then, dv becomes [7, 3, NA, 5, NA]. Converting this data structure to a normal array fails: convert(Array, dv) returns ERROR: NAException.

How do we get rid of these NA values, supposing we can do so safely? We can use the dropna function; for example, sum(dropna(dv)) returns 15. If you know that you can replace them with a value v, use the array function:

    repl = -1
    sum(array(dv, repl)) # returns 13

A DataFrame is a kind of in-memory database, versatile in the ways you can work with the data. It consists of columns with names such as Col1, Col2, Col3, and so on. Each of these columns is a DataArray with its own type, and the data it contains can be referred to by the column name as well, so we have substantially more forms of indexing. Unlike two-dimensional arrays, columns in a DataFrame can be of different types. One column might, for instance, contain the names of students and should therefore be a string. Another column could contain their age and should be an integer.

We construct a DataFrame from the program data as follows:

    # code in Chapter 8\dataframes.jl
    using DataFrames
    # constructing a DataFrame:
    df = DataFrame()
    df[:Col1] = 1:4
    df[:Col2] = [e, pi, sqrt(2), 42]
    df[:Col3] = [true, false, true, false]
    show(df)

Notice that the column headers are used as symbols.
This returns the following 4 x 3 DataFrame object:

We could also have used the full constructor as follows:

    df = DataFrame(Col1 = 1:4, Col2 = [e, pi, sqrt(2), 42],
                   Col3 = [true, false, true, false])

You can refer to the columns either by an index (the column number) or by a name; both of the following expressions return the same output:

    show(df[2])
    show(df[:Col2])

This gives the following output:

    [2.718281828459045, 3.141592653589793, 1.4142135623730951,42.0]

To show the rows or subsets of rows and columns, use the familiar slice (:) syntax, for example:

  • To get the first row, execute df[1, :]. This returns 1x3 DataFrame.

    | Row | Col1 | Col2    | Col3 |
    |-----|------|---------|------|
    | 1   | 1    | 2.71828 | true |

  • To get the second and third row, execute df[2:3, :]
  • To get only the second column from the previous result, execute df[2:3, :Col2]. This returns [3.141592653589793, 1.4142135623730951].
  • To get the second and third column from the second and third row, execute df[2:3, [:Col2, :Col3]], which returns the following output:

    2x2 DataFrame
    | Row | Col2    | Col3  |
    |-----|---------|-------|
    | 1   | 3.14159 | false |
    | 2   | 1.41421 | true  |

The following functions are very useful when working with DataFrames:

  • The head(df) and tail(df) functions show you the first six and the last six lines of data respectively.
  • The names function gives the names of the columns: names(df). It returns 3-element Array{Symbol,1}:  :Col1  :Col2  :Col3.
  • The eltypes function gives the data types of the columns: eltypes(df). It gives the output as 3-element Array{Type{T<:Top},1}:  Int64  Float64  Bool.
  • The describe function tries to give some useful summary information about the data in the columns, depending on the type. For example, describe(df) gives for column 2 (which is numeric) the min, max, median, mean, number, and percentage of NAs:

    Col2
    Min      1.4142135623730951
    1st Qu.  2.392264761937558
    Median   2.929937241024419
    Mean     12.318522011105483
    3rd Qu.  12.856194490192344
    Max      42.0
    NAs      0
    NA%      0.0%

To load data from a local CSV file, use the method readtable. The returned object is of type DataFrame:

    # code in Chapter 8\dataframes.jl
    using DataFrames
    fname = "winequality.csv"
    data = readtable(fname, separator = ';')
    typeof(data) # DataFrame
    size(data) # (1599,12)

Here is a fraction of the output:

The readtable method also supports reading in gzipped CSV files.

Writing a DataFrame to a file can be done with the writetable function, which takes the filename and the DataFrame as arguments, for example, writetable("dataframe1.csv", df). By default, writetable will use the delimiter specified by the filename extension and write the column names as headers.

Both readtable and writetable support numerous options for special cases. Refer to the docs for more information (refer to http://dataframesjl.readthedocs.org/en/latest/).

To demonstrate some of the power of DataFrames, here are some queries you can do:

  • Make a vector with only the quality information: data[:quality]
  • Give the wines with alcohol percentage equal to 9.5, for example, data[data[:alcohol] .== 9.5, :]

Here, we use the .== operator, which does element-wise comparison. data[:alcohol] .== 9.5 returns an array of Boolean values (true for datapoints where :alcohol is 9.5, and false otherwise). data[boolean_array, :] selects those rows where boolean_array is true.
  • Count the number of wines grouped by quality with by(data, :quality, data -> size(data, 1)), which returns the following:

    6x2 DataFrame
    | Row | quality | x1  |
    |-----|---------|-----|
    | 1   | 3       | 10  |
    | 2   | 4       | 53  |
    | 3   | 5       | 681 |
    | 4   | 6       | 638 |
    | 5   | 7       | 199 |
    | 6   | 8       | 18  |

The DataFrames package contains the by function, which takes in three arguments:

  • A DataFrame, here data
  • A column to split the DataFrame on, here quality
  • A function or an expression to apply to each subset of the DataFrame, here data -> size(data, 1), which gives us the number of wines for each quality value

Another easy way to get the distribution among quality is to execute the histogram function hist: hist(data[:quality]) gives the counts over the range of quality, (2.0:1.0:8.0,[10,53,681,638,199,18]). More precisely, this is a tuple with the first element corresponding to the edges of the histogram bins, and the second denoting the number of items in each bin. So there are, for example, 10 wines with quality between 2 and 3, and so on.

To extract the counts as a variable count of type Vector, we can execute _, count = hist(data[:quality]); the _ means that we ignore the first element of the tuple. To obtain the quality classes as a DataArray class, we will execute the following:

    class = sort(unique(data[:quality]))

We can now construct a df_quality DataFrame with the class and count columns as df_quality = DataFrame(qual=class, no=count). This gives the following output:

    6x2 DataFrame
    | Row | qual | no  |
    |-----|------|-----|
    | 1   | 3    | 10  |
    | 2   | 4    | 53  |
    | 3   | 5    | 681 |
    | 4   | 6    | 638 |
    | 5   | 7    | 199 |
    | 6   | 8    | 18  |

To deepen your understanding and learn about the other features of Julia DataFrames (such as joining, reshaping, and sorting), refer to the documentation available at http://dataframesjl.readthedocs.org/en/latest/.
Other file formats

Julia can work with other human-readable file formats through specialized packages:

  • For JSON, use the JSON package. The parse method converts the JSON strings into Dictionaries, and the json method turns any Julia object into a JSON string.
  • For XML, use the LightXML package
  • For YAML, use the YAML package
  • For HDF5 (a common format for scientific data), use the HDF5 package
  • For working with Windows INI files, use the IniFile package

Summary

In this article, we discussed how to work with files, CSV data, and DataFrames in Julia.