
How-To Tutorials

7019 Articles

Introduction to Scala

Packt
01 Nov 2016
8 min read
In this article by Diego Pacheco, the author of the book Building Applications with Scala, we will cover the following topics:

- Writing a Hello World program using the Scala REPL
- Scala language – the basics
- Scala variables – var and val
- Creating immutable variables

Scala Hello World using the REPL

Let's get started. Open your terminal and type `scala` to launch the Scala REPL. Once the REPL is open, you can just type "Hello World". By doing this, you are performing two operations: eval and print. The Scala REPL will create a variable called res0, store your string there, and then print the contents of the res0 variable.

Scala REPL Hello World program:

```
$ scala
Welcome to Scala 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_77).
Type in expressions for evaluation. Or try :help.

scala> "Hello World"
res0: String = Hello World

scala>
```

Scala is a hybrid language, which means it is both object-oriented (OO) and functional. You can create classes and objects in Scala. Next, we will create a complete Hello World application using classes.

Scala OO Hello World program:

```
$ scala
Welcome to Scala 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_77).
Type in expressions for evaluation. Or try :help.

scala> object HelloWorld {
     |   def main(args:Array[String]) = println("Hello World")
     | }
defined object HelloWorld

scala> HelloWorld.main(null)
Hello World

scala>
```

First things first: notice that we use the word object instead of class. The Scala language has different constructs compared with Java. An object is a singleton in Scala; it is the equivalent of coding the Singleton pattern in Java. Next, we see the keyword def, which is used in Scala to create functions. In this program, we create the main function just as we do in Java, and we call the built-in function println in order to print the String "Hello World".

Scala imports some Java objects and packages by default. Coding in Scala does not require you to type, for instance, System.out.println("Hello World"), but you can if you want to, as shown in the following:

```
$ scala
Welcome to Scala 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_77).
Type in expressions for evaluation. Or try :help.

scala> System.out.println("Hello World")
Hello World

scala>
```

We can, and we will, do better. Scala has some abstractions for a console application, so we can write this code with fewer lines of code. To accomplish this goal, we need to extend the Scala class App. When we extend App, we are using inheritance, and we don't need to define the main function. We can just put all the code in the body of the object, which is very convenient and makes the code clean and simple to read.

Scala HelloWorld App in the Scala REPL:

```
$ scala
Welcome to Scala 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_77).
Type in expressions for evaluation. Or try :help.

scala> object HelloWorld extends App {
     |   println("Hello World")
     | }
defined object HelloWorld

scala> HelloWorld
object HelloWorld

scala> HelloWorld.main(null)
Hello World

scala>
```

After coding the HelloWorld object in the Scala REPL, we can ask the REPL what HelloWorld is and, as you might expect, the REPL answers that HelloWorld is an object. This is a very convenient Scala way to write console applications, because we can have a Hello World application with just three lines of code. Sadly, the same program in Java requires far more code, as you will see in the next section.
Java is a great language for performance, but it is verbose compared with Scala.

Java Hello World application:

```
package scalabook.javacode.chap1;

public class HelloWorld {
    public static void main(String args[]){
        System.out.println("Hello World");
    }
}
```

The Java application required six lines of code, while in Scala we were able to do the same with 50% less code (three lines of code). This is a very simple application; when we are coding complex applications, the difference gets bigger, as a Scala application ends up with far less code than its Java equivalent. Remember that we use an object in Scala in order to have a singleton (a design pattern that makes sure you have just one instance of a class), and if we want to do the same in Java, the code would be something like this:

```
package scalabook.javacode.chap1;

public class HelloWorldSingleton {

    private HelloWorldSingleton(){}

    private static class SingletonHelper{
        private static final HelloWorldSingleton INSTANCE = new HelloWorldSingleton();
    }

    public static HelloWorldSingleton getInstance(){
        return SingletonHelper.INSTANCE;
    }

    public void sayHello(){
        System.out.println("Hello World");
    }

    public static void main(String[] args) {
        getInstance().sayHello();
    }
}
```

It's not just about the size of the code; it is about consistency and the language providing more abstractions for you. If you write less code, you will have fewer bugs in your software at the end of the day.

Scala language – the basics

Scala is a statically typed language with a very expressive type system, which enforces abstractions in a safe yet coherent manner. All values in Scala are Java objects (except primitives, which are unboxed at runtime) because, at the end of the day, Scala runs on the JVM. Scala enforces immutability as a core functional programming principle. This enforcement happens in multiple aspects of the language: for instance, when you create a variable, you do it in an immutable way, and when you use a collection, you use an immutable collection. Scala also lets you use mutable variables and mutable structures, but it favors immutable ones by design.

Scala variables – var and val

When you are coding in Scala, you create variables using either the var keyword or the val keyword. The var keyword allows you to create mutable state, which is fine as long as you keep it local, stick to the core functional programming principles, and avoid mutable shared state.

Using var in the Scala REPL:

```
$ scala
Welcome to Scala 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_77).
Type in expressions for evaluation. Or try :help.

scala> var x = 10
x: Int = 10

scala> x
res0: Int = 10

scala> x = 11
x: Int = 11

scala> x
res1: Int = 11

scala>
```

However, Scala has a more interesting construct called val. Using the val keyword makes your variables immutable, which means you can't change their values after you set them. If you try to change the value of a val variable, the compiler will give you an error. As a Scala developer, you should use val as much as possible, because that's a good functional programming mindset and it will make your programs better and more correct. In Scala, everything is an object; there are no primitives, and the var and val rules apply to everything, be it an Int, a String, or even a class.

Using val in the Scala REPL:
```
$ scala
Welcome to Scala 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_77).
Type in expressions for evaluation. Or try :help.

scala> val x = 10
x: Int = 10

scala> x
res0: Int = 10

scala> x = 11
<console>:12: error: reassignment to val
       x = 11
         ^

scala> x
res1: Int = 10

scala>
```

Creating immutable variables

Now let's see how we can define the most common types in Scala, such as Int, Double, Boolean, and String. Remember that you can create these variables using val or var, depending on your requirement.

Scala variable types in the Scala REPL:

```
$ scala
Welcome to Scala 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_77).
Type in expressions for evaluation. Or try :help.

scala> val x = 10
x: Int = 10

scala> val y = 11.1
y: Double = 11.1

scala> val b = true
b: Boolean = true

scala> val f = false
f: Boolean = false

scala> val s = "A Simple String"
s: String = A Simple String

scala>
```

For these variables, we did not define the type; the Scala compiler inferred it for us. However, it is possible to specify the type if you want. In Scala, the type comes after the name of the variable, as shown in the following section.

Scala variables with explicit typing in the Scala REPL:

```
$ scala
Welcome to Scala 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_77).
Type in expressions for evaluation. Or try :help.

scala> val x:Int = 10
x: Int = 10

scala> val y:Double = 11.1
y: Double = 11.1

scala> val s:String = "My String "
s: String = "My String "

scala> val b:Boolean = true
b: Boolean = true

scala>
```

Summary

In this article, we learned about some of the basic constructs and concepts of the Scala language, including the REPL, singleton objects, and mutable and immutable variables. Further resources on this subject: Making History with Event Sourcing, Creating Your First Plug-in, Content-based recommendation.


Container Linking and Docker DNS

Packt
01 Nov 2016
29 min read
In this article by Jon Langemak, the author of the book Docker Networking Cookbook, we will cover the following recipes:

- Verifying host-based DNS configuration inside a container
- Overriding the default name resolution settings
- Configuring links for name and service resolution
- Leveraging Docker DNS

Verifying host-based DNS configuration inside a container

You might not realize it, but Docker, by default, provides your containers with a means to do basic name resolution. Docker passes name resolution from the Docker host directly into the container. The result is that a spawned container can natively resolve anything that the Docker host itself can. The mechanics Docker uses to achieve name resolution in a container are elegantly simple. In this recipe, we'll walk through how this is done and how you can verify that it's working as expected.

Getting Ready

In this recipe we'll be demonstrating the configuration on a single Docker host. It is assumed that this host has Docker installed and that Docker is in its default configuration. We'll be altering name resolution settings on the host, so you'll need root-level access.

How to do it…

To start with, let's start a new container on our host docker1 and examine how the container handles name resolution:

```
user@docker1:~$ docker run -d -P --name=web8 jonlangemak/web_server_8_dns
d65baf205669c871d1216dc091edd1452a318b6522388e045c211344815c280a
user@docker1:~$
user@docker1:~$ docker exec web8 host www.google.com
www.google.com has address 216.58.216.196
www.google.com has IPv6 address 2607:f8b0:4009:80e::2004
user@docker1:~$
```

It would appear that the container has the ability to resolve DNS names. If we run the same test on our local Docker host, we should get similar results:

```
user@docker1:~$ host www.google.com
www.google.com has address 216.58.216.196
www.google.com has IPv6 address 2607:f8b0:4009:80e::2004
user@docker1:~$
```

In addition, just like our Docker host, the container can also resolve local DNS records associated with the local domain lab.lab:

```
user@docker1:~$ docker exec web8 host docker4
docker4.lab.lab has address 192.168.50.102
user@docker1:~$
```

You'll notice that we didn't need to specify a fully qualified domain name in order to resolve the host name docker4 in the domain lab.lab. At this point it's safe to assume that the container is receiving some sort of intelligent update from the Docker host which provides it relevant information about the local DNS configuration.

In case you don't know, the resolv.conf file is generally where you define a Linux system's name resolution parameters. In many cases it is altered automatically by configuration information in other places. However, regardless of how it's altered, it should always be the source of truth for how the system handles name resolution.

To see what the container is receiving, let's examine the container's resolv.conf file:

```
user@docker1:~$ docker exec -t web8 more /etc/resolv.conf
::::::::::::::
/etc/resolv.conf
::::::::::::::
# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
#     DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
nameserver 10.20.30.13
search lab.lab
user@docker1:~$
```

As you can see, the container has learned that the local DNS server is 10.20.30.13 and that the local DNS search domain is lab.lab. Where did it get this information? The solution is rather simple.
When a container starts, Docker generates instances of the following three files for each container spawned and saves them with the container configuration:

- /etc/hostname
- /etc/hosts
- /etc/resolv.conf

These files are stored as part of the container configuration and then mounted into the container. We can use the findmnt tool from within the container to examine the source of the mounts:

```
root@docker1:~# docker exec web8 findmnt -o SOURCE
…<Additional output removed for brevity>…
/dev/mapper/docker1--vg-root[/var/lib/docker/containers/c803f130b7a2450609672c23762bce3499dec9abcfdc540a43a7eb560adaf62a/resolv.conf]
/dev/mapper/docker1--vg-root[/var/lib/docker/containers/c803f130b7a2450609672c23762bce3499dec9abcfdc540a43a7eb560adaf62a/hostname]
/dev/mapper/docker1--vg-root[/var/lib/docker/containers/c803f130b7a2450609672c23762bce3499dec9abcfdc540a43a7eb560adaf62a/hosts]
root@docker1:~#
```

So while the container thinks it has local copies of the hostname, hosts, and resolv.conf files in its /etc/ directory, the real files are actually located in the container's configuration directory (/var/lib/docker/containers/) on the Docker host.

When you tell Docker to run a container, it does three things:

- It examines the Docker host's /etc/resolv.conf file and places a copy of it in the container's directory.
- It creates a hostname file in the container's directory and assigns the container a unique hostname.
- It creates a hosts file in the container's directory and adds relevant records, including localhost and a record referencing the host itself.

Each time the container is restarted, the container's resolv.conf file is updated based on the values found in the Docker host's resolv.conf file. This means that any changes made to the resolv.conf file are lost each time the container is restarted. The hostname and hosts configuration files are also rewritten each time the container is restarted, losing any changes made during the previous run. To validate the configuration files a given container is using, we can inspect the container's configuration for these variables:

```
user@docker1:~$ docker inspect web8 | grep HostsPath
"HostsPath": "/var/lib/docker/containers/c803f130b7a2450609672c23762bce3499dec9abcfdc540a43a7eb560adaf62a/hosts",
user@docker1:~$ docker inspect web8 | grep HostnamePath
"HostnamePath": "/var/lib/docker/containers/c803f130b7a2450609672c23762bce3499dec9abcfdc540a43a7eb560adaf62a/hostname",
user@docker1:~$ docker inspect web8 | grep ResolvConfPath
"ResolvConfPath": "/var/lib/docker/containers/c803f130b7a2450609672c23762bce3499dec9abcfdc540a43a7eb560adaf62a/resolv.conf",
user@docker1:~$
```

As expected, these are the same paths we saw when we ran the findmnt command from within the container itself. They represent the exact mount path of each file into the container's /etc/ directory.

Overriding the default name resolution settings

The method Docker uses for providing name resolution to containers works very well in most cases. However, there could be some instances where you want Docker to provide the containers with a DNS server other than the one the Docker host is configured to use. In these cases, Docker offers you a couple of options. You can tell the Docker service to provide a different DNS server for all of the containers the service spawns. You can also manually override this setting at container runtime by providing a DNS server as an option to the docker run subcommand.
In this recipe, we'll show you your options for changing the default name resolution behavior as well as how to verify that the settings worked.

Getting Ready

In this recipe we'll be demonstrating the configuration on a single Docker host. It is assumed that this host has Docker installed and that Docker is in its default configuration. We'll be altering name resolution settings on the host, so you'll need root-level access.

How to do it…

As we saw in the first recipe in this article, by default, Docker provides containers with the DNS server that the Docker host itself uses. This comes in the form of copying the host's resolv.conf file and providing it to each spawned container. Along with the name server setting, this file also includes definitions for DNS search domains. Both of these options can be configured at the service level to cover all spawned containers, as well as at the individual container level. For the purpose of comparison, let's start by examining the Docker host's DNS configuration:

```
root@docker1:~# more /etc/resolv.conf
# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
#     DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
nameserver 10.20.30.13
search lab.lab
root@docker1:~#
```

With this configuration, we would expect that any container spawned on this host would receive the same name server and DNS search domain. Let's spawn a container called web8 to verify this is working as expected:

```
root@docker1:~# docker run -d -P --name=web8 jonlangemak/web_server_8_dns
156bc29d28a98e2fbccffc1352ec390bdc8b9b40b84e4c5f58cbebed6fb63474
root@docker1:~#
root@docker1:~# docker exec -t web8 more /etc/resolv.conf
::::::::::::::
/etc/resolv.conf
::::::::::::::
# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
#     DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
nameserver 10.20.30.13
search lab.lab
```

As expected, the container receives the same configuration. Let's now inspect the container and see whether any DNS-related options are defined:

```
user@docker1:~$ docker inspect web8 | grep Dns
    "Dns": [],
    "DnsOptions": [],
    "DnsSearch": [],
user@docker1:~$
```

Since we're using the default configuration, there is no reason to configure anything specific within the container with regard to a DNS server or search domain. Each time the container starts, Docker will apply the settings from the host's resolv.conf file to the container's DNS configuration files. If we'd prefer to have Docker give containers a different DNS server or DNS search domain, we can do so through Docker options. In this case, the two we're interested in are:

- --dns=<DNS Server>: Specify a DNS server address that Docker should provide to the containers.
- --dns-search=<DNS Search Domain>: Specify a DNS search domain that Docker should provide to the containers.

Let's configure Docker to provide containers with a public DNS server (4.2.2.2) and a search domain of lab.external.
We can do so by passing the following options to the Docker daemon through its systemd drop-in file:

```
ExecStart=/usr/bin/dockerd --dns=4.2.2.2 --dns-search=lab.external
```

Once the options are configured, reload the systemd configuration, restart the service to load the new options, and restart our container web8:

```
user@docker1:~$ sudo systemctl daemon-reload
user@docker1:~$ sudo systemctl restart docker
user@docker1:~$ docker start web8
web8
user@docker1:~$ docker exec -t web8 more /etc/resolv.conf
search lab.external
nameserver 4.2.2.2
user@docker1:~$
```

You'll note that despite this container initially having the host's DNS server (10.20.30.13) and search domain (lab.lab), it now has the service-level DNS options we just specified. If you recall, when we inspected this container earlier it didn't define a specific DNS server or search domain. Since none was specified, Docker now uses the settings from the Docker options, which take priority.

While this provides some level of flexibility, it's not yet truly flexible. At this point any and all containers spawned on this server will be provided the same DNS server and search domain. To be truly flexible, we should be able to have Docker alter the name resolution configuration on a per-container level. As luck would have it, these options can also be provided directly at container runtime.

Docker uses the following priority when deciding which name resolution settings to apply to a container: settings defined at container runtime always take priority; if they aren't defined there, Docker then looks to see whether they are configured at the service level; and if they aren't there either, it falls back to the default method of relying on the Docker host's DNS settings.

For instance, we can launch a container called web8-2 and provide different options:

```
root@docker1:~# docker run -d --dns=8.8.8.8 --dns-search=lab.dmz -P --name=web8-2 jonlangemak/web_server_8_dns
1e46d66a47b89d541fa6b022a84d702974414925f5e2dd56eeb840c2aed4880f
root@docker1:~#
```

If we inspect the container, we'll see that we now have dns and dns-search fields defined as part of the container configuration:

```
root@docker1:~# docker inspect web8-2
…<output removed for brevity>…
        "Dns": [
            "8.8.8.8"
        ],
        "DnsOptions": [],
        "DnsSearch": [
            "lab.dmz"
        ],
…<output removed for brevity>…
root@docker1:~#
```

This ensures that if the container is restarted, it will still have the same DNS settings that were provided the first time the container was run. Let's make some slight changes to the Docker service to verify that the priority works as expected. Change the Docker options to look like this:

```
ExecStart=/usr/bin/dockerd --dns-search=lab.external
```

Now restart the service and run the following container:

```
user@docker1:~$ sudo systemctl daemon-reload
user@docker1:~$ sudo systemctl restart docker
root@docker1:~#
root@docker1:~# docker run -d -P --name=web8-3 jonlangemak/web_server_8_dns
5e380f8da17a410eaf41b772fde4e955d113d10e2794512cd20aa5e551d9b24c
root@docker1:~#
```

Since we didn't provide any DNS-related options at container runtime, the next place Docker checks is the service-level options. Our Docker service-level options include a DNS search domain of lab.external, so we'd expect the container to receive that search domain. However, since we don't have a DNS server defined there, Docker will need to fall back to the one configured on the Docker host itself.
Now examine the container's resolv.conf file to make sure things worked as expected:

```
user@docker1:~$ docker exec -t web8-3 more /etc/resolv.conf
search lab.external
nameserver 10.20.30.13
user@docker1:~$
```

Configuring links for name and service resolution

Container linking provides a means for one container to easily communicate with another container on the same host. As we've seen in previous examples, most container-to-container communication has occurred through IP addresses. Container linking improves on this by allowing linked containers to communicate with each other by name. In addition to providing basic name resolution, it also provides a means to see what services a linked container is providing. In this recipe we'll review how to create container links as well as discuss some of their limitations.

Getting Ready

In this recipe we'll be demonstrating the configuration on a single Docker host. It is assumed that this host has Docker installed and that Docker is in its default configuration. We'll be altering name resolution settings on the host, so you'll need root-level access.

How to do it…

The phrase container linking might imply to some that it involves some kind of network configuration or modification. In reality, container linking has very little to do with container networking. In the default mode, container linking provides a means for one container to resolve the name of another. For instance, let's start two containers on our lab host docker1:

```
root@docker1:~# docker run -d -P --name=web1 jonlangemak/web_server_1
88f9c862966874247c8e2ba90c18ac673828b5faac93ff08090adc070f6d2922
root@docker1:~# docker run -d -P --name=web2 --link=web1 jonlangemak/web_server_2
00066ea46367c07fc73f73bdcdff043bd4c2ac1d898f4354020cbcfefd408449
root@docker1:~#
```

Notice that when I started the second container, I used a new flag called --link and referenced the container web1. We would now say that web2 is linked to web1. However, they're not really linked in any physical way. A better description might be to say that web2 is now aware of web1. Let's connect to the container web2 to show you what I mean:

```
root@docker1:~# docker exec -it web2 /bin/bash
root@00066ea46367:/# ping web1 -c 2
PING web1 (172.17.0.2): 48 data bytes
56 bytes from 172.17.0.2: icmp_seq=0 ttl=64 time=0.163 ms
56 bytes from 172.17.0.2: icmp_seq=1 ttl=64 time=0.092 ms
--- web1 ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.092/0.128/0.163/0.036 ms
root@00066ea46367:/#
```

It appears that the web2 container is now able to resolve the container web1 by name. This is because the linking process inserted records into the web2 container's hosts file:

```
root@00066ea46367:/# more /etc/hosts
127.0.0.1      localhost
::1    localhost ip6-localhost ip6-loopback
fe00::0        ip6-localnet
ff00::0        ip6-mcastprefix
ff02::1        ip6-allnodes
ff02::2        ip6-allrouters
172.17.0.2     web1 88f9c8629668
172.17.0.3     00066ea46367
root@00066ea46367:/#
```

With this configuration, the web2 container can reach the web1 container either by the name we gave the container at runtime (web1) or by the unique hostname Docker generated for the container (88f9c8629668).
In addition to the hosts file being updated, web2 also receives some new environmental variables:

```
root@00066ea46367:/# printenv
WEB1_ENV_APACHE_LOG_DIR=/var/log/apache2
HOSTNAME=00066ea46367
APACHE_RUN_USER=www-data
WEB1_PORT_80_TCP=tcp://172.17.0.2:80
WEB1_PORT_80_TCP_PORT=80
LS_COLORS=
WEB1_PORT=tcp://172.17.0.2:80
WEB1_ENV_APACHE_RUN_GROUP=www-data
APACHE_LOG_DIR=/var/log/apache2
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
PWD=/
WEB1_PORT_80_TCP_PROTO=tcp
APACHE_RUN_GROUP=www-data
SHLVL=1
HOME=/root
WEB1_PORT_80_TCP_ADDR=172.17.0.2
WEB1_ENV_APACHE_RUN_USER=www-data
WEB1_NAME=/web2/web1
_=/usr/bin/printenv
root@00066ea46367:/#
```

You'll notice many new environmental variables. Docker will copy any environmental variables from the linked container that were defined as part of that container. This includes:

- Environmental variables described in the Docker image; more specifically, any ENV variables from the image's Dockerfile
- Environmental variables passed to the container at runtime through the --env or -e flag

In this case, these three variables were defined as ENV variables in the image's Dockerfile:

```
APACHE_RUN_USER=www-data
APACHE_RUN_GROUP=www-data
APACHE_LOG_DIR=/var/log/apache2
```

Since both container images have the same ENV variables defined, we'll see the local variables as well as the same environmental variables from the container web1 prefixed with WEB1_ENV_:

```
WEB1_ENV_APACHE_RUN_USER=www-data
WEB1_ENV_APACHE_RUN_GROUP=www-data
WEB1_ENV_APACHE_LOG_DIR=/var/log/apache2
```

In addition, Docker also created six other environmental variables that describe the web1 container as well as any of its exposed ports:

```
WEB1_PORT=tcp://172.17.0.2:80
WEB1_PORT_80_TCP=tcp://172.17.0.2:80
WEB1_PORT_80_TCP_ADDR=172.17.0.2
WEB1_PORT_80_TCP_PORT=80
WEB1_PORT_80_TCP_PROTO=tcp
WEB1_NAME=/web2/web1
```

Linking also allows you to specify aliases. For instance, let's stop, remove, and respawn container web2 using a slightly different syntax for linking:

```
user@docker1:~$ docker stop web2
web2
user@docker1:~$ docker rm web2
web2
user@docker1:~$ docker run -d -P --name=web2 --link=web1:webserver jonlangemak/web_server_2
e102fe52f8a08a02b01329605dcada3005208d9d63acea257b8d99b3ef78e71b
user@docker1:~$
```

Notice that after the link definition we inserted :webserver. The name after the colon represents the alias for the link. In this case, I've specified an alias of webserver for the container web1. If we examine the web2 container, we'll see that the alias is now also listed in the hosts file:

```
root@c258c7a0884d:/# more /etc/hosts
…<Additional output removed for brevity>…
172.17.0.2     webserver 88f9c8629668 web1
172.17.0.3     c258c7a0884d
root@c258c7a0884d:/#
```

Aliases also affect the environmental variables created during linking. Rather than using the container name, they instead use the alias:

```
user@docker1:~$ docker exec web2 printenv
…<Additional output removed for brevity>…
WEBSERVER_PORT_80_TCP_ADDR=172.17.0.2
WEBSERVER_PORT_80_TCP_PORT=80
WEBSERVER_PORT_80_TCP_PROTO=tcp
…<Additional output removed for brevity>…
user@docker1:~$
```
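Applications typically consume these link-generated variables rather than hard-coding addresses. As a quick illustration (this program is not part of the book's example images; it simply assumes it runs inside a container that was started with --link=web1), a Go client could locate the linked web server like this:

```go
package main

import (
	"fmt"
	"net/http"
	"os"
)

func main() {
	// Read the address and port that Docker injected for the linked
	// container (the WEB1_ prefix comes from the link name).
	addr := os.Getenv("WEB1_PORT_80_TCP_ADDR")
	port := os.Getenv("WEB1_PORT_80_TCP_PORT")

	resp, err := http.Get(fmt.Sprintf("http://%s:%s/", addr, port))
	if err != nil {
		fmt.Println("request failed:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()
	fmt.Println("linked web server responded with:", resp.Status)
}
```

If the link had been created with an alias, the same variables would appear with the alias as the prefix instead (for example, WEBSERVER_PORT_80_TCP_ADDR).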
At this point you might be wondering how dynamic this is. After all, Docker is providing this functionality by updating static files in each container. What happens if a container's IP address changes? For instance, let's stop the container web1 and start a new container called web3 using the same image:

```
user@docker1:~$ docker stop web1
web1
user@docker1:~$ docker run -d -P --name=web3 jonlangemak/web_server_1
69fa80be8b113a079e19ca05c8be9e18eec97b7bbb871b700da4482770482715
user@docker1:~$
```

If you'll recall from earlier, the container web1 had the IP address 172.17.0.2 allocated to it. Since I stopped the container, Docker releases that IP address reservation, making it available to be reassigned to the next container we start. Let's check the IP address assigned to the container web3:

```
user@docker1:~$ docker exec web3 ip addr show dev eth0
79: eth0@if80: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
    link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.2/16 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:acff:fe11:2/64 scope link
       valid_lft forever preferred_lft forever
user@docker1:~$
```

As expected, web3 took the now-free IP address 172.17.0.2 that previously belonged to the web1 container. We can also verify that the container web2 still believes that this IP address belongs to the web1 container:

```
user@docker1:~$ docker exec -t web2 more /etc/hosts | grep 172.17.0.2
172.17.0.2     webserver 88f9c8629668 web1
user@docker1:~$
```

If we start the container web1 once again, we should see that it gets a new IP address allocated to it:

```
user@docker1:~$ docker start web1
web1
user@docker1:~$ docker exec web1 ip addr show dev eth0
81: eth0@if82: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
    link/ether 02:42:ac:11:00:04 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.4/16 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:acff:fe11:4/64 scope link
       valid_lft forever preferred_lft forever
user@docker1:~$
```

If we check the container web2 again, we should see that Docker has updated it to reference web1's new IP address:

```
user@docker1:~$ docker exec web2 more /etc/hosts | grep web1
172.17.0.4     webserver 88f9c8629668 web1
user@docker1:~$
```

However, while Docker takes care of updating the hosts file with the new IP address, it will not update any of the environmental variables to reflect the new address:

```
user@docker1:~$ docker exec web2 printenv
…<Additional output removed for brevity>…
WEBSERVER_PORT=tcp://172.17.0.2:80
WEBSERVER_PORT_80_TCP=tcp://172.17.0.2:80
WEBSERVER_PORT_80_TCP_ADDR=172.17.0.2
…<Additional output removed for brevity>…
user@docker1:~$
```

It should also be pointed out that the link is only one-way. That is, this link does not cause the container web1 to become aware of the web2 container. The container web1 does not receive host records or environmental variables referencing web2:

```
user@docker1:~$ docker exec -it web1 ping web2
ping: unknown host
user@docker1:~$
```

Another reason to provision links is when you run Docker with Inter-Container Connectivity (ICC) set to false. As we've discussed previously, disabling ICC prevents any containers on the same bridge from talking directly to each other, forcing them to talk to each other only through published ports. Linking provides a mechanism to override the default ICC rules.
To demonstrate, let's stop and remove all the containers on our host docker1 and then add the following Docker option to the systemd drop-in file:

```
ExecStart=/usr/bin/dockerd --icc=false
```

Now reload the systemd configuration, restart the service, and start the following containers:

```
docker run -d -P --name=web1 jonlangemak/web_server_1
docker run -d -P --name=web2 jonlangemak/web_server_2
```

With ICC disabled, you'll notice that the containers can't talk directly to each other:

```
user@docker1:~$ docker exec web1 ip addr show dev eth0
87: eth0@if88: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
    link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.2/16 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:acff:fe11:2/64 scope link
       valid_lft forever preferred_lft forever
user@docker1:~$ docker exec -it web2 curl http://172.17.0.2
user@docker1:~$
```

In the preceding example, web2 is not able to access the web server on web1. Now let's delete and recreate the web2 container, this time linking it to web1:

```
user@docker1:~$ docker stop web2
web2
user@docker1:~$ docker rm web2
web2
user@docker1:~$ docker run -d -P --name=web2 --link=web1 jonlangemak/web_server_2
4c77916bb08dfc586105cee7ae328c30828e25fcec1df55f8adba8545cbb2d30
user@docker1:~$ docker exec -it web2 curl http://172.17.0.2
<body>
  <html>
    <h1><span style="color:#FF0000;font-size:72px;">Web Server #1 - Running on port 80</span></h1>
  </body>
</html>
user@docker1:~$
```

We can see that with the link in place the communication is allowed as expected. Once again, just like the name resolution, this access is allowed in only one direction. It should be noted that linking works differently when using user-defined networks. In this recipe we covered what are now called legacy links; linking with user-defined networks will be covered in the following recipes.

Leveraging Docker DNS

The introduction of user-defined networks signaled a big change in Docker networking. While the ability to provision custom networks was the big news, there were also major enhancements in name resolution. User-defined networks can benefit from what's being called embedded DNS. The Docker engine itself now has the ability to provide name resolution to all of the containers. This is a marked improvement over the legacy solution, where the only means for name resolution was external DNS or linking, which relied on the hosts file. In this recipe, we'll walk through how to use and configure embedded DNS.

Getting Ready

In this recipe we'll be demonstrating the configuration on a single Docker host. It is assumed that this host has Docker installed and that Docker is in its default configuration. We'll be altering name resolution settings on the host, so you'll need root-level access.

How to do it…

As mentioned, the embedded DNS system only works on user-defined Docker networks. That being said, let's provision a user-defined network and then start a simple container on it:

```
user@docker1:~$ docker network create -d bridge mybridge1
0d75f46594eb2df57304cf3a2b55890fbf4b47058c8e43a0a99f64e4ede98f5f
user@docker1:~$ docker run -d -P --name=web1 --net=mybridge1 jonlangemak/web_server_1
3a65d84a16331a5a84dbed4ec29d9b6042dde5649c37bc160bfe0b5662ad7d65
user@docker1:~$
```

As we saw in an earlier recipe, by default, Docker pulls the name resolution configuration from the Docker host and provides it to the container. This behavior can be changed by providing different DNS servers or search domains either at the service level or at container runtime.
In the case of containers connected to a user-defined network, the DNS settings provided to the container are slightly different. For instance, let's look at the resolv.conf file for the container we just connected to the user-defined bridge mybridge1:

```
user@docker1:~$ docker exec -t web1 more /etc/resolv.conf
search lab.lab
nameserver 127.0.0.11
options ndots:0
user@docker1:~$
```

Notice how the name server for this container is now 127.0.0.11. This IP address represents Docker's embedded DNS server and will be used by any container which is connected to a user-defined network. It is a requirement that any container connected to a user-defined network use the embedded DNS server. Containers not initially started on a user-defined network get updated the moment they connect to one. For instance, let's start another container called web2, but have it use the default docker0 bridge:

```
user@docker1:~$ docker run -dP --name=web2 jonlangemak/web_server_2
d0c414477881f03efac26392ffbdfb6f32914597a0a7ba578474606d5825df3f
user@docker1:~$ docker exec -t web2 more /etc/resolv.conf
::::::::::::::
/etc/resolv.conf
::::::::::::::
# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
#     DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
nameserver 10.20.30.13
search lab.lab
user@docker1:~$
```

If we now connect the web2 container to our user-defined network, Docker will update the name server to reflect the embedded DNS server:

```
user@docker1:~$ docker network connect mybridge1 web2
user@docker1:~$ docker exec -t web2 more /etc/resolv.conf
search lab.lab
nameserver 127.0.0.11
options ndots:0
user@docker1:~$
```

Since both our containers are now connected to the same user-defined network, they can reach each other by name:

```
user@docker1:~$ docker exec -t web1 ping web2 -c 2
PING web2 (172.18.0.3): 48 data bytes
56 bytes from 172.18.0.3: icmp_seq=0 ttl=64 time=0.107 ms
56 bytes from 172.18.0.3: icmp_seq=1 ttl=64 time=0.087 ms
--- web2 ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.087/0.097/0.107/0.000 ms

user@docker1:~$ docker exec -t web2 ping web1 -c 2
PING web1 (172.18.0.2): 48 data bytes
56 bytes from 172.18.0.2: icmp_seq=0 ttl=64 time=0.060 ms
56 bytes from 172.18.0.2: icmp_seq=1 ttl=64 time=0.119 ms
--- web1 ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.060/0.089/0.119/0.030 ms
user@docker1:~$
```

You'll note that the name resolution is bidirectional and works inherently, without the use of any links. That being said, with user-defined networks we can still define links for the purpose of creating local aliases. For instance, let's stop and remove both containers web1 and web2 and reprovision them as follows:

```
user@docker1:~$ docker run -d -P --name=web1 --net=mybridge1 --link=web2:thesecondserver jonlangemak/web_server_1
fd21c53def0c2255fc20991fef25766db9e072c2bd503c7adf21a1bd9e0c8a0a
user@docker1:~$ docker run -d -P --name=web2 --net=mybridge1 --link=web1:thefirstserver jonlangemak/web_server_2
6e8f6ab4dec7110774029abbd69df40c84f67bcb6a38a633e0a9faffb5bf625e
user@docker1:~$
```

The first interesting item to point out is that Docker let us link to a container that did not yet exist. When we ran the container web1, we asked Docker to link it to the container web2; at that point, web2 didn't exist. This is a notable difference in how links work with the embedded DNS server.
In legacy linking, Docker needed to know the target container's information prior to making the link, because it had to manually update the source container's hosts file and environmental variables. The second interesting item is that aliases no longer appear in the containers' hosts files. If we look at the name resolution configuration of each container, we'll see that linking no longer generates host records; both containers simply point at the embedded DNS server:

```
user@docker1:~$ docker exec -t web1 more /etc/resolv.conf
search lab.lab
nameserver 127.0.0.11
options ndots:0
user@docker1:~$ docker exec -t web2 more /etc/resolv.conf
search lab.lab
nameserver 127.0.0.11
options ndots:0
user@docker1:~$
```

All of the resolution now occurs in the embedded DNS server. This includes keeping track of defined aliases and their scope. So even without host records, each container is able to resolve the other container's alias through the embedded DNS server:

```
user@docker1:~$ docker exec -t web1 ping thesecondserver -c2
PING thesecondserver (172.18.0.3): 48 data bytes
56 bytes from 172.18.0.3: icmp_seq=0 ttl=64 time=0.067 ms
56 bytes from 172.18.0.3: icmp_seq=1 ttl=64 time=0.067 ms
--- thesecondserver ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.067/0.067/0.067/0.000 ms

user@docker1:~$ docker exec -t web2 ping thefirstserver -c 2
PING thefirstserver (172.18.0.2): 48 data bytes
56 bytes from 172.18.0.2: icmp_seq=0 ttl=64 time=0.062 ms
56 bytes from 172.18.0.2: icmp_seq=1 ttl=64 time=0.042 ms
--- thefirstserver ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.042/0.052/0.062/0.000 ms
user@docker1:~$
```

The aliases created have a scope that is local to the container itself. For instance, a third container on the same user-defined network is not able to resolve the aliases created as part of the links:

```
user@docker1:~$ docker run -d -P --name=web3 --net=mybridge1 jonlangemak/web_server_1
d039722a155b5d0a702818ce4292270f30061b928e05740d80bb0c9cb50dd64f
user@docker1:~$ docker exec -it web3 ping thefirstserver -c 2
ping: unknown host
user@docker1:~$ docker exec -it web3 ping thesecondserver -c 2
ping: unknown host
user@docker1:~$
```

You'll recall that legacy linking also automatically created a set of environmental variables on the source container. These environmental variables referenced the target container and any ports it might be exposing. Linking in user-defined networks does not create these environmental variables:

```
user@docker1:~$ docker exec web1 printenv
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
HOSTNAME=4eba77b66d60
APACHE_RUN_USER=www-data
APACHE_RUN_GROUP=www-data
APACHE_LOG_DIR=/var/log/apache2
HOME=/root
user@docker1:~$
```

As we saw in the previous recipe, keeping these variables up to date wasn't achievable even with legacy links, so it's not a total surprise that the functionality doesn't exist for user-defined networks.

In addition to providing local container resolution, the embedded DNS server also handles any external requests. As we saw in the preceding example, the search domain from the Docker host (lab.lab in my case) is still passed down to the containers and configured in their resolv.conf files. The name server learned from the host becomes a forwarder for the embedded DNS server. This allows the embedded DNS server to process any container name resolution requests and hand off external requests to the name server used by the Docker host.
This behavior can be overridden either at the service level or by passing the --dns or --dns-search flag to a container at runtime. For instance, we can start two more containers from the same image and specify a specific DNS server for each:

```
user@docker1:~$ docker run -dP --net=mybridge1 --name=web4 --dns=10.20.30.13 jonlangemak/web_server_1
19e157b46373d24ca5bbd3684107a41f22dea53c91e91e2b0d8404e4f2ccfd68
user@docker1:~$ docker run -dP --net=mybridge1 --name=web5 --dns=8.8.8.8 jonlangemak/web_server_1
700f8ac4e7a20204100c8f0f48710e0aab8ac0f05b86f057b04b1bbfe8141c26
user@docker1:~$
```

Note that web4 would receive 10.20.30.13 as a DNS forwarder even if we didn't specify it explicitly, because that's also the DNS server used by the Docker host, and when not specified, the container inherits it from the host. It is specified here for the sake of the example. Now if we try to resolve a local DNS record on either container, we can see that in the case of web4 it works, since it has the local DNS server defined, whereas the lookup on web5 fails because 8.8.8.8 doesn't know about the lab.lab domain:

```
user@docker1:~$ docker exec -it web4 ping docker1.lab.lab -c 2
PING docker1.lab.lab (10.10.10.101): 48 data bytes
56 bytes from 10.10.10.101: icmp_seq=0 ttl=64 time=0.080 ms
56 bytes from 10.10.10.101: icmp_seq=1 ttl=64 time=0.078 ms
--- docker1.lab.lab ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.078/0.079/0.080/0.000 ms

user@docker1:~$ docker exec -it web5 ping docker1.lab.lab -c 2
ping: unknown host
user@docker1:~$
```

Summary

In this article we discussed the available options for container name resolution, covering both the default name resolution behavior and the embedded DNS server functionality that comes with user-defined networks, as well as the process used to determine name server assignment under each of these scenarios. Further resources on this subject: Managing Application Configuration, Virtualizing Hosts and Applications, Deploying a Play application on CoreOS and Docker.


Hosting on Google App Engine

Packt
21 Oct 2016
22 min read
In this article by Mat Ryer, the author of the book Go Programming Blueprints, Second Edition, we will see how to build a Go application and deploy it to Google App Engine, using Google Cloud Datastore, Google's data storage facility for App Engine developers.

Google App Engine gives developers a NoOps (short for No Operations, indicating that developers and engineers have no operational work to do in order to have their code running and available) way of deploying their applications, and Go has been officially supported as a language option for some years now. Google's architecture runs some of the biggest applications in the world, such as Google Search, Google Maps, and Gmail, so it is a pretty safe bet when it comes to deploying our own code.

Google App Engine allows you to write a Go application, add a few special configuration files, and deploy it to Google's servers, where it will be hosted and made available in a highly available, scalable, and elastic environment. Instances will automatically spin up to meet demand and tear down gracefully when they are no longer needed, with a healthy free quota and preapproved budgets. Along with running application instances, Google App Engine makes available a myriad of useful services, such as fast and high-scale data stores, search, memcache, and task queues. Transparent load balancing means you don't need to build and maintain additional software or hardware to ensure servers don't get overloaded and that requests are fulfilled quickly.

In this article, we will build the API backend for a question-and-answer service similar to Stack Overflow or Quora and deploy it to Google App Engine. In the process, we'll explore techniques, patterns, and practices that can be applied to all such applications, as well as dive deep into some of the more useful services available to our application.

Specifically, in this article, you will learn:

- How to use the Google App Engine SDK for Go to build and test applications locally before deploying to the cloud
- How to use app.yaml to configure your application
- How modules in Google App Engine let you independently manage the different components that make up your application
- How Google Cloud Datastore lets you persist and query data at scale
- A sensible pattern for modeling data and working with keys in Google Cloud Datastore
- How to use the Google App Engine Users API to authenticate people with Google accounts
- A pattern for embedding denormalized data into entities

The Google App Engine SDK for Go

In order to run and deploy Google App Engine applications, we must download and configure the Go SDK. Head over to https://cloud.google.com/appengine/downloads and download the latest Google App Engine SDK for Go for your computer. The ZIP file contains a folder called go_appengine, which you should place in an appropriate folder outside of your GOPATH, for example, in /Users/yourname/work/go_appengine.

It is possible that the names of these SDKs will change in the future; if that happens, ensure that you consult the project home page at https://github.com/matryer/goblueprints for notes pointing you in the right direction.

Next, you will need to add the go_appengine folder to your $PATH environment variable, much like what you did with the go folder when you first configured Go.
To test your installation, open a terminal and type this:

```
goapp version
```

You should see something like the following:

```
go version go1.6.1 (appengine-1.9.37) darwin/amd64
```

The actual version of Go is likely to differ and is often a few months behind actual Go releases. This is because the Cloud Platform team at Google needs to do work on its end to support new releases of Go. The goapp command is a drop-in replacement for the go command with a few additional subcommands, so you can do things such as goapp test and goapp vet, for example.

Creating your application

In order to deploy an application to Google's servers, we must use the Google Cloud Platform Console to set it up. In a browser, go to https://console.cloud.google.com and sign in with your Google account. Look for the Create Project menu item, which often gets moved around as the console changes from time to time. If you already have some projects, click on a project name to open a submenu, and you'll find it in there. If you can't find what you're looking for, just search for "Creating App Engine project" and you'll find it.

When the New Project dialog box opens, you will be asked for a name for your application. You are free to call it whatever you like (for example, Answers), but note the Project ID that is generated for you; you will need to refer to this when you configure your app later. You can also click on Edit and specify your own ID, but know that the value must be globally unique, so you'll have to get creative when thinking one up. Here we will use answersapp as the application ID, but you won't be able to use that one since it has already been taken. You may need to wait a minute or two for your project to get created; there's no need to watch the page, so you can continue and check back later.

App Engine applications are Go packages

Now that the Google App Engine SDK for Go is configured and our application has been created, we can start building it. In Google App Engine, an application is just a normal Go package with an init function that registers handlers via the http.Handle or http.HandleFunc functions. It does not need to be the main package like normal tools.

Create a new folder (somewhere inside your GOPATH folder) called answersapp/api and add the following main.go file:

```go
package api

import (
	"io"
	"net/http"
)

func init() {
	http.HandleFunc("/", handleHello)
}

func handleHello(w http.ResponseWriter, r *http.Request) {
	io.WriteString(w, "Hello from App Engine")
}
```

You will be familiar with most of this by now, but note that there is no ListenAndServe call, and the handlers are set inside the init function rather than main. We are going to handle every request with our simple handleHello function, which will just write a welcoming string.
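As a quick aside, because goapp is a drop-in replacement for the go toolchain, the handler can be unit tested like any other Go HTTP handler and run with goapp test. The following test file is only an illustration (it is not part of the book's code) and assumes it sits alongside main.go in the api package:

```go
package api

import (
	"net/http"
	"net/http/httptest"
	"testing"
)

// TestHandleHello exercises the handler directly, without starting a server.
func TestHandleHello(t *testing.T) {
	r, err := http.NewRequest("GET", "/", nil)
	if err != nil {
		t.Fatal(err)
	}
	w := httptest.NewRecorder()
	handleHello(w, r)
	if got := w.Body.String(); got != "Hello from App Engine" {
		t.Errorf("unexpected response body: %q", got)
	}
}
```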
The app.yaml file

In order to turn our simple Go package into a Google App Engine application, we must add a special configuration file called app.yaml. The file goes at the root of the application or module, so create it inside the answersapp/api folder with the following contents:

```yaml
application: YOUR_APPLICATION_ID_HERE
version: 1
runtime: go
api_version: go1

handlers:
- url: /.*
  script: _go_app
```

The file is a simple human-readable (and machine-readable) configuration file in YAML format (refer to yaml.org for more details). The following list describes each property:

- application: The application ID, copied and pasted from when you created your project.
- version: Your application version number. You can deploy multiple versions and even split traffic between them to test new features, among other things. We'll just stick with version 1 for now.
- runtime: The name of the runtime that will execute your application. Since we're building a Go application, we'll use go.
- api_version: The go1 api version is the runtime version supported by Google; you can imagine that this could be go2 in the future.
- handlers: A selection of configured URL mappings. In our case, everything will be mapped to the special _go_app script, but you can also specify static files and folders here.

Running simple applications locally

Before we deploy our application, it makes sense to test it locally. We can do this using the App Engine SDK we downloaded earlier. Navigate to your answersapp/api folder and run the following command in a terminal:

```
goapp serve
```

The output should indicate that an API server is running locally on port :56443, an admin server is running on :8000, and our application (the module named default) is now serving at localhost:8080, so let's hit that one in a browser. As you can see by the Hello from App Engine response, our application is running locally.

Navigate to the admin server by changing the port from :8080 to :8000. This opens a web portal that we can use to interrogate the internals of our application, including viewing running instances, inspecting the data store, managing task queues, and more.

Deploying simple applications to Google App Engine

To truly understand the power of Google App Engine's NoOps promise, we are going to deploy this simple application to the cloud. Back in the terminal, stop the server by hitting Ctrl+C and run the following command:

```
goapp deploy
```

Your application will be packaged and uploaded to Google's servers. Once it's finished, you should see something like the following:

```
Completed update of app: theanswersapp, version: 1
```

It really is as simple as that. You can prove this by navigating to the endpoint you get for free with every Google App Engine application, remembering to replace the application ID with your own: https://YOUR_APPLICATION_ID_HERE.appspot.com/.

You will see the same output as earlier (the font may render differently since Google's servers will make assumptions about the content type that the local dev server doesn't). The application is being served over HTTP/2 and is already capable of pretty massive scale, and all we did was write a config file and a few lines of code.

Modules in Google App Engine

A module is a Go package that can be versioned, updated, and managed independently. An app might have a single module, or it can be made up of many modules: each distinct, but part of the same application, with access to the same data and services. An application must have a default module, even if it doesn't do much.

Our application will be made up of the following modules:

- default: the obligatory default module
- api: an API package delivering RESTful JSON
- web: a static website serving HTML, CSS, and JavaScript that makes AJAX calls to the api module

Each module will be a Go package and will, therefore, live inside its own folder.

Let's reorganize our project into modules by creating a new folder alongside the api folder called default. We are not going to make our default module do anything other than use it for configuration, as we want our other modules to do all the meaningful work.
But if we leave this folder empty, the Google App Engine SDK will complain that it has nothing to build. Inside the default folder, add the following placeholder main.go file:

```go
package defaultmodule

func init() {}
```

This file does nothing except allow our default module to exist. It would have been nice for our package names to match the folders, but default is a reserved keyword in Go, so we have a good reason to break that rule.

The other module in our application will be called web, so create another folder alongside the api and default folders called web. Here we are only going to build the API for our application and cheat by downloading the web module. Head over to the project home page at https://github.com/matryer/goblueprints, access the content for the Second Edition, and look for the download link for the web components for this article in the Downloads section of the README file. The ZIP file contains the source files for the web component, which should be unzipped and placed inside the web folder.

Now, our application structure should look like this:

```
/answersapp/api
/answersapp/default
/answersapp/web
```

Specifying modules

To specify which module our api package will become, we must add a property to the app.yaml inside our api folder. Update it to include the module property:

```yaml
application: YOUR_APPLICATION_ID_HERE
version: 1
runtime: go
module: api
api_version: go1

handlers:
- url: /.*
  script: _go_app
```

Since our default module will need to be deployed as well, we also need to add an app.yaml configuration file to it. Duplicate the api/app.yaml file inside default/app.yaml, changing the module to default:

```yaml
application: YOUR_APPLICATION_ID_HERE
version: 1
runtime: go
module: default
api_version: go1

handlers:
- url: /.*
  script: _go_app
```

Routing to modules with dispatch.yaml

In order to route traffic appropriately to our modules, we will create another configuration file called dispatch.yaml, which will let us map URL patterns to the modules. We want all traffic beginning with the /api/ path to be routed to the api module and everything else to the web module. As mentioned earlier, we don't expect our default module to handle any traffic, but it will have more utility later.

In the answersapp folder (alongside our module folders, not inside any of them), create a new file called dispatch.yaml with the following contents:

```yaml
application: YOUR_APPLICATION_ID_HERE

dispatch:
- url: "*/api/*"
  module: api
- url: "*/*"
  module: web
```

The same application property tells the Google App Engine SDK for Go which application we are referring to, and the dispatch section routes URLs to modules.

Google Cloud Datastore

One of the services available to App Engine developers is Google Cloud Datastore, a NoSQL document database built for automatic scaling and high performance. Its limited feature set guarantees very high scale, but understanding the caveats and best practices is vital to a successful project.

Denormalizing data

Developers with experience of relational databases (RDBMS) will often aim to reduce data redundancy (trying to have each piece of data appear only once in their database) by normalizing data, spreading it across many tables, and adding references (foreign keys) before joining it back via a query to build a complete picture. In schemaless and NoSQL databases, we tend to do the opposite. We denormalize data so that each document contains the complete picture it needs, making read times extremely fast, since a read only needs to go and get a single thing.
For example, consider how we might model tweets in a relational database such as MySQL or Postgres: A tweet itself contains only its unique ID, a foreign key reference to the Users table representing the author of the tweet, and perhaps many URLs that were mentioned in TweetBody. One nice feature of this design is that a user can change their Name or AvatarURL and it will be reflected in all of their tweets, past and future: something you wouldn't get for free in a denormalized world. However, in order to present a tweet to the user, we must load the tweet itself, look up (via a join) the user to get their name and avatar URL, and then load the associated data from the URLs table in order to show a preview of any links. At scale, this becomes difficult because all three tables of data might well be physically separated from each other, which means lots of things need to happen in order to build up this complete picture. Consider what a denormalized design would look like instead: We still have the same three buckets of data, except that now our tweet contains everything it needs in order to render to the user without having to look up data from anywhere else. The hardcore relational database designers out there are realizing what this means by now, and it is no doubt making them feel uneasy. Following this approach means that: Data is repeated—AvatarURL in User is repeated as UserAvatarURL in the tweet (waste of space, right?) If the user changes their AvatarURL, UserAvatarURL in the tweet will be out of date Database design, at the end of the day, comes down to physics. We are deciding that our tweet is going to be read far more times than it is going to be written, so we'd rather take the pain up-front and take a hit in storage. There's nothing wrong with repeated data as long as there is an understanding about which set is the master set and which is duplicated for speed. Changing data is an interesting topic in itself, but let's think about a few reasons why we might be OK with the trade-offs. Firstly, the speed benefit to reading tweets is probably worth the unexpected behavior of changes to master data not being reflected in historical documents; it would be perfectly acceptable to decide to live with this emerged functionality for that reason. Secondly, we might decide that it makes sense to keep a snapshot of data at a specific moment in time. For example, imagine if someone tweets asking whether people like their profile picture. If the picture changed, the tweet context would be lost. For a more serious example, consider what might happen if you were pointing to a row in an Addresses table for an order delivery and the address later changed. Suddenly, the order might look like it was shipped to a different place. Finally, storage is becoming increasingly cheaper, so the need for normalizing data to save space is lessened. Twitter even goes as far as copying the entire tweet document for each of your followers. 100 followers on Twitter means that your tweet will be copied at least 100 times, maybe more for redundancy. This sounds like madness to relational database enthusiasts, but Twitter is making smart trade-offs based on its user experience; they'll happily spend a lot of time writing a tweet and storing it many times to ensure that when you refresh your feed, you don't have to wait very long to get updates. If you want to get a sense of the scale of this, check out the Twitter API and look at what a tweet document consists of. It's a lot of data. 
Then, go and look at how many followers Lady Gaga has. This has become known in some circles as "the Lady Gaga problem" and is addressed by a variety of different technologies and techniques that are out of the scope of this article. Now that we have an understanding of good NoSQL design practices, let's implement the types, functions, and methods required to drive the data part of our API. Entities and data access To persist data in Google Cloud Datastore, we need a struct to represent each entity. These entity structures will be serialized and deserialized when we save and load data through the datastore API. We can add helper methods to perform the interactions with the data store, which is a nice way to keep such functionality physically close to the entities themselves. For example, we will model an answer with a struct called Answer and add a Create method that in turn calls the appropriate function from the datastore package. This prevents us from bloating our HTTP handlers with lots of data access code and allows us to keep them clean and simple instead. One of the foundation blocks of our application is the concept of a question. A question can be asked by a user and answered by many. It will have a unique ID so that it is addressable (referable in a URL), and we'll store a timestamp of when it was created. type Question struct { Key *datastore.Key `json:"id" datastore:"-"` CTime time.Time `json:"created"` Question string `json:"question"` User UserCard `json:"user"` AnswersCount int `json:"answers_count"` } The UserCard struct represents a denormalized User entity, both of which we'll add later. You can import the datastore package in your Go project using this: import "google.golang.org/appengine/datastore" It's worth spending a little time understanding the datastore.Key type. Keys in Google Cloud Datastore Every entity in Datastore has a key, which uniquely identifies it. They can be made up of either a string or an integer depending on what makes sense for your case. You are free to decide the keys for yourself or let Datastore automatically assign them for you; again, your use case will usually decide which is the best approach to take Keys are created using datastore.NewKey and datastore.NewIncompleteKey functions and are used to put and get data into and out of Datastore via the datastore.Get and datastore.Put functions. In Datastore, keys and entity bodies are distinct, unlike in MongoDB or SQL technologies, where it is just another field in the document or record. This is why we are excluding Key from our Question struct with the datastore:"-" field tag. Like the json tags, this indicates that we want Datastore to ignore the Key field altogether when it is getting and putting data. Keys may optionally have parents, which is a nice way of grouping associated data together and Datastore makes certain assurances about such groups of entities, which you can read more about in the Google Cloud Datastore documentation online. Putting data into Google Cloud Datastore Before we save data into Datastore, we want to ensure that our question is valid. Add the following method underneath the Question struct definition: func (q Question) OK() error { if len(q.Question) < 10 { return errors.New("question is too short") } return nil } The OK function will return an error if something is wrong with the question, or else it will return nil. In this case, we just check to make sure the question has at least 10 characters. 
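To give a sense of where this validation fits, here is a minimal, hypothetical handler sketch; the handleQuestionCreate name, the package name, and the error responses are illustrative assumptions rather than the book's actual handler code, and the Create method it calls is the persistence helper we add next:

package api // assumption: the api module's package name

import (
    "encoding/json"
    "net/http"

    "google.golang.org/appengine"
)

// handleQuestionCreate is an illustrative handler: it decodes a question,
// validates it with OK(), and only then persists it.
func handleQuestionCreate(w http.ResponseWriter, r *http.Request) {
    var q Question
    if err := json.NewDecoder(r.Body).Decode(&q); err != nil {
        http.Error(w, err.Error(), http.StatusBadRequest)
        return
    }
    // Reject invalid questions before touching the data store.
    if err := q.OK(); err != nil {
        http.Error(w, err.Error(), http.StatusBadRequest)
        return
    }
    ctx := appengine.NewContext(r) // App Engine context for datastore and log calls
    if err := q.Create(ctx); err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    w.Header().Set("Content-Type", "application/json")
    json.NewEncoder(w).Encode(&q)
}

The idea is simply that OK() runs before any datastore work, so an invalid question is rejected with a 400 response instead of being stored.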
To persist this data in the data store, we are going to add a method to the Question struct itself. At the bottom of questions.go, add the following code:

func (q *Question) Create(ctx context.Context) error {
    log.Debugf(ctx, "Saving question: %s", q.Question)
    if q.Key == nil {
        q.Key = datastore.NewIncompleteKey(ctx, "Question", nil)
    }
    user, err := UserFromAEUser(ctx)
    if err != nil {
        return err
    }
    q.User = user.Card()
    q.CTime = time.Now()
    q.Key, err = datastore.Put(ctx, q.Key, q)
    if err != nil {
        return err
    }
    return nil
}

The Create method takes a pointer to Question as the receiver, which is important because we want to make changes to the fields. If the receiver was (q Question)—without *, we would get a copy of the question rather than a pointer to it, and any changes we made to it would only affect our local copy and not the original Question struct itself.
The first thing we do is use log (from the google.golang.org/appengine/log package) to write a debug statement saying we are saving the question. When you run your code in a development environment, you will see this appear in the terminal; in production, it goes into a dedicated logging service provided by Google Cloud Platform.
If the key is nil (meaning this is a new question), we assign an incomplete key to the field, which informs Datastore that we want it to generate a key for us. The three arguments we pass are context.Context (which we must pass to all datastore functions and methods), a string describing the kind of entity, and the parent key; in our case, this is nil.
Once we know there is a key in place, we call a method (which we will add later) to get or create User from an App Engine user and set it to the question, and then set the CTime field (created time) to time.Now—timestamping the point at which the question was asked.
Once we have our Question in good shape, we call datastore.Put to actually place it inside the data store. As usual, the first argument is context.Context, followed by the question key and the question entity itself. Since Google Cloud Datastore treats keys as separate and distinct from entities, we have to do a little extra work if we want to keep them together in our own code.
The datastore.Put function returns two values: the completed key and an error. The returned key is actually useful because we're sending in an incomplete key and asking the data store to create one for us, which it does during the put operation. If successful, it returns a new datastore.Key object to us, representing the completed key, which we then store in the Key field of our Question object. If all is well, we return nil.
Add another helper to update an existing question:

func (q *Question) Update(ctx context.Context) error {
    if q.Key == nil {
        q.Key = datastore.NewIncompleteKey(ctx, "Question", nil)
    }
    var err error
    q.Key, err = datastore.Put(ctx, q.Key, q)
    if err != nil {
        return err
    }
    return nil
}

This method is very similar, except that it doesn't set the CTime or User fields, as they will already have been set.
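To make the role of the completed key a little more concrete, here is a small illustrative snippet (not from the book) that could live in the same questions.go file, so it relies only on imports the file already uses; it assumes ctx comes from a request with a signed-in App Engine user, since Create calls UserFromAEUser:

// createAndShowKey is an illustrative helper: after Create returns, the Key
// field holds the completed key that Datastore generated during the put.
func createAndShowKey(ctx context.Context) error {
    q := &Question{Question: "What is the best way to learn Go in 2016?"}
    if err := q.Create(ctx); err != nil {
        return err
    }
    // Encode produces a web-safe string form of the key, handy for URLs such
    // as /api/questions/<encoded-key>.
    log.Debugf(ctx, "new question key: %s", q.Key.Encode())
    // Subsequent changes are persisted with the Update helper.
    q.AnswersCount++
    return q.Update(ctx)
}

The encoded key is what the API can hand back to clients; it can be turned back into a datastore.Key later with datastore.DecodeKey when a specific question needs to be loaded.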
Reading data from Google Cloud Datastore
Reading data is as simple as writing it: we use the datastore.Get function. However, since we want to maintain keys in our entities (and the datastore functions don't work like that), it's common to add a helper function like the one we are going to add to questions.go:

func GetQuestion(ctx context.Context, key *datastore.Key) (*Question, error) {
    var q Question
    err := datastore.Get(ctx, key, &q)
    if err != nil {
        return nil, err
    }
    q.Key = key
    return &q, nil
}

The GetQuestion function takes context.Context and the datastore.Key of the question to get. It then does the simple task of calling datastore.Get and assigning the key to the entity before returning it. Of course, errors are handled in the usual way.
This is a nice pattern to follow so that users of your code know that they never have to interact with datastore.Get and datastore.Put directly but rather use the helpers that can ensure the entities are properly populated with the keys (along with any other tweaks that they might want to do before saving or after loading).
Summary
This article gave us an overview of Go on Google App Engine: we created a simple application, ran it locally, and deployed it to Google App Engine, gaining a clear understanding of the configuration files involved and how they work. We also looked at modules in Google App Engine and at Google Cloud Datastore, Google's data storage facility for App Engine developers.
Resources for Article:
Further resources on this subject:
Google Forms for Multiple Choice and Fill-in-the-blank Assignments [article]
Publication of Apps [article]
Prerequisites for a Map Application [article]


The Data Science Venn Diagram

Packt
21 Oct 2016
15 min read
It is a common misconception that only those with a PhD or geniuses can understand the math/programming behind data science. This is absolutely false. In this article by Sinan Ozdemir, author of the book Principles of Data Science, we will discuss how data science begins with three basic areas: Math/statistics: This is the use of equations and formulas to perform analysis Computer programming: This is the ability to use code to create outcomes on the computer Domain knowledge: This refers to understanding the problem domain (medicine, finance, social science, and so on) (For more resources related to this topic, see here.) The following Venn diagram provides a visual representation of how the three areas of data science intersect: The Venn diagram of data science Those with hacking skills can conceptualize and program complicated algorithms using computer languages. Having a math and statistics knowledge base allows you to theorize and evaluate algorithms and tweak the existing procedures to fit specific situations. Having substantive (domain) expertise allows you to apply concepts and results in a meaningful and effective way. While having only two of these three qualities can make you intelligent, it will also leave a gap. Consider that you are very skilled in coding and have formal training in day trading. You might create an automated system to trade in your place but lack the math skills to evaluate your algorithms and, therefore, end up losing money in the long run. It is only when you can boast skills in coding, math, and domain knowledge, can you truly perform data science. The one that was probably a surprise for you was domain knowledge. It is really just knowledge of the area you are working in. If a financial analyst started analyzing data about heart attacks, they might need the help of a cardiologist to make sense of a lot of the numbers. Data science is the intersection of the three key areas mentioned earlier. In order to gain knowledge from data, we must be able to utilize computer programming to access the data, understand the mathematics behind the models we derive, and above all, understand our analyses' place in the domain we are in. This includes presentation of data. If we are creating a model to predict heart attacks in patients, is it better to create a PDF of information or an app where you can type in numbers and get a quick prediction? All these decisions must be made by the data scientist. Also, note that the intersection of math and coding is machine learning, but it is important to note that without the explicit ability to generalize any models or results to a domain, machine learning algorithms remain just as algorithms sitting on your computer. You might have the best algorithm to predict cancer. You could be able to predict cancer with over 99% accuracy based on past cancer patient data but if you don't understand how to apply this model in a practical sense such that doctors and nurses can easily use it, your model might be useless. Domain knowledge comes with both practice of data science and reading examples of other people's analyses. The math Most people stop listening once someone says the word "math". They'll nod along in an attempt to hide their utter disdain for the topic. We will use these subdomains of mathematics to create what are called models. A data model refers to an organized and formal relationship between elements of data, usually meant to simulate a real-world phenomenon. 
Essentially, we will use math in order to formalize relationships between variables. As a former pure mathematician and current math teacher, I know how difficult this can be. I will do my best to explain everything as clearly as I can. Between the three areas of data science, math is what allows us to move from domain to domain. Understanding theory allows us to apply a model that we built for the fashion industry to a financial model. Every mathematical concept I introduce, I do so with care, examples, and purpose. The math in this article is essential for data scientists. Example – Spawner-Recruit Models In biology, we use, among many others, a model known as the Spawner-Recruit model to judge the biological health of a species. It is a basic relationship between the number of healthy parental units of a species and the number of new units in the group of animals. In a public dataset of the number of salmon spawners and recruits, the following graph was formed to visualize the relationship between the two. We can see that there definitely is some sort of positive relationship (as one goes up, so does the other). But how can we formalize this relationship? For example, if we knew the number of spawners in a population, could we predict the number of recruits that group would obtain and vice versa? Essentially, models allow us to plug in one variable to get the other. Consider the following example: In this example, let's say we knew that a group of salmons had 1.15 (in thousands) of spawners. Then, we would have the following: This result can be very beneficial to estimate how the health of a population is changing. If we can create these models, we can visually observe how the relationship between the two variables can change. There are many types of data models, including probabilistic and statistical models. Both of these are subsets of a larger paradigm, called machine learning. The essential idea behind these three topics is that we use data in order to come up with the "best" model possible. We no longer rely on human instincts, rather, we rely on data. Spawner-Recruit model visualized The purpose of this example is to show how we can define relationships between data elements using mathematical equations. The fact that I used salmon health data was irrelevant! The main reason for this is that I would like you (the reader) to be exposed to as many domains as possible. Math and coding are vehicles that allow data scientists to step back and apply their skills virtually anywhere. Computer programming Let's be honest. You probably think computer science is way cooler than math. That's ok, I don't blame you. The news isn't filled with math news like it is with news on the technological front. You don't turn on the TV to see a new theory on primes, rather you will see investigative reports on how the latest smartphone can take photos of cats better or something. Computer languages are how we communicate with the machine and tell it to do our bidding. A computer speaks many languages and, like a book, can be written in many languages; similarly, data science can also be done in many languages. Python, Julia, and R are some of the many languages available to us. This article will focus exclusively on using Python. Why Python? We will use Python for a variety of reasons: Python is an extremely simple language to read and write even if you've coded before, which will make future examples easy to ingest and read later. 
It is one of the most common languages in production and in the academic setting (one of the fastest growing as a matter of fact). The online community of the language is vast and friendly. This means that a quick Google search should yield multiple results of people who have faced and solved similar (if not exact) situations. Python has prebuilt data science modules that both the novice and the veteran data scientist can utilize. The last is probably the biggest reason we will focus on Python. These prebuilt modules are not only powerful but also easy to pick up. Some of these modules are as follows: pandas sci-kit learn seaborn numpy/scipy requests (to mine data from the web) BeautifulSoup (for Web HTML parsing) Python practices Before we move on, it is important to formalize many of the requisite coding skills in Python. In Python, we have variables thatare placeholders for objects. We will focus on only a few types of basic objects at first: int (an integer) Examples: 3, 6, 99, -34, 34, 11111111 float (a decimal): Examples: 3.14159, 2.71, -0.34567 boolean (either true or false) The statement, Sunday is a weekend, is true The statement, Friday is a weekend, is false The statement, pi is exactly the ratio of a circle's circumference to its diameter, is true (crazy, right?) string (text or words made up of characters) I love hamburgers (by the way who doesn't?) Matt is awesome A Tweet is a string a list (a collection of objects) Example: 1, 5.4, True, "apple" We will also have to understand some basic logistical operators. For these operators, keep the boolean type in mind. Every operator will evaluate to either true or false. == evaluates to true if both sides are equal, otherwise it evaluates to false 3 + 4 == 7     (will evaluate to true) 3 – 2 == 7     (will evaluate to false) <  (less than) 3  < 5             (true) 5  < 3             (false) <= (less than or equal to) 3  <= 3             (true) 5  <= 3             (false) > (greater than) 3  > 5             (false) 5  > 3             (true) >= (greater than or equal to) 3  >= 3             (true) 5  >= 3             (false) When coding in Python, I will use a pound sign (#) to create a comment, which will not be processed as code but is merely there to communicate with the reader. Anything to the right of a # is a comment on the code being executed. Example of basic Python In Python, we use spaces/tabs to denote operations that belong to other lines of code. Note the use of the if statement. It means exactly what you think it means. When the statement after the if statement is true, then the tabbed part under it will be executed, as shown in the following code: X = 5.8 Y = 9.5 X + Y == 15.3 # This is True! X - Y == 15.3 # This is False! if x + y == 15.3: # If the statement is true: print "True!" # print something! The print "True!" belongs to the if x + y == 15.3: line preceding it because it is tabbed right under it. This means that the print statement will be executed if and only if x + y equals 15.3. Note that the following list variable, my_list, can hold multiple types of objects. This one has an int, a float, boolean, and string (in that order): my_list = [1, 5.7, True, "apples"] len(my_list) == 4 # 4 objects in the list my_list[0] == 1 # the first object my_list[1] == 5.7 # the second object In the preceding code: I used the len command to get the length of the list (which was four). Note the zero-indexing of Python. Most computer languages start counting at zero instead of one. 
So if I want the first element, I call the index zero and if I want the 95th element, I call the index 94. Example – parsing a single Tweet Here is some more Python code. In this example, I will be parsing some tweets about stock prices: tweet = "RT @j_o_n_dnger: $TWTR now top holding for Andor, unseating $AAPL" words_in_tweet = first_tweet.split(' ') # list of words in tweet for word in words_in_tweet: # for each word in list if "$" in word: # if word has a "cashtag" print "THIS TWEET IS ABOUT", word # alert the user I will point out a few things about this code snippet, line by line, as follows: We set a variable to hold some text (known as a string in Python). In this example, the tweet in question is "RT @robdv: $TWTR now top holding for Andor, unseating $AAPL" The words_in_tweet variable "tokenizes" the tweet (separates it by word). If you were to print this variable, you would see the following: "['RT', '@robdv:', '$TWTR', 'now', 'top', 'holding', 'for', 'Andor,', 'unseating', '$AAPL'] We iterate through this list of words. This is called a for loop. It just means that we go through a list one by one. Here, we have another if statement. For each word in this tweet, if the word contains the $ character (this is how people reference stock tickers on twitter). If the preceding if statement is true (that is, if the tweet contains a cashtag), print it and show it to the user. The output of this code will be as follows: We get this output as these are the only words in the tweet that use the cashtag. Whenever I use Python in this article, I will ensure that I am as explicit as possible about what I am doing in each line of code. Domain knowledge As I mentioned earlier, this category focuses mainly on having knowledge about the particular topic you are working on. For example, if you are a financial analyst working on stock market data, you have a lot of domain knowledge. If you are a journalist looking at worldwide adoption rates, you might benefit from consulting an expert in the field. Does that mean that if you're not a doctor, you can't work with medical data? Of course not! Great data scientists can apply their skills to any area, even if they aren't fluent in it. Data scientists can adapt to the field and contribute meaningfully when their analysis is complete. A big part of domain knowledge is presentation. Depending on your audience, it can greatly matter how you present your findings. Your results are only as good as your vehicle of communication. You can predict the movement of the market with 99.99% accuracy, but if your program is impossible to execute, your results will go unused. Likewise, if your vehicle is inappropriate for the field, your results will go equally unused. Some more terminology This is a good time to define some more vocabulary. By this point, you're probably excitedly looking up a lot of data science material and seeing words and phrases I haven't used yet. Here are some common terminologies you are likely to come across: Machine learning: This refers to giving computers the ability to learn from data without explicit "rules" being given by a programmer. Machine learning combines the power of computers with intelligent learning algorithms in order to automate the discovery of relationships in data and creation of powerful data models. 
Speaking of data models, we will concern ourselves with the following two basic types of data models: Probabilistic model: This refers to using probability to find a relationship between elements that includes a degree of randomness Statistical model: This refers to taking advantage of statistical theorems to formalize relationships between data elements in a (usually) simple mathematical formula While both the statistical and probabilistic models can be run on computers and might be considered machine learning in that regard, we will keep these definitions separate as machine learning algorithms generally attempt to learn relationships in different ways. Exploratory data analysis – This refers to preparing data in order to standardize results and gain quick insights Exploratory data analysis (EDA) is concerned with data visualization and preparation. This is where we turn unorganized data into organized data and also clean up missing/incorrect data points. During EDA, we will create many types of plots and use these plots in order to identify key features and relationships to exploit in our data models. Data mining – This is the process of finding relationships between elements of data. Data mining is the part of Data science where we try to find relationships between variables (think spawn-recruit model). I tried pretty hard not to use the term big data up until now. It's because I think this term is misused, a lot. While the definition of this word varies from person to person. Big datais data that is too large to be processed by a single machine (if your laptop crashed, it might be suffering from a case of big data). The state of data science so far (this diagram is incomplete and is meant for visualization purposes only). Summary More and more people are jumping headfirst into the field of data science, most with no prior experience in math or CS, which on the surface is great. Average data scientists have access to millions of dating profiles' data, tweets, online reviews, and much more in order to jumpstart their education. However, if you jump into data science without the proper exposure to theory or coding practices and without respect of the domain you are working in, you face the risk of oversimplifying the very phenomenon you are trying to model. Resources for Article: Further resources on this subject: Reconstructing 3D Scenes [article] Basics of Classes and Objects [article] Saying Hello! [article]


Jupyter and Python Scripting

Packt
21 Oct 2016
9 min read
In this article by Dan Toomey, author of the book Learning Jupyter, we will see data access in Jupyter with Python and the effect of pandas on Jupyter. We will also see Python graphics and lastly Python random numbers. (For more resources related to this topic, see here.) Python data access in Jupyter I started a view for pandas using Python Data Access as the name. We will read in a large dataset and compute some standard statistics on the data. We are interested in seeing how we use pandas in Jupyter, how well the script performs, and what information is stored in the metadata (especially if it is a larger dataset). Our script accesses the iris dataset built into one of the Python packages. All we are looking to do is read in a slightly large number of items and calculate some basic operations on the dataset. We are really interested in seeing how much of the data is cached in the PYNB file. The Python code is: # import the datasets package from sklearn import datasets # pull in the iris data iris_dataset = datasets.load_iris() # grab the first two columns of data X = iris_dataset.data[:, :2] # calculate some basic statistics x_count = len(X.flat) x_min = X[:, 0].min() - .5 x_max = X[:, 0].max() + .5 x_mean = X[:, 0].mean() # display our results x_count, x_min, x_max, x_mean I broke these steps into a couple of cells in Jupyter, as shown in the following screenshot: Now, run the cells (using Cell | Run All) and you get this display below. The only difference is the last Out line where our values are displayed. It seemed to take longer to load the library (the first time I ran the script) than to read the data and calculate the statistics. If we look in the PYNB file for this notebook, we see that none of the data is cached in the PYNB file. We simply have code references to the library, our code, and the output from when we last calculated the script: { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(300, 3.7999999999999998, 8.4000000000000004, 5.8433333333333337)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# calculate some basic statisticsn", "x_count = len(X.flat)n", "x_min = X[:, 0].min() - .5n", "x_max = X[:, 0].max() + .5n", "x_mean = X[:, 0].mean()n", "n", "# display our resultsn", "x_count, x_min, x_max, x_mean" ] } Python pandas in Jupyter One of the most widely used features of Python is pandas. pandas are built-in libraries of data analysis packages that can be used freely. In this example, we will develop a Python script that uses pandas to see if there is any effect to using them in Jupyter. I am using the Titanic dataset from http://www.kaggle.com/c/titanic-gettingStarted/download/train.csv. I am sure the same data is available from a variety of sources. Here is our Python script that we want to run in Jupyter: from pandas import * training_set = read_csv('train.csv') training_set.head() male = training_set[training_set.sex == 'male'] female = training_set[training_set.sex =='female'] womens_survival_rate = float(sum(female.survived))/len(female) mens_survival_rate = float(sum(male.survived))/len(male) The result is… we calculate the survival rates of the passengers based on sex. We create a new notebook, enter the script into appropriate cells, include adding displays of calculated data at each point and produce our results. 
Here is our notebook laid out where we added displays of calculated data at each cell,as shown in the following screenshot: When I ran this script, I had two problems: On Windows, it is common to use backslash ("") to separate parts of a filename. However, this coding uses the backslash as a special character. So, I had to change over to use forward slash ("/") in my CSV file path. I originally had a full path to the CSV in the above code example. The dataset column names are taken directly from the file and are case sensitive. In this case, I was originally using the 'sex' field in my script, but in the CSV file the column is named Sex. Similarly I had to change survived to Survived. The final script and result looks like the following screenshot when we run it: I have used the head() function to display the first few lines of the dataset. It is interesting… the amount of detail that is available for all of the passengers. If you scroll down, you see the results as shown in the following screenshot: We see that 74% of the survivors were women versus just 19% men. I would like to think chivalry is not dead! Curiously the results do not total to 100%. However, like every other dataset I have seen, there is missing and/or inaccurate data present. Python graphics in Jupyter How do Python graphics work in Jupyter? I started another view for this named Python Graphics so as to distinguish the work. If we were to build a sample dataset of baby names and the number of births in a year of that name, we could then plot the data. The Python coding is simple: import pandas import matplotlib %matplotlib inline baby_name = ['Alice','Charles','Diane','Edward'] number_births = [96, 155, 66, 272] dataset = list(zip(baby_name,number_births)) df = pandas.DataFrame(data = dataset, columns=['Name', 'Number']) df['Number'].plot() The steps of the script are as follows: We import the graphics library (and data library) that we need Define our data Convert the data into a format that allows for easy graphical display Plot the data We would expect a resultant graph of the number of births by baby name. Taking the above script and placing it into cells of our Jupyter node, we get something that looks like the following screenshot: I have broken the script into different cells for easier readability. Having different cells also allows you to develop the script easily step by step, where you can display the values computed so far to validate your results. I have done this in most of the cells by displaying the dataset and DataFrame at the bottom of those cells. When we run this script (Cell | Run All), we see the results at each step displayed as the script progresses: And finally we see our plot of the births as shown in the following screenshot. I was curious what metadata was stored for this script. Looking into the IPYNB file, you can see the expected value for the formula cells. 
The tabular data display of the DataFrame is stored as HTML—convenient: { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>n", "<table border="1" class="dataframe">n", "<thead>n", "<tr style="text-align: right;">n", "<th></th>n", "<th>Name</th>n", "<th>Number</th>n", "</tr>n", "</thead>n", "<tbody>n", "<tr>n", "<th>0</th>n", "<td>Alice</td>n", "<td>96</td>n", "</tr>n", "<tr>n", "<th>1</th>n", "<td>Charles</td>n", "<td>155</td>n", "</tr>n", "<tr>n", "<th>2</th>n", "<td>Diane</td>n", "<td>66</td>n", "</tr>n", "<tr>n", "<th>3</th>n", "<td>Edward</td>n", "<td>272</td>n", "</tr>n", "</tbody>n", "</table>n", "</div>" ], "text/plain": [ " Name Numbern", "0 Alice 96n", "1 Charles 155n", "2 Diane 66n", "3 Edward 272" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], The graphic output cell that is stored like this: { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "<matplotlib.axes._subplots.AxesSubplot at 0x47cf8f0>" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "<a few hundred lines of hexcodes> …/wc/B0RRYEH0EQAAAABJRU5ErkJggg==n", "text/plain": [ "<matplotlib.figure.Figure at 0x47d8e30>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# plot the datan", "df['Number'].plot()n" ] } ], Where the image/png tag contains a large hex digit string representation of the graphical image displayed on screen (I abbreviated the display in the coding shown). So, the actual generated image is stored in the metadata for the page. Python random numbers in Jupyter For many analyses we are interested in calculating repeatable results. However, much of the analysis relies on some random numbers to be used. In Python, you can set the seed for the random number generator to achieve repeatable results with the random_seed() function. In this example, we simulate rolling a pair of dice and looking at the outcome. We would example the average total of the two dice to be 6—the halfway point between the faces. The script we are using is this: import pylab import random random.seed(113) samples = 1000 dice = [] for i in range(samples): total = random.randint(1,6) + random.randint(1,6) dice.append(total) pylab.hist(dice, bins= pylab.arange(1.5,12.6,1.0)) pylab.show() Once we have the script in Jupyter and execute it, we have this result: I had added some more statistics. Not sure if I would have counted on such a high standard deviation. If we increased the number of samples, this would decrease. The resulting graph was opened in a new window, much as it would if you ran this script in another Python development environment. The toolbar at the top of the graphic is extensive, allowing you to manipulate the graphic in many ways. Summary In this article, we walked through simple data access in Jupyter through Python. Then we saw an example of using pandas. We looked at a graphics example. Finally, we looked at an example using random numbers in a Python script. Resources for Article: Further resources on this subject: Python Data Science Up and Running [article] Mining Twitter with Python – Influence and Engagement [article] Unsupervised Learning [article]


Prepare for our 2017 Awards with Mapt

Packt
21 Oct 2016
2 min read
At Packt, we're committed to supporting developers as they learn the skills they need to remain relevant in their field. But what exactly does relevant mean? To us, relevance is about the impact you have. And we believe that software should always have an impact, whether it's for a business or for customers - whoever it is for, it's ultimately about making a difference.
We want to reward developers who make an impact. Whether you're a web developer who's creating awesome applications and websites that are engaging users every single day, or even a data analyst who has used machine learning to uncover revealing insights about healthcare or the environment, we're going to want to hear from you.
We don't want to give too much away right now, but we're confident that you're going to be interested in our awards... So, to prepare yourself for them, get started on Mapt and find your route through some of the most important skills in software today.
What are you waiting for? We're sponsoring seats on Mapt for limited prices this week. That means you'll be able to get a subscription for a special discounted price - but be quick, each discount is time-limited! Subscribe here.

Resolving Deadlock in HBase

Ted Yu
20 Oct 2016
4 min read
In this post, I will walk you through how to resolve a tricky deadlock scenario when using HBase. To get a better idea of the details for this scenario, take a look at the following JIRA. This tricky scenario relates to HBASE-13651, which tried to handle the case where one region server removes the compacted hfiles, leading to FileNotFoundExceptions on another machine. Unlike the deadlocks that I have resolved in the past, this deadlock rarely happens, but it occurs when one thread tries to obtain a write lock, while the other thread holds a read lock of the same ReentrantReadWriteLock. Understanding the Scenario To fully understand this scenario, let's go ahead and take a look at a concrete example.  For handler 12, HRegion#refreshStoreFiles() obtains a lock on writestate (line 4919).And then it tries to get the write lock of updatesLock (a ReentrantReadWriteLock) in dropMemstoreContentsForSeqId(): "B.defaultRpcServer.handler=12,queue=0,port=16020" daemon prio=10 tid=0x00007f205cf8d000nid=0x8f0b waiting on condition [0x00007f203ea85000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for<0x00000006708113c8> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197) at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:945) at org.apache.hadoop.hbase.regionserver.HRegion.dropMemstoreContentsForSeqId(HRegion.java:4568) at org.apache.hadoop.hbase.regionserver.HRegion.refreshStoreFiles(HRegion.java:4919) - locked <0x00000006707c3500> (a org.apache.hadoop.hbase.regionserver.HRegion$WriteState) at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.handleFileNotFound(HRegion.java:6104) at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:5736) at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:5875) For handler 24, HRegion$RegionScannerImpl.next() gets a read lock, and tries to obtain a lock on writestate in handleFileNotFound(): "B.defaultRpcServer.handler=24,queue=0,port=16020" daemon prio=10 tid=0x00007f205cfa6000nid=0x8f17 waiting for monitor entry [0x00007f203de79000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.hadoop.hbase.regionserver.HRegion.refreshStoreFiles(HRegion.java:4887) - waiting to lock <0x00000006707c3500> (a org.apache.hadoop.hbase.regionserver.HRegion$WriteState) at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.handleFileNotFound(HRegion.java:6104) at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:5736) at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:5875) at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:5653) at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:5630) - locked <0x00000007130162c8> (a org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl) at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:5616) at 
org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:6810)
at org.apache.hadoop.hbase.regionserver.HRegion.getIncrementCurrentValue(HRegion.java:7673)
at org.apache.hadoop.hbase.regionserver.HRegion.applyIncrementsToColumnFamily(HRegion.java:7583)
at org.apache.hadoop.hbase.regionserver.HRegion.doIncrement(HRegion.java:7480)
at org.apache.hadoop.hbase.regionserver.HRegion.increment(HRegion.java:7440)
As you can see, these two handlers get into a deadlock.
Fixing the Deadlock
So, how can you fix this? The fix breaks the deadlock (on handler 12) by remembering the tuples to be passed to dropMemstoreContentsForSeqId(), releasing the read lock, and then calling dropMemstoreContentsForSeqId(). Mingmin, the reporter of the bug, kindly deployed a patched JAR on his production cluster so that the deadlock of this form no longer exists. Take a look at this.
I hope my experiences encountering this tricky situation will be of some help to you in the event that you see such a scenario in the future.
About the author
Ted Yu is a staff engineer at HortonWorks. He has also been an HBase committer/PMC for 5 years. His work on HBase covers various components: security, backup/restore, load balancer, MOB, and so on. He has provided support for customers at eBay, Micron, PayPal, and JPMC. He is also a Spark contributor.


An Introduction to Moodle 3 and MoodleCloud

Packt
19 Oct 2016
20 min read
In this article by Silvina Paola Hillar, the author of the book Moodle Theme Development we will introduce e-learning and virtual learning environments such as Moodle and MoodleCloud, explaining their similarities and differences. Apart from that, we will learn and understand screen resolution and aspect ratio, which is the information we need in order to develop Moodle themes. In this article, we shall learn the following topics: Understanding what e-learning is Learning about virtual learning environments Introducing Moodle and MoodleCloud Learning what Moodle and MoodleCloud are Using Moodle on different devices Sizing the screen resolution Calculating the aspect ratio Learning about sharp and soft images Learning about crisp and sharp text Understanding what anti-aliasing is (For more resources related to this topic, see here.) Understanding what e-learning is E-learning is electronic learning, meaning that it is not traditional learning in a classroom with a teacher and students, plus the board. E-learning involves using a computer to deliver classes or a course. When delivering classes or a course, there is online interaction between the student and the teacher. There might also be some offline activities, when a student is asked to create a piece of writing or something else. Another option is that there are collaboration activities involving the interaction of several students and the teacher. When creating course content, there is the option of video conferencing as well. So there is virtual face-to-face interaction within the e-learning process. The time and the date should be set beforehand. In this way, e-learning is trying to imitate traditional learning to not lose human contact or social interaction. The course may be full distance or not. If the course is full distance, there is online interaction only. All the resources and activities are delivered online and there might be some interaction through messages, chats, or emails between the student and the teacher. If the course is not full distance, and is delivered face to face but involving the use of computers, we are referring to blended learning. Blended learning means using e-learning within the classroom, and is a mixture of traditional learning and computers. The usage of blended learning with little children is very important, because they get the social element, which is essential at a very young age. Apart from that, they also come into contact with technology while they are learning. It is advisable to use interactive whiteboards (IWBs) at an early stage. IWBs are the right tool to choose when dealing with blended learning. IWBs are motivational gadgets, which are prominent in integrating technology into the classroom. IWBs are considered a symbol of innovation and a key element of teaching students. IWBs offer interactive projections for class demonstrations; we can usually project resources from computer software as well as from our Moodle platform. Students can interact with them by touching or writing on them, that is to say through blended learning. Apart from that, teachers can make presentations on different topics within a subject and these topics become much more interesting and captivating for students, since IWBs allows changes to be made and we can insert interactive elements into the presentation of any subject. There are several types of technology used in IWBs, such as touch technology, laser scanning, and electromagnetic writing tools. 
Therefore, we have to bear in mind which to choose when we get an IWB. On the other hand, the widespread use of mobile devices nowadays has turned e-learning into mobile learning. Smartphones and tablets allows students to learn anywhere at any time. Therefore, it is important to design course material that is usable by students on such devices. Moodle is a learning platform through which we can design, build and create e-learning environments. It is possible to create online interaction and have video conferencing sessions with students. Distance learning is another option if blended learning cannot be carried out. We can also choose Moodle mobile. We can download the app from App Store, Google Play, Windows Store, or Windows Phone Store. We can browse the content of courses, receive messages, contact people from the courses, upload different types of file, and view course grades, among other actions. Learning about Virtual Learning Environments Virtual Learning Environment (VLE) is a type of virtual environment that supports both resources and learning activities; therefore, students can have both passive and active roles. There is also social interaction, which can take place through collaborative work as well as video conferencing. Students can also be actors, since they can also construct the VLE. VLEs can be used for both distance and blended learning, since they can enrich courses. Mobile learning is also possible because mobile devices have access to the Internet, allowing teachers and students to log in to their courses. VLEs are designed in such a way that they can carry out the following functions or activities: Design, create, store, access, and use course content Deliver or share course content Communicate, interact, and collaborate between students and teachers Assess and personalize the learning experience Modularize both activities and resources Customize the interface We are going to deal with each of these functions and activities and see how useful they might be when designing our VLE for our class. When using Moodle, we can perform all the functions and activities mentioned here, because Moodle is a VLE. Design, create, store, access and use course content If we use the Moodle platform to create the course, we have to deal with course content. Therefore, when we add a course, we have to add its content. We can choose the weekly outline section or the topic under which we want to add the content. We click on Add an activity or resource and two options appear, resources and activities; therefore, the content can be passive or active for the student. When we create or design activities in Moodle, the options are shown in the following screenshot: Another option for creating course content is to reuse content that has already been created and used before in another VLE. In other words, we can import or export course materials, since most VLEs have specific tools designed for such purposes. This is very useful and saves time. There are a variety of ways for teachers to create course materials, due to the fact that the teacher thinks of the methodology, as well as how to meet the student's needs, when creating the course. Moodle is designed in such a way that it offers a variety of combinations that can fit any course content. Deliver or share course content Before using VLEs, we have to log in, because all the content is protected and is not open to the general public. In this way, we can protect property rights, as well as the course itself. 
All participants must be enrolled in the course unless it has been opened to the public. Teachers can gain remote access in order to create and design their courses. This is quite profitable since they can build the content at home, rather than in their workplace. They need login access and they need to switch roles to course creator in order to create the content. Follow these steps to switch roles to course creator: Under Administration, click on Switch role to… | Course creator, as shown in the following screenshot: When the role has been changed, the teacher can create content that students can access. Once logged in, students have access to the already created content, either activities or resources. The content is available over the Internet or the institution's intranet connection. Students can access the content anywhere if any of these connections are available. If MoodleCloud is being used, there must be an Internet connection, otherwise it is impossible for both students and teachers to log in. Communicate, interact, and collaborate among students and teachers Communication, interaction, and collaborative working are the key factors to social interaction and learning through interchanging ideas. VLEs let us create course content activities, because these actions are allows they are elemental for our class. There is no need to be an isolated learner, because learners have the ability to communicate between themselves and with the teachers. Moodle offers the possibility of video conferencing through the Big Blue Button. In order to install the Big Blue Button plugin in Moodle, visit the following link:https://moodle.org/plugins/browse.php?list=set&id=2. This is shown in the following screenshot: If you are using MoodleCloud, the Big Blue Button is enabled by default, so when we click on Add an activity or resource it appears in the list of activities, as shown in the following screenshot: Assess and personalize the learning experience Moodle allows the teacher to follow the progress of students so that they can assess and grade their work, as long as they complete the activities. Resources cannot be graded, since they are passive content for students, but teachers can also check when a participant last accessed the site. Badges are another element used to personalize the learning experience. We can create badges for students when they complete an activity or a course; they are homework rewards. Badges are quite good at motivating young learners. Modularize both activities and resources Moodle offers the ability to build personalized activities and resources. There are several ways to present both, with all the options Moodle offers. Activities can be molded according to the methodology the teacher uses. In Moodle 3, there are new question types within the Quiz activity. The question types are as follows: Select missing words Drag and drop into text Drag and drop onto image Drag and drop markers The question types are shown after we choose Quiz in the Add a resource or Activity menu, in the weekly outline section or topic that we have chosen. The types of question are shown in the following screenshot: Customize the interface Moodle allows us to customize the interface in order to develop the look and feel that we require; we can add a logo for the school or institution that the Moodle site belongs to. We can also add another theme relevant to the subject or course that we have created. 
The main purpose in customizing the interface is to avoid all subjects and courses looking the same. Later in the article, we will learn how to customize the interface. Learning Moodle and MoodleCloud Modular Object-Oriented Dynamic Learning Environment (Moodle) is a learning platform designed in such a way that we can create VLEs. Moodle can be downloaded, installed and run on any web server software using Hypertext Preprocessor (PHP). It can support a SQL database and can run on several operating systems. We can download Moodle 3.0.3 from the following URL: https://download.moodle.org/. This URL is shown in the following screenshot: MoodleCloud, on the other hand, does not need to be downloaded since, as its name suggests, is in the cloud. Therefore, we can get our own Moodle site with MoodleCloud within minutes and for free. It is Moodle's hosting platform, designed and run by the people who make Moodle. In order to get a MoodleCloud site, we need to go to the following URL: https://moodle.com/cloud/. This is shown in the following screenshot: MoodleCloud was created in order to cater for users with fewer requirements and small budgets. In order to create an account, you need to add your cell phone number to receive an SMS which we must be input when creating your site. As it is free, there are some limitations to MoodleCloud, unless we contact Moodle Partners and pay for an expanded version of it. The limitations are as follows: No more than 50 users 200 MB disk space Core themes and plugins only One site per phone number Big Blue Button sessions are limited to 6 people, with no recordings There are advertisements When creating a Moodle site, we want to change the look and functionality of the site or individual course. We may also need to customize themes for Moodle, in order to give the course the desired look. Therefore, this article will explain the basic concepts that we have to bear in mind when dealing with themes, due to the fact that themes are shown in different devices. In the past, Moodle ran only on desktops or laptops, but nowadays Moodle can run on many different devices, such as smartphones, tablets, iPads, and smart TVs, and the list goes on. Using Moodle on different devices Moodle can be used on different devices, at different times, in different places. Therefore, there are factors that we need to be aware of when designing courses and themes.. Therefore, here after in this article, there are several aspects and concepts that we need to deepen into in order to understand what we need to take into account when we design our courses and build our themes. Devices change in many ways, not only in size but also in the way they display our Moodle course. Moodle courses can be used on anything from a tiny device that fits into the palm of a hand to a huge IWB or smart TV, and plenty of other devices in between. Therefore, such differences have to be taken into account when choosing images, text, and other components of our course. We are going to deal with sizing screen resolution, calculating the aspect ratio, types of images such as sharp and soft, and crisp and sharp text. Finally, but importantly, the anti-aliasing method is explained. Sizing the screen resolution Number of pixels the display of device has, horizontally and vertically and the color depth measuring the number of bits representing the color of each pixel makes up the screen resolution. The higher the screen resolution, the higher the productivity we get. 
In the past, the screen resolution of a display was important since it determined the amount of information displayed on the screen. The lower the resolution, the fewer items would fit on the screen; the higher the resolution, the more items would fit on the screen. The resolution varies according to the hardware in each device. Nowadays, the screen resolution is more about a pleasant visual experience, since we would rather see more quality than more stuff on the screen. That is the reason why the screen resolution matters. There might be different display sizes where the screen resolutions are the same, that is to say, the total number of pixels is the same. If we compare a laptop (13'' screen with a resolution of 1280 x 800) and a desktop (with a 17'' monitor with the same 1280 x 800 resolution), although the monitor is larger, the number of pixels is the same; the only difference is that we will be able to see everything bigger on the monitor. Therefore, instead of seeing more stuff, we see higher quality.

Screen resolution chart

Code | Width | Height | Ratio | Description
QVGA | 320 | 240 | 4:3 | Quarter Video Graphics Array
FHD | 1920 | 1080 | ~16:9 | Full High Definition
HVGA | 640 | 240 | 8:3 | Half Video Graphics Array
HD | 1360 | 768 | ~16:9 | High Definition
HD | 1366 | 768 | ~16:9 | High Definition
HD+ | 1600 | 900 | ~16:9 | High Definition plus
VGA | 640 | 480 | 4:3 | Video Graphics Array
SVGA | 800 | 600 | 4:3 | Super Video Graphics Array
XGA | 1024 | 768 | 4:3 | Extended Graphics Array
XGA+ | 1152 | 768 | 3:2 | Extended Graphics Array plus
XGA+ | 1152 | 864 | 4:3 | Extended Graphics Array plus
SXGA | 1280 | 1024 | 5:4 | Super Extended Graphics Array
SXGA+ | 1400 | 1050 | 4:3 | Super Extended Graphics Array plus
UXGA | 1600 | 1200 | 4:3 | Ultra Extended Graphics Array
QXGA | 2048 | 1536 | 4:3 | Quad Extended Graphics Array
WXGA | 1280 | 768 | 5:3 | Wide Extended Graphics Array
WXGA | 1280 | 720 | ~16:9 | Wide Extended Graphics Array
WXGA | 1280 | 800 | 16:10 | Wide Extended Graphics Array
WXGA | 1366 | 768 | ~16:9 | Wide Extended Graphics Array
WXGA+ | 1280 | 854 | 3:2 | Wide Extended Graphics Array plus
WXGA+ | 1440 | 900 | 16:10 | Wide Extended Graphics Array plus
WXGA+ | 1440 | 960 | 3:2 | Wide Extended Graphics Array plus
WQHD | 2560 | 1440 | ~16:9 | Wide Quad High Definition
WQXGA | 2560 | 1600 | 16:10 | Wide Quad Extended Graphics Array
WSVGA | 1024 | 600 | ~17:10 | Wide Super Video Graphics Array
WSXGA | 1600 | 900 | ~16:9 | Wide Super Extended Graphics Array
WSXGA | 1600 | 1024 | 16:10 | Wide Super Extended Graphics Array
WSXGA+ | 1680 | 1050 | 16:10 | Wide Super Extended Graphics Array plus
WUXGA | 1920 | 1200 | 16:10 | Wide Ultra Extended Graphics Array
WQUXGA | 3840 | 2400 | 16:10 | Wide Quad Ultra Extended Graphics Array
4K UHD | 3840 | 2160 | 16:9 | Ultra High Definition
4K UHD | 1536 | 864 | 16:9 | Ultra High Definition

Considering that 3840 x 2160 displays (also known as 4K, QFHD, Ultra HD, UHD, or 2160p) are already available for laptops and monitors, a pleasant visual experience with high DPI displays can be a good long-term investment for your desktop applications. The DPI setting for the monitor causes another common problem: a change in the effective resolution. Consider a 13.3" display that offers a 3200 x 1800 resolution and is configured with an OS DPI of 240. The high DPI setting makes the system use both larger fonts and larger UI elements; therefore, the elements consume more pixels to render than the same elements displayed at the same resolution configured with an OS DPI of 96. The effective resolution of a display that provides 3200 x 1800 pixels configured at 240 DPI is 1280 x 720.
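As a quick, hedged illustration of where that figure comes from (the exact arithmetic is spelled out in the next paragraph), a small helper of our own can reproduce it. This is a hypothetical Java sketch for illustration only, not code taken from Moodle or from this book:

public class EffectiveResolution {
    // effective resolution = physical pixels / scale factor, where scale factor = OS DPI / 96
    static long effective(int physicalPixels, int osDpi) {
        double scaleFactor = osDpi / 96.0;
        return Math.round(physicalPixels / scaleFactor);
    }

    public static void main(String[] args) {
        // 3200 x 1800 physical pixels at 240 DPI gives 1280 x 720 effective pixels
        System.out.println(effective(3200, 240) + " x " + effective(1800, 240));
        // The same panel configured at 192 DPI gives 1600 x 900 effective pixels
        System.out.println(effective(3200, 192) + " x " + effective(1800, 192));
    }
}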
The effective resolution can become a big problem because an application that requires a minimum resolution of the old standard 1024 x 768 pixels with an OS DPI of 96 would have problems with a 3200 x 1800-pixel display configured at 240 DPI, and it wouldn't be possible to display all the necessary UI elements. It may sound crazy, but the effective vertical resolution is 720 pixels, lower than the 768 vertical pixels required by the application to display all the UI elements without problems. The formula to calculate the effective resolution is simple: divide the physical pixels by the scale factor (OS DPI / 96). For example, the following formula calculates the horizontal effective resolution of the previous example: 3200 / (240 / 96) = 3200 / 2.5 = 1280; and the following formula calculates the vertical effective resolution: 1800 / (240 / 96) = 1800 / 2.5 = 720. The effective resolution would be 1600 x 900 pixels if the same physical resolution were configured at 192 DPI. Effective horizontal resolution: 3200 / (192 / 96) = 3200 / 2 = 1600; and vertical effective resolution: 1800 / (192 / 96) = 1800 / 2 = 900.

Calculating the aspect ratio

The aspect ratio is the proportional relationship between the width and the height of an image. It is used to describe the shape of a computer screen or a TV. The aspect ratio of a standard-definition (SD) screen is 4:3, that is to say, a relatively square rectangle. The aspect ratio is often expressed in W:H format, where W stands for width and H stands for height. 4:3 means four units wide to three units high. High-definition TVs (HDTVs), by contrast, have a 16:9 ratio, which is a wider rectangle. Why do we calculate the aspect ratio? Because every frame, digital video, canvas, image, or responsive design has a rectangular shape, and that shape has to be well defined so that it fits properly on different and distinct devices.

Learning about sharp and soft images

Images can be either sharp or soft. Sharp is the opposite of soft. A soft image has less pronounced details, while a sharp image has more contrast between pixels. The more pixels the image has, the sharper it is. We can soften an image, in which case it loses information, but we cannot sharpen one; in other words, we can't add more information to an image. In order to compare sharp and soft images, we can visit the following website, where we can convert bitmaps to vector graphics. We can convert bitmap images such as .png, .jpeg, or .gif into a .svg in order to get an anti-aliased image. We can do this in a simple step, working with an online tool to vectorize the bitmap at http://vectormagic.com/home. There are plenty of features to take into account when vectorizing. We can design a bitmap using an image editor and upload the bitmap image from the clipboard, or upload the file from our computer. Once the image is uploaded to the application, we can start working. Another possibility is to use the sample images on the website, which we are going to use in order to see the anti-aliasing effect. We convert bitmap images, which are made up of pixels, into vector images, which are made up of shapes. The shapes are mathematical descriptions of images and do not become pixelated when scaling up. Vector graphics can handle scaling without any problems. Vector images are the preferred type to work with in graphic design on paper or clothes.
Go to http://vectormagic.com/home and click on Examples, as shown in the following screenshot: After clicking on Examples, the bitmap appears on the left and the vectorized image on the right. The bitmap is blurred and soft; the SVG has an anti-aliasing effect, therefore the image is sharp. The result is shown in the following screenshot: Learning about crisp and sharp text There are sharp and soft images, and there is also crisp and sharp text, so it is now time to look at text. What is the main difference between these two? When we say that the text is crisp, we mean that there is more anti-aliasing, in other words it has more grey pixels around the black text. The difference is shown when we zoom in to 400%. On the other hand, sharp mode is superior for small fonts because it makes each letter stronger. There are four options in Photoshop to deal with text: sharp, crisp, strong, and smooth. Sharp and crisp have already been mentioned in the previous paragraphs. Strong is notorious for adding unnecessary weight to letter forms, and smooth looks closest to the untinted anti-aliasing, and it remains similar to the original. Understanding what anti-aliasing is The word anti-aliasing means the technique used in order to minimize the distortion artifacts. It applies intermediate colors in order to eliminate pixels, that is to say the saw-tooth or pixelated lines. Therefore, we need to look for a lower resolution so that the saw-tooth effect does not appear when we make the graphic bigger. Test your knowledge Before we delve deeper into more content, let's test your knowledge about all the information that we have dealt with in this article: Moodle is a learning platform with which… We can design, build and create E-learning environments. We can learn. We can download content for students. BigBlueButtonBN… Is a way to log in to Moodle. Lets you create links to real-time online classrooms from within Moodle. Works only in MoodleCloud. MoodleCloud… Is not open source. Does not allow more than 50 users. Works only for universities. The number of pixels the display of the device has horizontally and vertically, and the color depth measuring the number of bits representing the color of each pixel, make up… Screen resolution. Aspect ratio. Size of device. Anti-aliasing can be applied to … Only text. Only images. Both images and text. Summary In this article, we have covered most of what needs to be known about e-learning, VLEs, and Moodle and MoodleCloud. There is a slight difference between Moodle and MoodleCloud specially if you don't have access to a Moodle course in the institution where you are working and want to design a Moodle course. Moodle is used in different devices and there are several aspects to take into account when designing a course and building a Moodle theme. We have dealt with screen resolution, aspect ratio, types of images and text, and anti-aliasing effects. Resources for Article: Further resources on this subject: Listening Activities in Moodle 1.9: Part 2 [article] Gamification with Moodle LMS [article] Adding Graded Activities [article]

Heart Diseases Prediction using Spark 2.0.0

Packt
18 Oct 2016
16 min read
In this article, Md. Rezaul Karim and Md. Mahedi Kaysar, the authors of the book Large Scale Machine Learning with Spark, discuss how to develop a large-scale heart disease prediction pipeline with Spark 2.0.0, covering steps such as taking input, parsing, creating labeled points for regression, model training, model saving, and finally predictive analytics using the trained model. In this article, they will develop a large-scale machine learning application using several classifiers, such as the random forest, decision tree, and linear regression classifiers. To make this happen, the following steps will be covered:

Data collection and exploration
Loading required packages and APIs
Creating an active Spark session
Data parsing and RDD of label point creation
Splitting the RDD of label point into training and test set
Training the model
Model saving for future use
Predictive analysis using the test set
Predictive analytics using the new dataset
Performance comparison among different classifiers

(For more resources related to this topic, see here.)

Background

Machine learning and big data together are a powerful combination that has made a great impact on research in academia and industry, as well as in the biomedical sector. In the area of biomedical data analytics, this combination can be applied to real datasets for diagnosis and prognosis, enabling better healthcare. Moreover, life science research is also entering the big data era, since datasets are being generated and produced at an unprecedented rate. This imposes great challenges on machine learning and bioinformatics tools and algorithms to find the value in big data, characterized by criteria such as volume, velocity, variety, veracity, visibility, and value. In this article, we will show how to predict the possibility of future heart disease by using Spark machine learning APIs, including Spark MLlib, Spark ML, and Spark SQL.

Data collection and exploration

In recent times, biomedical research has advanced significantly, and more and more life sciences datasets are being generated, many of them openly available. However, for simplicity and ease, we decided to use the Cleveland database, because to date most of the researchers who have applied machine learning techniques to biomedical data analytics have used this dataset. According to the dataset description at https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/heart-disease.names, the heart disease dataset is one of the most used and best-studied datasets among researchers in biomedical data analytics and machine learning. The dataset is freely available at the UCI machine learning dataset repository at https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/. The data contains a total of 76 attributes; however, most of the published research papers use a subset of only 14 of those fields. The goal field is used to indicate whether heart disease is present or absent. It has 5 possible values ranging from 0 to 4. The value 0 signifies no presence of heart disease. The values 1 and 2 signify that the disease is present but in the primary stage. The values 3 and 4, on the other hand, indicate a strong possibility of heart disease. Biomedical laboratory experiments with the Cleveland dataset have concentrated on simply attempting to distinguish presence (values 1, 2, 3, 4) from absence (value 0). In short, the higher the value, the stronger the evidence for the presence of the disease.
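In code terms, this presence/absence distinction boils down to collapsing the num field to a binary label. The tiny helper below is our own hypothetical preview of the parsing logic that appears in full later in this article; it is an illustration only, not code from the book:

public class LabelBinarizer {
    // Values 1-4 indicate increasing evidence of disease; 0 indicates absence
    static double toBinaryLabel(int num) {
        return num > 0 ? 1.0 : 0.0;
    }

    public static void main(String[] args) {
        System.out.println(toBinaryLabel(0)); // 0.0 -> absence
        System.out.println(toBinaryLabel(3)); // 1.0 -> presence
    }
}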
Another consideration is that privacy is an important concern in the area of biomedical data analytics, as well as in all kinds of diagnosis and prognosis. Therefore, the names and social security numbers of the patients were recently removed from the dataset to avoid the privacy issue. Consequently, those values have been replaced with dummy values instead. It is to be noted that three files have been processed, containing the Cleveland, Hungarian, and Switzerland datasets altogether. All four unprocessed files also exist in this directory. To demonstrate the example, we will use the Cleveland dataset for training and evaluating the models. The Hungarian dataset will then be used to re-use the saved model. As already mentioned, although the number of attributes is 76 (including the predicted attribute), like other ML/biomedical researchers, we will use only 14 attributes, with the following attribute information:

No. | Attribute name | Explanation
1 | age | Age in years
2 | sex | Sex (1 = male; 0 = female)
3 | cp | Chest pain type: Value 1: typical angina; Value 2: atypical angina; Value 3: non-angina pain; Value 4: asymptomatic
4 | trestbps | Resting blood pressure (in mm Hg on admission to the hospital)
5 | chol | Serum cholesterol in mg/dl
6 | fbs | Fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
7 | restecg | Resting electrocardiographic results: Value 0: normal; Value 1: having ST-T wave abnormality; Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
8 | thalach | Maximum heart rate achieved
9 | exang | Exercise induced angina (1 = yes; 0 = no)
10 | oldpeak | ST depression induced by exercise relative to rest
11 | slope | The slope of the peak exercise ST segment: Value 1: upsloping; Value 2: flat; Value 3: down-sloping
12 | ca | Number of major vessels (0-3) coloured by fluoroscopy
13 | thal | Heart rate: Value 3 = normal; Value 6 = fixed defect; Value 7 = reversible defect
14 | num | Diagnosis of heart disease (angiographic disease status): Value 0: < 50% diameter narrowing; Value 1: > 50% diameter narrowing

Table 1: Dataset characteristics

Note that there are several missing attribute values, distinguished with the value -9.0. The Cleveland dataset contains the following class distribution:

Database | 0 | 1 | 2 | 3 | 4 | Total
Cleveland | 164 | 55 | 36 | 35 | 13 | 303

A sample snapshot of the dataset is given as follows:

Figure 1: Snapshot of the Cleveland heart diseases dataset

Loading required packages and APIs

The following packages and APIs need to be imported for our purpose.
We believe the packages are self-explanatory if you have minimum working experience with Spark 2.0.0.: import java.util.HashMap; import java.util.List; import org.apache.spark.api.java.JavaPairRDD; import org.apache.spark.api.java.JavaRDD; import org.apache.spark.api.java.function.Function; import org.apache.spark.api.java.function.PairFunction; import org.apache.spark.ml.classification.LogisticRegression; import org.apache.spark.mllib.classification.LogisticRegressionModel; import org.apache.spark.mllib.classification.NaiveBayes; import org.apache.spark.mllib.classification.NaiveBayesModel; import org.apache.spark.mllib.linalg.DenseVector; import org.apache.spark.mllib.linalg.Vector; import org.apache.spark.mllib.regression.LabeledPoint; import org.apache.spark.mllib.regression.LinearRegressionModel; import org.apache.spark.mllib.regression.LinearRegressionWithSGD; import org.apache.spark.mllib.tree.DecisionTree; import org.apache.spark.mllib.tree.RandomForest; import org.apache.spark.mllib.tree.model.DecisionTreeModel; import org.apache.spark.mllib.tree.model.RandomForestModel; import org.apache.spark.rdd.RDD; import org.apache.spark.sql.Dataset; import org.apache.spark.sql.Row; import org.apache.spark.sql.SparkSession; import com.example.SparkSession.UtilityForSparkSession; import javassist.bytecode.Descriptor.Iterator; import scala.Tuple2; Creating an active Spark session SparkSession spark = UtilityForSparkSession.mySession(); Here is the UtilityForSparkSession class that creates and returns an active Spark session: import org.apache.spark.sql.SparkSession; public class UtilityForSparkSession { public static SparkSession mySession() { SparkSession spark = SparkSession .builder() .appName("UtilityForSparkSession") .master("local[*]") .config("spark.sql.warehouse.dir", "E:/Exp/") .getOrCreate(); return spark; } } Note that here in Windows 7 platform, we have set the Spark SQL warehouse as "E:/Exp/", set your path accordingly based on your operating system. Data parsing and RDD of Label point creation Taken input as simple text file, parse them as text file and create RDD of label point that will be used for the classification and regression analysis. Also specify the input source and number of partition. Adjust the number of partition based on your dataset size. Here number of partition has been set to 2: String input = "heart_diseases/processed_cleveland.data"; Dataset<Row> my_data = spark.read().format("com.databricks.spark.csv").load(input); my_data.show(false); RDD<String> linesRDD = spark.sparkContext().textFile(input, 2); Since, JavaRDD cannot be created directly from the text files; rather we have created the simple RDDs, so that we can convert them as JavaRDD when necessary. Now let's create the JavaRDD with Label Point. 
However, we need to convert the RDD to JavaRDD to serve our purpose that goes as follows: JavaRDD<LabeledPoint> data = linesRDD.toJavaRDD().map(new Function<String, LabeledPoint>() { @Override public LabeledPoint call(String row) throws Exception { String line = row.replaceAll("\?", "999999.0"); String[] tokens = line.split(","); Integer last = Integer.parseInt(tokens[13]); double[] features = new double[13]; for (int i = 0; i < 13; i++) { features[i] = Double.parseDouble(tokens[i]); } Vector v = new DenseVector(features); Double value = 0.0; if (last.intValue() > 0) value = 1.0; LabeledPoint lp = new LabeledPoint(value, v); return lp; } }); Using the replaceAll() method we have handled the invalid values like missing values that are specified in the original file using ? character. To get rid of the missing or invalid value we have replaced them with a very large value that has no side-effect to the original classification or predictive results. The reason behind this is that missing or sparse data can lead you to highly misleading results. Splitting the RDD of label point into training and test set Well, in the previous step, we have created the RDD label point data that can be used for the regression or classification task. Now we need to split the data as training and test set. That goes as follows: double[] weights = {0.7, 0.3}; long split_seed = 12345L; JavaRDD<LabeledPoint>[] split = data.randomSplit(weights, split_seed); JavaRDD<LabeledPoint> training = split[0]; JavaRDD<LabeledPoint> test = split[1]; If you see the preceding code segments, you will find that we have split the RDD label point as 70% as the training and 30% goes to the test set. The randomSplit() method does this split. Note that, set this RDD's storage level to persist its values across operations after the first time it is computed. This can only be used to assign a new storage level if the RDD does not have a storage level set yet. The split seed value is a long integer that signifies that split would be random but the result would not be a change in each run or iteration during the model building or training. Training the model and predict the heart diseases possibility At the first place, we will train the linear regression model which is the simplest regression classifier. final double stepSize = 0.0000000009; final int numberOfIterations = 40; LinearRegressionModel model = LinearRegressionWithSGD.train(JavaRDD.toRDD(training), numberOfIterations, stepSize); As you can see the preceding code trains a linear regression model with no regularization using Stochastic Gradient Descent. This solves the least squares regression formulation f (weights) = 1/n ||A weights-y||^2^; which is the mean squared error. Here the data matrix has n rows, and the input RDD holds the set of rows of A, each with its corresponding right-hand side label y. Also to train the model it takes the training set, number of iteration and the step size. We provide here some random value of the last two parameters. Model saving for future use Now let's save the model that we just created above for future use. 
It's pretty simple; just use the following code, specifying the storage location:

String model_storage_loc = "models/heartModel";
model.save(spark.sparkContext(), model_storage_loc);

Once the model is saved in your desired location, you will see the following output in your Eclipse console:

Figure 2: The log after the model is saved to storage

Predictive analysis using the test set

Now let's calculate the prediction score on the test dataset:

JavaPairRDD<Double, Double> predictionAndLabel = test.mapToPair(new PairFunction<LabeledPoint, Double, Double>() {
  @Override
  public Tuple2<Double, Double> call(LabeledPoint p) {
    return new Tuple2<>(model.predict(p.features()), p.label());
  }
});

Then compute the accuracy of the prediction:

double accuracy = predictionAndLabel.filter(new Function<Tuple2<Double, Double>, Boolean>() {
  @Override
  public Boolean call(Tuple2<Double, Double> pl) {
    return pl._1().equals(pl._2());
  }
}).count() / (double) test.count();
System.out.println("Accuracy of the classification: " + accuracy);

The output goes as follows:

Accuracy of the classification: 0.0

Performance comparison among different classifiers

Unfortunately, there is no prediction accuracy at all, right? There might be several reasons for that, including:

The dataset characteristics
Model selection
Parameter selection, also called hyperparameter tuning

Well, for simplicity, we assume the dataset is okay since, as already said, it is widely used for machine learning research by many researchers around the globe. Now, what next? Let's consider another classifier algorithm, for example the random forest or the decision tree classifier. What about the random forest? Let's try the random forest classifier next. Just use the code below to train the model using the training set:

Integer numClasses = 26; // Number of classes
// categoricalFeaturesInfo maps categorical feature indices to their number of categories;
// an empty map means all features are treated as continuous
HashMap<Integer, Integer> categoricalFeaturesInfo = new HashMap<Integer, Integer>();
Integer numTrees = 5; // Use more in practice.
String featureSubsetStrategy = "auto"; // Let the algorithm choose the best
String impurity = "gini"; // also information gain & variance reduction available
Integer maxDepth = 20; // set the value of maximum depth accordingly
Integer maxBins = 40; // set the value of bins accordingly
Integer seed = 12345; // Setting a long seed value is recommended
final RandomForestModel model = RandomForest.trainClassifier(training, numClasses, categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins, seed);

We believe the parameters used by the trainClassifier() method are self-explanatory, and we leave it to the readers to learn the significance of each parameter. Fantastic! We have trained the model using the random forest classifier and could save the model too for future use. Now if you reuse the same code that we described in the Predictive analysis using the test set step, you should have the output as follows:

Accuracy of the classification: 0.7843137254901961

Much better, right? If you are still not satisfied, you can try with another classifier model, such as the Naïve Bayes classifier.

Predictive analytics using the new dataset

As we already mentioned, we have saved the model for future use, so now we should take the opportunity to use the same model on new datasets. If you recall the steps, we trained the model using the training set and evaluated it using the test set.
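Before we reuse the saved model on new data, here is a brief aside on the Naïve Bayes option mentioned above. The following is a rough, hedged sketch that trains it on the same training RDD using the NaiveBayes API already imported at the beginning of this article; the smoothing value of 1.0 is an arbitrary illustration rather than a tuned parameter, and note that MLlib's Naïve Bayes requires non-negative feature values, so the missing-value markers in the data would need to be handled first:

// Train a Naive Bayes model with a smoothing (lambda) value of 1.0
final NaiveBayesModel nbModel = NaiveBayes.train(JavaRDD.toRDD(training), 1.0);

// Score the test set exactly as before, but with the Naive Bayes model
JavaPairRDD<Double, Double> nbPredictionAndLabel = test.mapToPair(new PairFunction<LabeledPoint, Double, Double>() {
  @Override
  public Tuple2<Double, Double> call(LabeledPoint p) {
    return new Tuple2<>(nbModel.predict(p.features()), p.label());
  }
});

double nbAccuracy = nbPredictionAndLabel.filter(new Function<Tuple2<Double, Double>, Boolean>() {
  @Override
  public Boolean call(Tuple2<Double, Double> pl) {
    return pl._1().equals(pl._2());
  }
}).count() / (double) test.count();
System.out.println("Accuracy of the Naive Bayes classification: " + nbAccuracy);

If this still underperforms the random forest, that is not unusual for this dataset; careful feature handling and hyperparameter tuning usually matter more than the choice of algorithm alone.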
Now if you have more data or new data available to be used? Will you go for re-training the model? Of course not since you will have to iterate several steps and you will have to sacrifice valuable time and cost too. Therefore, it would be wise to use the already trained model and predict the performance on a new dataset. Well, now let's reuse the stored model then. Note that you will have to reuse the same model that is to be trained the same model. For example, if you have done the model training using the Random forest classifier and saved the model while reusing you will have to use the same classifier model to load the saved model. Therefore, we will use the Random forest to load the model while using the new dataset. Use just the following code for doing that. Now create RDD label point from the new dataset (that is, Hungarian database with same 14 attributes): String new_data = "heart_diseases/processed_hungarian.data"; RDD<String> linesRDD = spark.sparkContext().textFile(new_data, 2); JavaRDD<LabeledPoint> data = linesRDD.toJavaRDD().map(new Function<String, LabeledPoint>() { @Override public LabeledPoint call(String row) throws Exception { String line = row.replaceAll("\?", "999999.0"); String[] tokens = line.split(","); Integer last = Integer.parseInt(tokens[13]); double[] features = new double[13]; for (int i = 0; i < 13; i++) { features[i] = Double.parseDouble(tokens[i]); } Vector v = new DenseVector(features); Double value = 0.0; if (last.intValue() > 0) value = 1.0; LabeledPoint p = new LabeledPoint(value, v); return p; } }); Now let's load the saved model using the Random forest model algorithm as follows: RandomForestModel model2 = RandomForestModel.load(spark.sparkContext(), model_storage_loc); Now let's calculate the prediction on test set: JavaPairRDD<Double, Double> predictionAndLabel = data.mapToPair(new PairFunction<LabeledPoint, Double, Double>() { @Override public Tuple2<Double, Double> call(LabeledPoint p) { return new Tuple2<>(model2.predict(p.features()), p.label()); } }); Now calculate the accuracy of the prediction as follows: double accuracy = predictionAndLabel.filter(new Function<Tuple2<Double, Double>, Boolean>() { @Override public Boolean call(Tuple2<Double, Double> pl) { return pl._1().equals(pl._2()); } }).count() / (double) data.count(); System.out.println("Accuracy of the classification: "+accuracy); We got the following output: Accuracy of the classification: 0.7380952380952381 To get more interesting and fantastic machine learning application like spam filtering, topic modelling for real-time streaming data, handling graph data for machine learning, market basket analysis, neighborhood clustering analysis, Air flight delay analysis, making the ML application adaptable, Model saving and reusing, hyperparameter tuning and model selection, breast cancer diagnosis and prognosis, heart diseases prediction, optical character recognition, hypothesis testing, dimensionality reduction for high dimensional data, large-scale text manipulation and many more visits inside. Moreover, the book also contains how to scaling up the ML model to handle massive big dataset on cloud computing infrastructure. Furthermore, some best practice in the machine learning techniques has also been discussed. 
In a nutshell many useful and exciting application have been developed using the following machine learning algorithms: Linear Support Vector Machine (SVM) Linear Regression Logistic Regression Decision Tree Classifier Random Forest Classifier K-means Clustering LDA topic modelling from static and real-time streaming data Naïve Bayes classifier Multilayer Perceptron classifier for deep classification Singular Value Decomposition (SVD) for dimensionality reduction Principal Component Analysis (PCA) for dimensionality reduction Generalized Linear Regression Chi Square Test (for goodness of fit test, independence test, and feature test) KolmogorovSmirnovTest for hypothesis test Spark Core for Market Basket Analysis Multi-label classification One Vs Rest classifier Gradient Boosting classifier ALS algorithm for movie recommendation Cross-validation for model selection Train Split for model selection RegexTokenizer, StringIndexer, StopWordsRemover, HashingTF and TF-IDF for text manipulation Summary In this article we came to know that how beneficial large scale machine learning with Spark is with respect to any field. Resources for Article: Further resources on this subject: Spark for Beginners [article] Setting up Spark [article] Holistic View on Spark [article]

Managing Application Configuration

Packt
18 Oct 2016
14 min read
In this article by Sean McCord author of the book CoreOS Cookbook, we will explore some of the options available to help bridge the configuration divide with the following topics: Configuring by URL Translating etcd to configuration files Building EnvironmentFiles Building an active configuration manager Using fleet globals (For more resources related to this topic, see here.) Configuring by URL One of the most direct ways to obtain application configurations is by URL. You can generate a configuration and store it as a file somewhere, or construct a configuration from a web request, returning the formatted file. In this section, we will construct a dynamic redis configuration by web request and then run redis using it. Getting ready First, we need a configuration server. This can be S3, an object store, etcd, a NodeJS application, a rails web server, or just about anything. The details don't matter, as long as it speaks HTTP. We will construct a simple one here using Go, just in case you don't have one ready. Make sure your GOPATH is set and create a new directory named configserver. Then, create a new file in that directory called main.go with the following contents: package main import ( "html/template" "log" "net/http" ) func init() { redisTmpl = template.Must(template.New("rcfg").Parse(redisString)) } func main() { http.HandleFunc("/config/redis", redisConfig) log.Fatal(http.ListenAndServe(":8080", nil)) } func redisConfig(w http.ResponseWriter, req *http.Request) { // TODO: pull configuration from database redisTmpl.Execute(w, redisConfigOpts{ Save: true, MasterIP: "192.168.25.100", MasterPort: "6379", }) } type redisConfigOpts struct { Save bool // Should redis save db to file? MasterIP string // IP address of the redis master MasterPort string // Port of the redis master } var redisTmpl *template.Template const redisString = ` {{if .Save}} save 900 1 save 300 10 save 60 10000 {{end}} slaveof {{.MasterIP}} {{.MasterPort}} ` For our example, we simply statically configure the values, but it is easy to see how we could query etcd or another database to fill in the appropriate values on demand. Now, just go and build and run the config server, and we are ready to implement our configURL-based configuration. How to do it... By design, CoreOS is a very stripped down OS. However, one of the tools it does come with is curl, which we can use to download our configuration. All we have to do is add it to our systemd/fleet unit file. For the redis-slave.service input the following: [Unit] Description=Redis slave server After=docker.service [Service] ExecStartPre=/usr/bin/mkdir -p /tmp/config/redis-slave ExecStartPre=/usr/bin/curl -s -o /tmp/config/redis-slave/redis.conf http://configserver-address:8080/config/redis ExecStartPre=-/usr/bin/docker kill %p ExecStartPre=-/usr/bin/docker rm %p ExecStart=/usr/bin/docker run --rm --name %p -v /tmp/config/redis-slave/redis.conf:/tmp/redis.conf redis:alpine /tmp/redis.conf We have made the configserver's address configserver-address in the preceding code, so make certain you fill in the appropriate IP for the system running the config server. How it works... We outsource the work of generating the configuration to the web server or beyond. This is a common idiom in modern cluster-oriented systems: many small pieces work together to make the whole. The idea of using a configuration URL is very flexible. 
In this case, it allows us to use a pre-packaged, official Docker image for an application that has no knowledge of the cluster, in its standard, default setup. While redis is fairly simple, the same concept can be used to generate and supply configurations for almost any legacy application. Translating etcd to configuration files In CoreOS, we have a well-suited database that is evidenced by its name and well suited to configuration (while the name etc is an abbreviation for the Latin et cetera, in common UNIX usage, /etc is where the system configuration is stored). It presents a standard HTTP server, which is easy to access from nearly anything. This makes storing application configuration in etcd a natural choice. The only problem is devising methods of storing the configuration in ways that are sufficiently expressive, flexible, and usable. Getting ready A naive but simple way of using etcd is to simply use it as a key-oriented file store as follows: etcdctl set myconfig $(cat mylocalconfig.conf |base64) etcdctl get myconfig |base64 -d > mylocalconfig.conf However, this method stores the configuration file in the database as a static, opaque blob and store/retrieve. Decoupling the generation from the consumption yields much more flexibility both in adapting configuration content to multiple consumers and producers and scaling out multiple access uses. How to do it... We can store and retrieve an entire configuration blob storage very simply as follows: etcdctl set /redis/config $(cat redis.conf |base64) etcdctl get /redis/config |base64 -d > redis.conf Or we can store more generally-structured data as follows: etcdctl set /redis/config/master 192.168.9.23 etcdctl set /redis/config/loglevel notice etcdctl set /redis/config/dbfile dump.rdb And use it in different ways: REDISMASTER=$(curl -s http://localhost:2379/v2/keys/redis/config/master |jq .node.value) cat <<ENDHERE >/etc/redis.conf slaveof $(curl -s http://localhost:2379/v2/keys/redis/config/master jq .node.value) loglevel $(etcdctl get /redis/config/loglevel) dbfile $(etcdctl get /redis/config/dbfile) ENDHERE Building EnvironmentFiles Environment variables are a popular choice for configuring container executions because nearly anything can read or write them, especially shell scripts. Moreover, they are always ephemeral, and by widely-accepted convention they override configuration file settings. Getting ready Systemd provides an EnvironmentFile directive that can be issued multiple times in a service file. This directive takes the argument of a filename that should contain key=value pairs to be loaded into the execution environment of the ExecStart program. CoreOS provides (in most non-bare metal installations) the file /etc/environment, which is formatted to be included with an EnvironmentFile statement. It typically contains variables describing the public and private IPs of the host. Environment file A common misunderstanding when starting out with Docker is about environment variables. Docker does not inherit the environment variables of the environment that calls docker run. Environment variables that are to be passed to the container must be explicitly stated using the -e option. This can be particularly confounding since systemd units do much the same thing. Therefore, to pass environments into Docker from a systemd unit, you need to define them both in the unit and in the docker run invocation. 
So this will work as expected: [Service] Environment=TESTVAR=testVal ExecStart=/usr/bin/docker -e TESTVAR=$TESTVAR nginx Whereas this will not: [Service] Environment=TESTVAR=unknowableVal ExecStart=/usr/bin/docker nginx How to do it... We will start by constructing an environment file generator unit. For testapp-env.service use the following: [Unit] Description=EnvironmentFile generator for testapp Before=testapp.service BindsTo=testapp.service [Install] RequiredBy=testapp.service [Service] ExecStart=/bin/sh -c "echo NOW=$(date +'%%'s) >/run/now.env" Type=oneshot RemainAfterExit=yes You may note the odd syntax for the date format. Systemd expands %s internally, so it needs to be escaped to be passed to the shell unmolested. For testapp.service use the following: [Unit] Description=My Amazing test app, configured by EnvironmentFile [Service] EnvironmentFile=/run/now.env ExecStart=/usr/bin/docker run --rm -p 8080:8080 -e NOW=${NOW} ulexus/environmentfile-demo If you are using fleet, you can submit these service files. If you are using raw systemd, you will need to install them into the /etc/systemd/system. Then issue the following: systemctl daemon-reload systemctl enable testapp-env.service systemctl start testapp.service testapp output How it works... The first unit writes the current UNIX timestamp to the file `/run/now.env and the second unit reads that file, parsing its contents into environment variables. We further pass the desired environment variables into the docker execution. Taking apart the first unit, there a number of important components. They are as follows: The Before statement tells systemd that the unit should be started before the main testapp. This is important so that the environment file exists before the service is started. Otherwise the unit will fail because the file does not exist or reads the wrong data if the file is stale. The BindsTo setting tells systemd that the unit should be stopped and started with testapp.service. This makes sure that it is restarted when testapp is restarted, refreshing the environment file. The RequiredBy setting tells systemd that this unit is required by the other unit. By stating the relationship in this manner, it allows the first unit to be separately enabled or disabled without any modification of the first unit. While that wouldn't matter in this case, in cases where the target service is a standard unit file which knows nothing about the helper unit, it allows us to use the add-on without fear of our changes to the official, standard service unit. The Type and RemainAfterExit combination of settings tells systemd to expect that the unit will exit, but to treat the unit as up even after it has exited. This allows the prerequisite to operate even though the unit has exited. In the second unit, the main service, the main thing to note is the EnvironmentFile line. It simply takes a file as an argument. We reference the file that was created (or updated) by the first script. Systemd reads it into the environment for any Exec* statements. Because Docker separates its containers' environments, we do still have to manually pass that variable into the container with the -e flag to docker run. There's more... You might be trying to figure out why we don't combine the units and try to set the environment variable with an ExecStartPre statement. Modifications to the environment from an Exec* statement are isolated from each other's Exec* statements. 
You can make changes to the environment within an Exec* statement, but those changes will not be carried over to any other Exec* statement. Also, you cannot execute any commands in an Environment or EnvironmentFile statement, nor can they expand any variables themselves. Building an active configuration manager Dynamic systems are, well, dynamic. They will often change while a dependent service is running. In such a case, simple runtime configuration systems as we have discussed thus far are insufficient. We need the ability to tell our dependent services to use the new, changed configuration. For such cases as this, we can implement active configuration management. In an active configuration, some processes monitor the state of dynamic components and notify or restart dependent services with the updated data. Getting ready Much like the active service announcer, we will be building our active configuration manager in Go, so a functional Go development environment is required. To increase readability, we have broken each subroutine into a separate file. How to do it... First, we construct the main routine, as follows: main.go: package main import ( "log" "os" "github.com/coreos/etcd/clientv3" "golang.org/x/net/context" ) var etcdKey = "web:backends" func main() { ctx := context.Background() log.Println("Creating etcd client") c, err := clientv3.NewFromURL(os.Getenv("ETCD_ENDPOINTS")) if err != nil { log.Fatal("Failed to create etcd client:", err) os.Exit(1) } defer c.Close() w := c.Watch(ctx, etcdKey, clientv3.WithPrefix()) for resp := range w { if resp.Canceled { log.Fatal("etcd watcher died") os.Exit(1) } go reconfigure(ctx, c) } } Next, our reconfigure routine, which pulls the current state from etcd, writes the configuration to file, and restarts our service, as follows: reconfigure.go: package main import ( "github.com/coreos/etcd/clientv3" "golang.org/x/net/context" ) // reconfigure haproxy func reconfigure(ctx context.Context, c *clientv3.Client) error { backends, err := get(ctx, c) if err != nil { return err } if err = write(backends); err != nil { return err } return restart() } The reconfigure routine just calls get, write and restart, in sequence. 
Let's create each of those as follows: get.go: package main import ( "bytes" "github.com/coreos/etcd/clientv3" "golang.org/x/net/context" ) // get the present list of backends func get(ctx context.Context, c *clientv3.Client) ([]string, error) { resp, err := clientv3.NewKV(c).Get(ctx, etcdKey) if err != nil { return nil, err } var backends = []string{} for _, node := range resp.Kvs { if node.Value != nil { v := bytes.NewBuffer(node.Value).String() backends = append(backends, v) } } return backends, nil } write.go: package main import ( "html/template" "os" ) var configTemplate *template.Template func init() { configTemplate = template.Must(template.New("config").Parse(configTemplateString)) } // Write the updated config file func write(backends []string) error { cf, err := os.Create("/config/haproxy.conf") if err != nil { return err } defer cf.Close() return configTemplate.Execute(cf, backends) } var configTemplateString = ` frontend public bind 0.0.0.0:80 default_backend servers backend servers {{range $index, $ip := .}} server srv-$index $ip {{end}} ` restart.go: package main import "github.com/coreos/go-systemd/dbus" // restart haproxy func restart() error { conn, err := dbus.NewSystemdConnection() if err != nil { return err } _, err = conn.RestartUnit("haproxy.service", "ignore-dependencies", nil) return err } With our active configuration manager available, we can now create a service unit to run it, as follows: haproxy-config-manager.service: [Unit] Description=Active configuration manager [Service] ExecStart=/usr/bin/docker run --rm --name %p -v /data/config:/data -v /var/run/dbus:/var/run/dbus -v /run/systemd:/run/systemd -e ETCD_ENDPOINTS=http://${COREOS_PUBLIC_IPV4}:2379 quay.io/ulexus/demo-active-configuration-manager Restart=always RestartSec=10 [X-Fleet] MachineOf=haproxy.service How it works... First, we monitor the pertinent keys in etcd. It helps to have all of the keys under one prefix, but if that isn't the case, we can simply add more watchers. When a change occurs, we pull the present values for all the pertinent keys from etcd and then rebuild our configuration file. Next, we tell systemd to restart the dependent service. If the target service has a valid ExecReload, we could tell systemd to reload, instead. In order to talk to systemd, we have passed in the dbus and systemd directories, to enable access to their respective sockets. Using fleet globals When you have a set of services that should be run on each of a set of machines, it can be tedious to run discrete and separate unit instances for each node. Fleet provides a reasonably flexible way to run these kinds of services, and when nodes are added, it will automatically start any declared globals on these machines. Getting ready In order to use fleet globals, you will need fleet running on each machine on which the globals will be executed. This is usually a simple matter of enabling fleet within the cloud-config as follows: #cloud-config coreos: fleet: metadata: service=nginx,cpu=i7,disk=ssd public-ip: "$public_ipv4" units: - name: fleet.service command: start How to do it... To make a fleet unit a global, simply declare the Global=true parameter in the [X-Fleet] section of the unit as follows: [Unit] Description=My global service [Service] ExecStart=/usr/bin/docker run --rm -p 8080:80 nginx [X-Fleet] Global=true Globals can also be filtered with other keys. 
For instance, a common filter is to run globals on all nodes that have certain metadata: [Unit] Description=My partial global service [Service] ExecStart=/usr/bin/docker run --rm -p 8080:80 nginx [X-Fleet] Global=true MachineMetadata=service=nginx Note that the metadata that is being referred to here is the fleet metadata, which is distinct from the instance metadata of your cloud provider or even the node tags of Kubernetes. How it works... Unlike most fleet units, there is not a one-to-one correspondence between the fleet unit instance and the actual running services. This has the side effect that modifications to a fleet global have immediate global effect. In other words, there is no rolling update with a fleet global. There is an immediate, universal replacement only. Hence, do not use globals for services that cannot be wholly down during upgrades. Summary We overcome the challenges for administrators who comes from traditional static deployment environments. We learned that we can't just build configuration or deploy it. It needs to be proactive in running environment. Any changes needs to be reloaded. Resources for Article: Further resources on this subject: How to Set Up CoreOS Environment [article] CoreOS Networking and Flannel Internals [article] Let's start with Extending Docker [article]

Diving into Data – Search and Report

Packt
17 Oct 2016
11 min read
In this article by Josh Diakun, Paul R Johnson, and Derek Mock authors of the books Splunk Operational Intelligence Cookbook - Second Edition, we will cover the basic ways to search the data in Splunk. We will cover how to make raw event data readable (For more resources related to this topic, see here.) The ability to search machine data is one of Splunk's core functions, and it should come as no surprise that many other features and functions of Splunk are heavily driven-off searches. Everything from basic reports and dashboards to data models and fully featured Splunk applications are powered by Splunk searches behind the scenes. Splunk has its own search language known as the Search Processing Language (SPL). This SPL contains hundreds of search commands, most of which also have several functions, arguments, and clauses. While a basic understanding of SPL is required in order to effectively search your data in Splunk, you are not expected to know all the commands! Even the most seasoned ninjas do not know all the commands and regularly refer to the Splunk manuals, website, or Splunk Answers (http://answers.splunk.com). To get you on your way with SPL, be sure to check out the search command cheat sheet and download the handy quick reference guide available at http://docs.splunk.com/Documentation/Splunk/latest/SearchReference/SplunkEnterpriseQuickReferenceGuide. Searching Searches in Splunk usually start with a base search, followed by a number of commands that are delimited by one or more pipe (|) characters. The result of a command or search to the left of the pipe is used as the input for the next command to the right of the pipe. Multiple pipes are often found in a Splunk search to continually refine data results as needed. As we go through this article, this concept will become very familiar to you. Splunk allows you to search for anything that might be found in your log data. For example, the most basic search in Splunk might be a search for a keyword such as error or an IP address such as 10.10.12.150. However, searching for a single word or IP over the terabytes of data that might potentially be in Splunk is not very efficient. Therefore, we can use the SPL and a number of Splunk commands to really refine our searches. The more refined and granular the search, the faster the time to run and the quicker you get to the data you are looking for! When searching in Splunk, try to filter as much as possible before the first pipe (|) character, as this will save CPU and disk I/O. Also, pick your time range wisely. Often, it helps to run the search over a small time range when testing it and then extend the range once the search provides what you need. Boolean operators There are three different types of Boolean operators available in Splunk. These are AND, OR, and NOT. Case sensitivity is important here, and these operators must be in uppercase to be recognized by Splunk. The AND operator is implied by default and is not needed, but does no harm if used. For example, searching for the term error or success would return all the events that contain either the word error or the word success. Searching for error success would return all the events that contain the words error and success. Another way to write this can be error AND success. Searching web access logs for error OR success NOT mozilla would return all the events that contain either the word error or success, but not those events that also contain the word mozilla. 
Common commands

There are many commands in Splunk that you will likely use on a daily basis when searching data within Splunk. These common commands are outlined in the following table:

Command | Description
chart/timechart | This command outputs results in a tabular and/or time-based output for use by Splunk charts.
dedup | This command de-duplicates results based upon specified fields, keeping the most recent match.
eval | This command evaluates new or existing fields and values. There are many different functions available for eval.
fields | This command specifies the fields to keep or remove in search results.
head | This command keeps the first X (as specified) rows of results.
lookup | This command looks up fields against an external source or list, to return additional field values.
rare | This command identifies the least common values of a field.
rename | This command renames the fields.
replace | This command replaces the values of fields with another value.
search | This command permits subsequent searching and filtering of results.
sort | This command sorts results in either ascending or descending order.
stats | This command performs statistical operations on the results. There are many different functions available for stats.
table | This command formats the results into a tabular output.
tail | This command keeps only the last X (as specified) rows of results.
top | This command identifies the most common values of a field.
transaction | This command merges events into a single event based upon a common transaction identifier.

Time modifiers

The drop-down time range picker in the Graphical User Interface (GUI) to the right of the Splunk search bar allows users to select from a number of different preset and custom time ranges. However, in addition to using the GUI, you can also specify time ranges directly in your search string using the earliest and latest time modifiers. When a time modifier is used in this way, it automatically overrides any time range that might be set in the GUI time range picker. The earliest and latest time modifiers can accept a number of different time units: seconds (s), minutes (m), hours (h), days (d), weeks (w), months (mon), quarters (q), and years (y). Time modifiers can also make use of the @ symbol to round down and snap to a specified time. For example, searching for sourcetype=access_combined earliest=-1d@d latest=-1h will search all the access_combined events from midnight a day ago until an hour ago from now. Note that the snap (@) will round down such that if it were 12 p.m. now, we would be searching from midnight a day and a half ago until 11 a.m. today.

Working with fields

Fields in Splunk can be thought of as keywords that have one or more values. These fields are fully searchable by Splunk. At a minimum, every data source that comes into Splunk will have the source, host, index, and sourcetype fields, but some sources might have hundreds of additional fields. If the raw log data contains key-value pairs or is in a structured format such as JSON or XML, then Splunk will automatically extract the fields and make them searchable. Splunk can also be told how to extract fields from the raw log data in the backend props.conf and transforms.conf configuration files. Searching for specific field values is simple. For example, sourcetype=access_combined status!=200 will search for events with a sourcetype field value of access_combined that have a status field with a value other than 200.
Splunk has a number of built-in pre-trained sourcetypes that ship with Splunk Enterprise that might work with out-of-the-box, common data sources. These are available at http://docs.splunk.com/Documentation/Splunk/latest/Data/Listofpretrainedsourcetypes. In addition, Technical Add-Ons (TAs), which contain event types and field extractions for many other common data sources such as Windows events, are available from the Splunk app store at https://splunkbase.splunk.com. Saving searches Once you have written a nice search in Splunk, you may wish to save the search so that you can use it again at a later date or use it for a dashboard. Saved searches in Splunk are known as Reports. To save a search in Splunk, you simply click on the Save As button on the top right-hand side of the main search bar and select Report. Making raw event data readable When a basic search is executed in Splunk from the search bar, the search results are displayed in a raw event format by default. To many users, this raw event information is not particularly readable, and valuable information is often clouded by other less valuable data within the event. Additionally, if the events span several lines, only a few events can be seen on the screen at any one time. In this recipe, we will write a Splunk search to demonstrate how we can leverage Splunk commands to make raw event data readable, tabulating events and displaying only the fields we are interested in. Getting ready You should be familiar with the Splunk search bar and search results area. How to do it… Follow the given steps to search and tabulate the selected event data: Log in to your Splunk server. Select the Search & Reporting application from the drop-down menu located in the top left-hand side of the screen. Set the time range picker to Last 24 hours and type the following search into the Splunk search bar: index=main sourcetype=access_combined Then, click on Search or hit Enter. Splunk will return the results of the search and display the raw search events under the search bar. Let's rerun the search, but this time we will add the table command as follows: index=main sourcetype=access_combined | table _time, referer_domain, method, uri_path, status, JSESSIONID, useragent Splunk will now return the same number of events, but instead of presenting the raw events to you, the data will be in a nicely formatted table, displaying only the fields we specified. This is much easier to read! Save this search by clicking on Save As and then on Report. Give the report the name cp02_tabulated_webaccess_logs and click on Save. On the next screen, click on Continue Editing to return to the search. How it works… Let's break down the search piece by piece: Search fragment Description index=main All the data in Splunk is held in one or more indexes. While not strictly necessary, it is a good practice to specify the index (es) to search, as this will ensure a more precise search. sourcetype=access_combined This tells Splunk to search only the data associated with the access_combined sourcetype, which, in our case, is the web access logs. | table _time, referer_domain, method, uri_path, action, JSESSIONID, useragent Using the table command, we take the result of our search to the left of the pipe and tell Splunk to return the data in a tabular format. Splunk will only display the fields specified after the table command in the table of results.  In this recipe, you used the table command. The table command can have a noticeable performance impact on large searches. 
It should be used towards the end of a search, once all the other processing of the data by the other Splunk commands has been performed. The stats command is more efficient than the table command and should be used in place of table where possible. However, be aware that stats and table are two very different commands.

There's more…

The table command is very useful in situations where we wish to present data in a readable format. Additionally, tabulated data in Splunk can be downloaded as a CSV file, which many users find useful for offline processing in spreadsheet software or for sending to others. There are some other ways we can leverage the table command to make our raw event data readable.

Tabulating every field

Often, there are situations where we want to present every event within the data in a tabular format, without having to specify each field one by one. To do this, we simply use the wildcard (*) character as follows:

index=main sourcetype=access_combined | table *

Removing fields, then tabulating everything else

While tabulating every field using the wildcard (*) character is useful, you will notice that a number of Splunk internal fields, such as _raw, appear in the table. We can use the fields command before the table command to remove these fields as follows:

index=main sourcetype=access_combined | fields - sourcetype, index, _raw, source, date*, linecount, punct, host, time*, eventtype | table *

If we do not include the minus (-) character after the fields command, Splunk will keep the specified fields and remove all the other fields.

Summary

In this article, along with an introduction to searching in Splunk, we covered how to make raw event data readable.

Resources for Article: Further resources on this subject: Splunk's Input Methods and Data Feeds [Article], The Splunk Interface [Article], The Splunk Web Framework [Article]
How to build a desktop app using Electron

Amit Kothari
17 Oct 2016
9 min read
Desktop apps are making a comeback. Even companies with cloud-based applications and polished web apps are investing in desktop apps to offer a better user experience. One example is the team collaboration tool Slack, which built a really good desktop app with web technologies using Electron.

Electron is an open source framework used to build cross-platform desktop apps with web technologies. It uses Node.js and Chromium and allows us to develop desktop GUI apps using HTML, CSS, and JavaScript. Electron is developed by GitHub, initially for the Atom editor, but it is now used by many companies, including Slack, WordPress, Microsoft, and Docker, to name a few. Electron apps are web apps running in an embedded Chromium web browser, with access to the full suite of Node.js modules and the underlying operating system. In this post we will build a simple desktop app using Electron.

Hello Electron

Let's start by creating a simple app. Before we start, we need Node.js and npm installed. Follow the instructions on the Node.js website if you do not have these installed already.

Create a new directory for your application and, inside the app directory, create a package.json file by using the npm init command. Follow the prompts and remember to set main.js as the entry point. Once the file is generated, install electron-prebuilt, which is the precompiled version of Electron, and add it as a dev dependency in the package.json using the command npm install --save-dev electron-prebuilt. Also add "start": "electron ." under scripts, which we will use later to start our app. The package.json file will look something like this:

{
  "name": "electron-tutorial",
  "version": "1.0.0",
  "description": "Electron Tutorial",
  "main": "main.js",
  "scripts": {
    "start": "electron ."
  },
  "devDependencies": {
    "electron-prebuilt": "^1.3.3"
  }
}

Create a file main.js with the following content:

const {app, BrowserWindow} = require('electron');

// Global reference of the window object.
let mainWindow;

// When Electron finishes initialization, create a window and load the app's index.html
app.on('ready', () => {
  mainWindow = new BrowserWindow({ width: 800, height: 600 });
  mainWindow.loadURL(`file://${__dirname}/index.html`);
});

We defined main.js as the entry point to our app in package.json. In main.js, the Electron app module controls the application lifecycle, and BrowserWindow is used to create a native browser window. When Electron finishes initializing and our app is ready, we create a browser window to load our web page, index.html. As mentioned in the Electron documentation, remember to keep a global reference to the window object to prevent it from closing automatically when the JavaScript garbage collector kicks in.

Finally, create the index.html file:

<!DOCTYPE html>
<html>
<head>
  <meta charset="UTF-8">
  <title>Hello Electron</title>
</head>
<body>
  <h1>Hello Electron</h1>
</body>
</html>

We can now start our app by running the npm start command.

Testing the Electron app

Let's write some integration tests for our app using Spectron. Spectron allows us to test Electron apps using ChromeDriver and WebdriverIO. It is test-framework agnostic, but for this example, we will use Mocha to write the tests.

Let's start by adding spectron and mocha as dev dependencies using the npm install --save-dev spectron and npm install --save-dev mocha commands. Then add "test": "./node_modules/mocha/bin/mocha" under scripts in the package.json file. This will be used to run our tests later.
The package.json should look something like this:

{
  "name": "electron-tutorial",
  "version": "1.0.0",
  "description": "Electron Tutorial",
  "main": "main.js",
  "scripts": {
    "start": "electron .",
    "test": "./node_modules/mocha/bin/mocha"
  },
  "devDependencies": {
    "electron-prebuilt": "^1.3.3",
    "mocha": "^3.0.2",
    "spectron": "^3.3.0"
  }
}

Now that we have all the dependencies installed, let's write some tests. Create a directory called test and a file called test.js inside it. Copy the following content to test.js:

var Application = require('spectron').Application;
var electron = require('electron-prebuilt');
var assert = require('assert');

describe('Sample app', function () {
  var app;

  beforeEach(function () {
    app = new Application({ path: electron, args: ['.'] });
    return app.start();
  });

  afterEach(function () {
    if (app && app.isRunning()) {
      return app.stop();
    }
  });

  it('should show initial window', function () {
    return app.browserWindow.isVisible()
      .then(function (isVisible) {
        assert.equal(isVisible, true);
      });
  });

  it('should have correct app title', function () {
    return app.client.getTitle()
      .then(function (title) {
        assert.equal(title, 'Hello Electron');
      });
  });
});

Here we have a couple of simple tests. We start the app before each test and stop it after each test. The first test verifies that the app's browserWindow is visible, and the second test verifies the app's title. We can run these tests using the npm run test command. Spectron not only allows us to easily set up and tear down our app, but also gives access to various APIs, allowing us to write sophisticated tests covering various business requirements. Please have a look at its documentation for more details.

Packaging our app

Now that we have a basic app, we are ready to package and build it for distribution. We will use electron-builder for this, which offers a complete solution to distribute apps on different platforms with the option to auto-update. It is recommended to use two separate package.json files when using electron-builder: one for the development environment and build scripts, and another with the app dependencies. But for our simple app, we can just use one package.json file.

Let's start by adding electron-builder as a dev dependency using the command npm install --save-dev electron-builder. Make sure you have the name, description, version, and author defined in package.json. You also need to add electron-builder-specific options as the build property in package.json:

"build": {
  "appId": "com.amitkothari.electronsample",
  "category": "public.app-category.productivity"
}

For Mac OS, we need to specify appId and category. Look at the documentation for options for other platforms. Finally, add a script in package.json to package and build the app:

"dist": "build"

The updated package.json will look like this:

{
  "name": "electron-tutorial",
  "version": "1.0.0",
  "description": "Electron Tutorial",
  "author": "Amit Kothari",
  "main": "main.js",
  "scripts": {
    "start": "electron .",
    "test": "./node_modules/mocha/bin/mocha",
    "dist": "build"
  },
  "devDependencies": {
    "electron-prebuilt": "^1.3.3",
    "mocha": "^3.0.2",
    "spectron": "^3.3.0",
    "electron-builder": "^5.25.1"
  },
  "build": {
    "appId": "com.amitkothari.electronsample",
    "category": "public.app-category.productivity"
  }
}

Next, we need to create a build directory under our project root directory. In it, put a background.png file for the Mac OS DMG background and icon.icns for the app icon. We can now package our app by running the npm run dist command.
Todo App We’ve built a very simple app, but Electron apps can do more than just show static text. Lets add some dynamic behavior to our app and convert it into a Todo list manager. We can use any JavaScript framework of choice, from AngularJS to React, with Electron, but for this example, we will use plain JavaScript. To start with, let’s update our index.html to display a todo list: <!DOCTYPE html> <html> <head> <meta charset="UTF-8"> <title>Hello Electron</title> <link rel="stylesheet" type="text/css" href="./style.css"> </head> <body> <div class="container"> <ul id="todoList"></ul> <textarea id="todoInput" placeholder="What needs to be done ?"></textarea> <button id="addTodoButton">Add to list</button> </div> </body> <script>require('./app.js')</script> </html> We also included style.css and app.js in index.html. All our CSS will be in style.css and our app logic will be in app.js. Create the style.css file with the following content: body { margin: 0; } ul { list-style-type: none; margin: 0; padding: 0; } li { padding: 10px; border-bottom: 1px solid #ddd; } button { background-color: black; color: #fff; margin: 5px; padding: 5px; cursor: pointer; border: none; font-size: 12px; } .container { width: 100%; } #todoInput { float: left; display: block; overflow: auto; margin: 15px; padding: 10px; font-size: 12px; width: 250px; } #addTodoButton { float: left; margin: 25px 10px; } And finally create the app.js file: (function () { const addTodoButton = document.getElementById('addTodoButton'); const todoList = document.getElementById('todoList'); // Create delete button for todo item const createTodoDeleteButton = () => { const deleteButton = document.createElement("button"); deleteButton.innerHTML = "X"; deleteButton.onclick = function () { this.parentNode.outerHTML = ""; }; return deleteButton; } // Create element to show todo text const createTodoText = (todo) => { const todoText = document.createElement("span"); todoText.innerHTML = todo; return todoText; } // Create a todo item with delete button and text const createTodoItem = (todo) => { const todoItem = document.createElement("li"); todoItem.appendChild(createTodoDeleteButton()); todoItem.appendChild(createTodoText(todo)); return todoItem; } // Clear input field const clearTodoInputField = () => { document.getElementById("todoInput").value = ""; } // Add new todo item and clear input field const addTodoItem = () => { const todo = document.getElementById('todoInput').value; if (todo) { todoList.appendChild(createTodoItem(todo)); clearTodoInputField(); } } addTodoButton.addEventListener("click", addTodoItem, false); } ()); Our app.js has a self invoking function which registers a listener (addTodoItem) on addTodoButton click event. On add button click event, the addTodoItem function will add a new todo item and clear the text area. Run the app again using the npm start command. Conclusion We built a very simple app, but it shows the potential of Electron. As stated on the Electron website, if you can build a website, you can build a desktop app. I hope you find this post interesting. If you have built an application with Electron, please share it with us. About the author Amit Kothari is a full-stack software developer based in Melbourne, Australia. He has 10+ years experience in designing and implementing software, mainly in Java/JEE. His recent experience is in building web applications using JavaScript frameworks such as React and AngularJS and backend microservices/REST API in Java. 
He is passionate about lean software development and continuous delivery.
First Projects with the ESP8266

Packt
17 Oct 2016
9 min read
In this article by Marco Schwartz, author of Internet of Things with ESP8266, we will build some basic projects with the ESP8266 chip, now that it is ready to be used and connected to your Wi-Fi network. This will help you understand the basics of the ESP8266.

(For more resources related to this topic, see here.)

We are going to see three projects in this article: how to control an LED, how to read data from a GPIO pin, and how to grab the contents of a web page. We will also see how to read data from a digital sensor.

Controlling an LED

First, we are going to see how to control a simple LED. Indeed, the GPIO pins of the ESP8266 can be configured to realize many functions: inputs, outputs, PWM outputs, and also SPI or I2C communications. This first project will teach you how to use the GPIO pins of the chip as outputs.

The first step is to add an LED to our project. These are the extra components you will need for this project:

5mm LED (https://www.sparkfun.com/products/9590)
330 Ohm resistor (to limit the current in the LED) (https://www.sparkfun.com/products/8377)

The next step is to connect the LED with the resistor to the ESP8266 board. To do so, the first thing to do is to place the resistor on the breadboard. Then, place the LED on the breadboard as well, connecting the longest pin of the LED (the anode) to one pin of the resistor. Then, connect the other end of the resistor to GPIO pin 5 of the ESP8266, and the other end of the LED to the ground. This is how it should look at the end:

We are now going to light up the LED by programming the ESP8266 chip. This is the complete code for this section:

// Import required libraries
#include <ESP8266WiFi.h>

void setup() {
  // Set GPIO 5 as output
  pinMode(5, OUTPUT);
  // Set GPIO 5 on a HIGH state
  digitalWrite(5, HIGH);
}

void loop() {
}

This code simply sets the GPIO pin as an output and then applies a HIGH state to it. The HIGH state means that the pin is active and that a positive voltage (3.3V) is applied on the pin. A LOW state would mean that the output is at 0V. You can now copy this code and paste it in the Arduino IDE. Then, upload the code to the board using the instructions from the previous article. You should immediately see that the LED is lighting up. You can turn it off again by using digitalWrite(5, LOW) in the code. You could also, for example, modify the code so that the ESP8266 switches the LED on and off every second.

Reading data from a GPIO pin

As a second project in this article, we are going to read the state of a GPIO pin. For this, we will use the same pin as in the previous project. You can therefore remove the LED and the resistor that we used in the previous project. Now, simply connect this pin (GPIO 5) of the board to the positive power supply on your breadboard with a wire, therefore applying a 3.3V signal on this pin.

Reading data from a pin is really simple. This is the complete code for this part:

// Import required libraries
#include <ESP8266WiFi.h>

void setup(void) {
  // Start Serial (to display results on the Serial monitor)
  Serial.begin(115200);
  // Set GPIO 5 as input
  pinMode(5, INPUT);
}

void loop() {
  // Read GPIO 5 and print it on the Serial port
  Serial.print("State of GPIO 5: ");
  Serial.println(digitalRead(5));
  // Wait 1 second
  delay(1000);
}

We simply set the pin as an input, and then read the value of this pin and print it out every second.
Copy and paste this code into the Arduino IDE, then upload it to the board using the instructions from the previous article. This is the result you should get in the Serial monitor:

State of GPIO 5: 1

We can see that the returned value is 1 (digital state HIGH), which is what we expected, because we connected the pin to the positive power supply. As a test, you can also connect the pin to the ground, and the state should go to 0.

Grabbing the content from a web page

As a last project in this article, we are finally going to use the Wi-Fi connection of the chip to grab the content of a page. We will simply use the www.example.com page, as it's a basic page largely used for test purposes. This is the complete code for this project:

// Import required libraries
#include <ESP8266WiFi.h>

// WiFi parameters
const char* ssid = "your_wifi_network";
const char* password = "your_wifi_password";

// Host
const char* host = "www.example.com";

void setup() {
  // Start Serial
  Serial.begin(115200);

  // We start by connecting to a WiFi network
  Serial.println();
  Serial.println();
  Serial.print("Connecting to ");
  Serial.println(ssid);

  WiFi.begin(ssid, password);

  while (WiFi.status() != WL_CONNECTED) {
    delay(500);
    Serial.print(".");
  }
  Serial.println("");
  Serial.println("WiFi connected");
  Serial.println("IP address: ");
  Serial.println(WiFi.localIP());
}

int value = 0;

void loop() {
  Serial.print("Connecting to ");
  Serial.println(host);

  // Use WiFiClient class to create TCP connections
  WiFiClient client;
  const int httpPort = 80;
  if (!client.connect(host, httpPort)) {
    Serial.println("connection failed");
    return;
  }

  // This will send the request to the server
  client.print(String("GET /") + " HTTP/1.1\r\n" +
               "Host: " + host + "\r\n" +
               "Connection: close\r\n\r\n");
  delay(10);

  // Read all the lines of the reply from the server and print them to Serial
  while (client.available()) {
    String line = client.readStringUntil('\r');
    Serial.print(line);
  }

  Serial.println();
  Serial.println("closing connection");
  delay(5000);
}

The code is really basic: we first open a connection to the example.com website, and then send a GET request to grab the content of the page. Using the while (client.available()) loop, we also listen for incoming data and print it all inside the Serial monitor. You can now copy this code and paste it into the Arduino IDE. This is what you should see in the Serial monitor:

This is basically the content of the page, in pure HTML code.

Reading data from a digital sensor

In this last section of this article, we are going to connect a digital sensor to our ESP8266 chip and read data from it. As an example, we will use a DHT11 sensor that can be used to get the ambient temperature and humidity. You will need to get this component for this section: the DHT11 sensor (https://www.adafruit.com/products/386).

Let's now connect this sensor to your ESP8266. First, place the sensor on the breadboard. Then, connect the first pin of the sensor to VCC, the second pin to pin #5 of the ESP8266, and the fourth pin of the sensor to GND. This is how it will look at the end:

Note that here I've used another ESP8266 board, the Adafruit ESP8266 breakout board. We will also use the aREST framework in this example, so it's easy for you to access the measurements remotely. aREST is a complete framework to control your ESP8266 boards remotely (including from the cloud), and we are going to use it several times in the article. You can find more information about it at the following URL: http://arest.io/. Let's now configure the board.
The code is too long to be inserted here, but I will detail the most important part of it now. It starts by including the required libraries: #include "ESP8266WiFi.h" #include <aREST.h> #include "DHT.h" To install those libraries, simply look for them inside the Arduino IDE library manager. Next, we need to set the pin on which the DHT sensor is connected to: #define DHTPIN 5 #define DHTTYPE DHT11 After that we declare an instance of the DHT sensor: DHT dht(DHTPIN, DHTTYPE, 15); As earlier, you will need to insert your own Wi-Fi name and password inside the code: const char* ssid = "wifi-name"; const char* password = "wifi-pass"; We also define two variables that will hold the measurements of the sensor: float temperature; float humidity; In the setup() function of the sketch, we initialize the sensor: dht.begin(); Still in the setup() function, we expose the variables to the aREST API, so we can access them remotely via Wi-Fi: rest.variable("temperature",&temperature); rest.variable("humidity",&humidity); Finally, in the loop() function, we make the measurements from the sensor: humidity = dht.readHumidity(); temperature = dht.readTemperature(); It's now time to test the project! Simply grab all the code and put it inside the Arduino IDE. Also make sure to install the aREST Arduino library using the Arduino library manager. Now, put the ESP8266 board in bootloader mode, and upload the code to the board. After that, reset the board, and open the Serial monitor. You should see the IP address of the board being displayed: Now, we can access the measurements from the sensor remotely. Simply go to your favorite web browser, and type: 192.168.115.105/temperature You should immediately get the answer from the board, with the temperature being displayed: { "temperature": 25.00, "id": "1", "name": "esp8266", "connected": true } You can of course do the same with humidity. Note that we used here the aREST API. You can learn more about it at: http://arest.io/. Congratulations, you just completed your very first projects using the ESP8266 chip! Feel free to experiment with what you learned in this article, and start learning more about how to configure your ESP8266 chip. Summary In this article, we realized our first basic projects using the ESP8266 Wi-Fi chip. We first learned how to control a simple output, by controlling the state of an LED. Then, we saw how to read the state of a digital pin on the chip. Finally, we learned how to read data from a digital sensor, and actually grab this data using the aREST framework. We are going to go right into the main topic of the article, and build our first Internet of Things project using the ESP8266. Resources for Article: Further resources on this subject: Sending Notifications using Raspberry Pi Zero [article] The Raspberry Pi and Raspbian [article] Working with LED Lamps [article]
Learning How to Manage Records in Visualforce

Packt
14 Oct 2016
7 min read
In this article by Keir Bowden, author of the book Visualforce Development Cookbook - Second Edition, we will cover styling fields and table columns as required.

One of the common use cases for Visualforce pages is to simplify, streamline, or enhance the management of sObject records. In this article, we will use Visualforce to carry out some more advanced customization of the user interface: redrawing the form to change available picklist options, or capturing different information based on the user's selections.

(For more resources related to this topic, see here.)

Styling fields as required

Standard Visualforce input components, such as <apex:inputText />, can take an optional required attribute. If set to true, the component will be decorated with a red bar to indicate that it is required, and form submission will fail if a value has not been supplied, as shown in the following screenshot:

In the scenario where one or more inputs are required and there are additional validation rules, for example, when one of either the Email or Phone fields must be defined for a contact, this can lead to a drip feed of error messages to the user. This is because the user makes repeated unsuccessful attempts to submit the form, each time getting slightly further in the process.

Now, we will create a Visualforce page that allows a user to create a contact record. The Last Name field is captured through a non-required input decorated with a red bar identical to that created for required inputs. When the user submits the form, the controller validates that the Last Name field is populated and that one of the Email or Phone fields is populated. If any of the validations fail, details of all errors are returned to the user.

Getting ready

This topic makes use of a controller extension, so this must be created before the Visualforce page.

How to do it…

Navigate to the Apex Classes setup page by clicking on Your Name | Setup | Develop | Apex Classes. Click on the New button. Paste the contents of the RequiredStylingExt.cls Apex class from the code downloaded into the Apex Class area. Click on the Save button. Navigate to the Visualforce setup page by clicking on Your Name | Setup | Develop | Visualforce Pages. Click on the New button. Enter RequiredStyling in the Label field. Accept the default RequiredStyling that is automatically generated for the Name field. Paste the contents of the RequiredStyling.page file from the code downloaded into the Visualforce Markup area and click on the Save button. Navigate to the Visualforce setup page by clicking on Your Name | Setup | Develop | Visualforce Pages. Locate the entry for the RequiredStyling page and click on the Security link. On the resulting page, select which profiles should have access and click on the Save button.

How it works…

Opening the following URL in your browser displays the RequiredStyling page to create a new contact record: https://<instance>/apex/RequiredStyling. Here, <instance> is the Salesforce instance specific to your organization, for example, na6.salesforce.com.
Clicking on the Save button without populating any of the fields results in the save failing with a number of errors: The Last Name field is constructed from a label and text input component rather than a standard input field, as an input field would enforce the required nature of the field and stop the submission of the form: <apex:pageBlockSectionItem > <apex:outputLabel value="Last Name"/> <apex:outputPanel id="detailrequiredpanel" layout="block" styleClass="requiredInput"> <apex:outputPanel layout="block" styleClass="requiredBlock" /> <apex:inputText value="{!Contact.LastName}"/> </apex:outputPanel> </apex:pageBlockSectionItem> The required styles are defined in the Visualforce page rather than relying on any existing Salesforce style classes to ensure that if Salesforce changes the names of its style classes, this does not break the page. The controller extension save action method carries out validation of all fields and attaches error messages to the page for all validation failures: if (String.IsBlank(cont.name)) { ApexPages.addMessage(new ApexPages.Message( ApexPages.Severity.ERROR, 'Please enter the contact name')); error=true; } if ( (String.IsBlank(cont.Email)) && (String.IsBlank(cont.Phone)) ) { ApexPages.addMessage(new ApexPages.Message( ApexPages.Severity.ERROR, 'Please supply the email address or phone number')); error=true; } Styling table columns as required When maintaining records that have required fields through a table, using regular input fields can end up with an unsightly collection of red bars striped across the table. Now, we will create a Visualforce page to allow a user to create a number of contact records via a table. The contact Last Name column header will be marked as required, rather than the individual inputs. Getting ready This topic makes use of a custom controller, so this will need to be created before the Visualforce page. How to do it… First, create the custom controller by navigating to the Apex Classes setup page by clicking on Your Name | Setup | Develop | Apex Classes. Click on the New button. Paste the contents of the RequiredColumnController.cls Apex class from the code downloaded into the Apex Class area. Click on the Save button. Next, create a Visualforce page by navigating to the Visualforce setup page by clicking on Your Name | Setup | Develop | Visualforce Pages. Click on the New button. Enter RequiredColumn in the Label field. Accept the default RequiredColumn that is automatically generated for the Name field. Paste the contents of the RequiredColumn.page file from the code downloaded into the Visualforce Markup area and click on the Save button. Navigate to the Visualforce setup page by clicking on Your Name | Setup | Develop | Visualforce Pages. Locate the entry for the RequiredColumn page and click on the Security link. On the resulting page, select which profiles should have access and click on the Save button. How it works… Opening the following URL in your browser displays the RequiredColumn page: https://<instance>/apex/RequiredColumn. Here, <instance> is the Salesforce instance specific to your organization, for example, na6.salesforce.com. The Last Name column header is styled in red, indicating that this is a required field. 
Attempting to create a record where only First Name is specified results in an error message being displayed against the Last Name input for the particular row: The Visualforce page sets the required attribute on the inputField components in the Last Name column to false, which removes the red bar from the component: <apex:column > <apex:facet name="header"> <apex:outputText styleclass="requiredHeader" value="{!$ObjectType.Contact.fields.LastName.label}" /> </apex:facet> <apex:inputField value="{!contact.LastName}" required="false"/> </apex:column> The Visualforce page custom controller Save method checks if any of the fields in the row are populated, and if this is the case, it checks that the last name is present. If the last name is missing from any record, an error is added. If an error is added to any record, the save does not complete: if ( (!String.IsBlank(cont.FirstName)) || (!String.IsBlank(cont.LastName)) ) { // a field is defined - check for last name if (String.IsBlank(cont.LastName)) { error=true; cont.LastName.addError('Please enter a value'); } String.IsBlank() is used as this carries out three checks at once: to check that the supplied string is not null, it is not empty, and it does not only contain whitespace. Summary Thus in this article we successfully mastered the techniques to style fields and table columns as per the custom needs. Resources for Article: Further resources on this subject: Custom Components in Visualforce [Article] Visualforce Development with Apex [Article] Learning How to Manage Records in Visualforce [Article]
Fast Data Manipulation with R

Packt
14 Oct 2016
28 min read
Data analysis is a combination of art and science. The art part consists of data exploration and visualization, which is usually done best with better intuition and understanding of the data. The science part consists of statistical analysis, which relies on concrete knowledge of statistics and analytic skills. However, both parts of a serious research require proper tools and good skills to work with them. R is exactly the proper tool to do data analysis with. In this article by Kun Ren, author of the book Learning R Programming, we will discuss how R and data.table package make it easy to transform data and, thus, greatly unleash our productivity. (For more resources related to this topic, see here.) Loading data as data frames The most basic data structures in R are atomic vectors, such as. numeric, logical, character, and complex vector, and list. An atomic vector stores elements of the same type while list is allowed to store different types of elements. The most commonly used data structure in R to store real-world data is data frame. A data frame stores data in tabular form. In essence, a data frame is a list of vectors with equal length but maybe different types. Most of the code in this article is based on a group of fictitious data about some products (you can download the data at https://gist.github.com/renkun-ken/ba2d33f21efded23db66a68240c20c92). We will use the readr package to load the data for better handling of column types. If you don't have this package installed, please run install.packages("readr"). library(readr) product_info <- read_csv("data/product-info.csv") product_info ##    id      name  type   class released ## 1 T01    SupCar   toy vehicle      yes ## 2 T02  SupPlane   toy vehicle       no ## 3 M01     JeepX model vehicle      yes ## 4 M02 AircraftX model vehicle      yes ## 5 M03    Runner model  people      yes ## 6 M04    Dancer model  people       no Once the data is loaded into memory as a data frame, we can take a look at its column types, shown as follows: sapply(product_info, class) ##          id        name        type       class    released ## "character" "character" "character" "character" "character" Using built-in functions to manipulate data frames Although a data frame is essentially a list of vectors, we can access it like a matrix due to all column vectors being the same length. To select rows that meet certain conditions, we will supply a logical vector as the first argument of [] while the second is left empty. For example, we can take out all rows of toy type, shown as follows: product_info[product_info$type == "toy", ] ##    id     name type   class released ## 1 T01   SupCar  toy vehicle      yes ## 2 T02 SupPlane  toy vehicle       no Or, we can take out all rows that are not released. product_info[product_info$released == "no", ] ##    id     name  type   class released ## 2 T02 SupPlane   toy vehicle       no ## 6 M04   Dancer model  people       no To filter columns, we can supply a character vector as the second argument while the first is left empty, which is exactly the same with how we subset a matrix. product_info[1:3, c("id", "name", "type")] ##    id     name  type ## 1 T01   SupCar   toy ## 2 T02 SupPlane   toy ## 3 M01    JeepX model Alternatively, we can filter the data frame by regarding it as a list. We can supply only one character vector of column names in []. 
product_info[c("id", "name", "class")] ##    id      name   class ## 1 T01    SupCar vehicle ## 2 T02  SupPlane vehicle ## 3 M01     JeepX vehicle ## 4 M02 AircraftX vehicle ## 5 M03    Runner  people ## 6 M04    Dancer  people To filter a data frame by both row and column, we can supply a vector as the first argument to select rows and a vector as the second to select columns. product_info[product_info$type == "toy", c("name", "class", "released")] ##       name   class released ## 1   SupCar vehicle      yes ## 2 SupPlane vehicle       no If the row filtering condition is based on values of certain columns, the preceding code can be very redundant, especially when the condition gets more complicated. Another built-in function to simplify code is subset, as introduced previously. subset(product_info,   subset = type == "model" & released == "yes",   select = name:class) ##        name  type   class ## 3     JeepX model vehicle ## 4 AircraftX model vehicle ## 5    Runner model  people The subset function uses nonstandard evaluation so that we can directly use the columns of the data frame without typing product_info many times because the expressions are meant to be evaluated in the context of the data frame. Similarly, we can use with to evaluate an expression in the context of the data frame, that is, the columns of the data frame can be used as symbols in the expression without repeatedly specifying the data frame. with(product_info, name[released == "no"]) ## [1] "SupPlane" "Dancer" The expression can be more than a simple subsetting. We can summarize the data by counting the occurrences of each possible value of a vector. For example, we can create a table of occurrences of types of records that are released. with(product_info, table(type[released == "yes"])) ## ## model   toy ##     3     1 In addition to the table of product information, we also have a table of product statistics that describe some properties of each product. product_stats <- read_csv("data/product-stats.csv") product_stats ##    id material size weight ## 1 T01    Metal  120   10.0 ## 2 T02    Metal  350   45.0 ## 3 M01 Plastics   50     NA ## 4 M02 Plastics   85    3.0 ## 5 M03     Wood   15     NA ## 6 M04     Wood   16    0.6 Now, think of how we can get the names of products with the top three largest sizes? One way is to sort the records in product_stats by size in descending order, select id values of the top three records, and use these values to filter rows of product_info by id. top_3_id <- product_stats[order(product_stats$size, decreasing = TRUE), "id"][1:3] product_info[product_info$id %in% top_3_id, ] ##    id      name  type   class released ## 1 T01    SupCar   toy vehicle      yes ## 2 T02  SupPlane   toy vehicle       no ## 4 M02 AircraftX model vehicle      yes This approach looks quite redundant. Note that product_info and product_stats actually describe the same set of products in different perspectives. The connection between these two tables is the id column. Each id is unique and means the same product. To access both sets of information, we can put the two tables together into one data frame. 
The simplest way to do this is use merge: product_table <- merge(product_info, product_stats, by = "id") product_table ##    id      name  type   class released material size weight ## 1 M01     JeepX model vehicle      yes Plastics   50     NA ## 2 M02 AircraftX model vehicle      yes Plastics   85    3.0 ## 3 M03    Runner model  people      yes     Wood   15     NA ## 4 M04    Dancer model  people       no     Wood   16    0.6 ## 5 T01    SupCar   toy vehicle      yes    Metal  120   10.0 ## 6 T02  SupPlane   toy vehicle       no    Metal  350   45.0 Now, we can create a new data frame that is a combined version of product_table and product_info with a shared id column. In fact, if you reorder the records in the second table, the two tables still can be correctly merged. With the combined version, we can do things more easily. For example, with the merged version, we can sort the data frame with any column in one table we loaded without having to manually work with the other. product_table[order(product_table$size), ] ##    id      name  type   class released material size weight ## 3 M03    Runner model  people      yes     Wood   15     NA ## 4 M04    Dancer model  people       no     Wood   16    0.6 ## 1 M01     JeepX model vehicle      yes Plastics   50     NA ## 2 M02 AircraftX model vehicle      yes Plastics   85    3.0 ## 5 T01    SupCar   toy vehicle      yes    Metal  120   10.0 ## 6 T02  SupPlane   toy vehicle       no    Metal  350   45.0 To solve the problem, we can directly use the merged table and get the same answer. product_table[order(product_table$size, decreasing = TRUE), "name"][1:3] ## [1] "SupPlane"  "SupCar"    "AircraftX" The merged data frame allows us to sort the records by a column in one data frame and filter the records by a column in the other. For example, we can first sort the product records by weight in descending order and select all records of model type. product_table[order(product_table$weight, decreasing = TRUE), ][   product_table$type == "model",] ##    id      name  type   class released material size weight ## 6 T02  SupPlane   toy vehicle       no    Metal  350   45.0 ## 5 T01    SupCar   toy vehicle      yes    Metal  120   10.0 ## 2 M02 AircraftX model vehicle      yes Plastics   85    3.0 ## 4 M04    Dancer model  people       no     Wood   16    0.6 Sometimes, the column values are literal but can be converted to standard R data structures to better represent the data. For example, released column in product_info only takes yes and no, which can be better represented with a logical vector. We can use <- to modify the column values, as we learned previously. However, it is usually better to create a new data frame with the existing columns properly adjusted and new columns added without polluting the original data. 
To do this, we can use transform: transform(product_table,   released = ifelse(released == "yes", TRUE, FALSE),   density = weight / size) ##    id      name  type   class released material size weight ## 1 M01     JeepX model vehicle     TRUE Plastics   50     NA ## 2 M02 AircraftX model vehicle     TRUE Plastics   85    3.0 ## 3 M03    Runner model  people     TRUE     Wood   15     NA ## 4 M04    Dancer model  people    FALSE     Wood   16    0.6 ## 5 T01    SupCar   toy vehicle     TRUE    Metal  120   10.0 ## 6 T02  SupPlane   toy vehicle    FALSE    Metal  350   45.0 ##      density ## 1         NA ## 2 0.03529412 ## 3         NA ## 4 0.03750000 ## 5 0.08333333 ## 6 0.12857143 The result is a new data frame with released converted to a logical vector and a new density column added. You can easily verify that product_table is not modified at all. Additionally, note that transform is like subset, as both functions use nonstandard evaluation to allow direct use of data frame columns as symbols in the arguments so that we don't have to type product_table$ all the time. Now, we will load another table into R. It is the test results of the quality, and durability of each product. We store the data in product_tests. product_tests <- read_csv("data/product-tests.csv") product_tests ##    id quality durability waterproof ## 1 T01      NA         10         no ## 2 T02      10          9         no ## 3 M01       6          4        yes ## 4 M02       6          5        yes ## 5 M03       5         NA        yes ## 6 M04       6          6        yes Note that the values in both quality and durability contain missing values (NA). To exclude all rows with missing values, we can use na.omit(): na.omit(product_tests) ##    id quality durability waterproof ## 2 T02      10          9         no ## 3 M01       6          4        yes ## 4 M02       6          5        yes ## 6 M04       6          6        yes Another way is to use complete.cases() to get a logical vector indicating all complete rows, without any missing value,: complete.cases(product_tests) ## [1] FALSE  TRUE  TRUE  TRUE FALSE  TRUE Then, we can use this logical vector to filter the data frame. For example, we can get the id  column of all complete rows as follows: product_tests[complete.cases(product_tests), "id"] ## [1] "T02" "M01" "M02" "M04" Or, we can get the id column of all incomplete rows: product_tests[!complete.cases(product_tests), "id"] ## [1] "T01" "M03" Note that product_info, product_stats and product_tests all share an id column, and we can merge them altogether. Unfortunately, there's no built-in function to merge an arbitrary number of data frames. We can only merge two existing data frames at a time, or we'll have to merge them recursively. 
merge(product_table, product_tests, by = "id") ##    id      name  type   class released material size weight ## 1 M01     JeepX model vehicle      yes Plastics   50     NA ## 2 M02 AircraftX model vehicle      yes Plastics   85    3.0 ## 3 M03    Runner model  people      yes     Wood   15     NA ## 4 M04    Dancer model  people       no     Wood   16    0.6 ## 5 T01    SupCar   toy vehicle      yes    Metal  120   10.0 ## 6 T02  SupPlane   toy vehicle       no    Metal  350   45.0 ##   quality durability waterproof ## 1       6          4        yes ## 2       6          5        yes ## 3       5         NA        yes ## 4       6          6        yes ## 5      NA         10         no ## 6      10          9         no Data wrangling with data.table In the previous section, we had an overview on how we can use built-in functions to work with data frames. Built-in functions work, but are usually verbose. In this section, let's use data.table, an enhanced version of data.frame, and see how it makes data manipulation much easier. Run install.packages("data.table") to install the package. As long as the package is ready, we can load the package and use fread() to read the data files as data.table objects. library(data.table) product_info <- fread("data/product-info.csv") product_stats <- fread("data/product-stats.csv") product_tests <- fread("data/product-tests.csv") toy_tests <- fread("data/product-toy-tests.csv") It is extremely easy to filter data in data.table. To select the first two rows, just use [1:2], which instead selects the first two columns for data.frame. product_info[1:2] ##     id     name type   class released ## 1: T01   SupCar  toy vehicle      yes ## 2: T02 SupPlane  toy vehicle       no To filter by logical conditions, just directly type columns names as variables without quotation as the expression is evaluated within the context of product_info: product_info[type == "model" & class == "people"] ##     id   name  type  class released ## 1: M03 Runner model people      yes ## 2: M04 Dancer model people       no It is easy to select or transform columns. product_stats[, .(id, material, density = size / weight)] ##     id material   density ## 1: T01    Metal 12.000000 ## 2: T02    Metal  7.777778 ## 3: M01 Plastics        NA ## 4: M02 Plastics 28.333333 ## 5: M03     Wood        NA ## 6: M04     Wood 26.666667 The data.table object also supports using key for subsetting, which can be much faster than using ==. We can set a column as key for each data.table: setkey(product_info, id) setkey(product_stats, id) setkey(product_tests, id) Then, we can use a value to directly select rows. product_info["M02"] ##     id      name  type   class released ## 1: M02 AircraftX model vehicle      yes We can also set multiple columns as key so as to use multiple values to subset it. 
setkey(toy_tests, id, date) toy_tests[.("T02", 20160303)] ##     id     date sample quality durability ## 1: T02 20160303     75       8          8 If two data.table objects share the same key, we can join them easily: product_info[product_tests] ##     id      name  type   class released quality durability ## 1: M01     JeepX model vehicle      yes       6          4 ## 2: M02 AircraftX model vehicle      yes       6          5 ## 3: M03    Runner model  people      yes       5         NA ## 4: M04    Dancer model  people       no       6          6 ## 5: T01    SupCar   toy vehicle      yes      NA         10 ## 6: T02  SupPlane   toy vehicle       no      10          9 ##    waterproof ## 1:        yes ## 2:        yes ## 3:        yes ## 4:        yes ## 5:         no ## 6:         no Instead of creating new data.table, in-place modification is also supported. The := sets the values of a column in place without the overhead of making copies and, thus, is much faster than using <-. product_info[, released := (released == "yes")] ##     id      name  type   class released ## 1: M01     JeepX model vehicle     TRUE ## 2: M02 AircraftX model vehicle     TRUE ## 3: M03    Runner model  people     TRUE ## 4: M04    Dancer model  people    FALSE ## 5: T01    SupCar   toy vehicle     TRUE ## 6: T02  SupPlane   toy vehicle    FALSE product_info ##     id      name  type   class released ## 1: M01     JeepX model vehicle     TRUE ## 2: M02 AircraftX model vehicle     TRUE ## 3: M03    Runner model  people     TRUE ## 4: M04    Dancer model  people    FALSE ## 5: T01    SupCar   toy vehicle     TRUE ## 6: T02  SupPlane   toy vehicle    FALSE Another important argument of subsetting a data.table is by, which is used to split the data into multiple parts and for each part the second argument (j) is evaluated. For example, the simplest usage of by is counting the records in each group. In the following code, we can count the number of both released and unreleased products: product_info[, .N, by = released] ##    released N ## 1:     TRUE 4 ## 2:    FALSE 2 The group can be defined by more than one variable. For example, a tuple of type and class can be a group, and for each group, we can count the number of records, as follows: product_info[, .N, by = .(type, class)] ##     type   class N ## 1: model vehicle 2 ## 2: model  people 2 ## 3:   toy vehicle 2 We can also perform the following statistical calculations for each group: product_tests[, .(mean_quality = mean(quality, na.rm = TRUE)),   by = .(waterproof)] ##    waterproof mean_quality ## 1:        yes         5.75 ## 2:         no        10.00 We can chain multiple [] in turn. In the following example, we will first join product_info and product_tests by a shared key id and then calculate the mean value of quality and durability for each group of type and class of released products. product_info[product_tests][released == TRUE,   .(mean_quality = mean(quality, na.rm = TRUE),     mean_durability = mean(durability, na.rm = TRUE)),   by = .(type, class)] ##     type   class mean_quality mean_durability ## 1: model vehicle            6             4.5 ## 2: model  people            5             NaN ## 3:   toy vehicle          NaN            10.0 Note that the values of the by columns will be unique in the resulted data.table; we can use keyby instead of by to ensure that it is automatically used as key by the resulted data.table. 
product_info[product_tests][released == TRUE,   .(mean_quality = mean(quality, na.rm = TRUE),     mean_durability = mean(durability, na.rm = TRUE)),   keyby = .(type, class)] ##     type   class mean_quality mean_durability ## 1: model  people            5             NaN ## 2: model vehicle            6             4.5 ## 3:   toy vehicle          NaN            10.0 The data.table package also provides functions to perform superfast reshaping of data. For example, we can use dcast() to spread id values along the x-axis as columns and align quality values to all possible date values along the y-axis. toy_quality <- dcast(toy_tests, date ~ id, value.var = "quality") toy_quality ##        date T01 T02 ## 1: 20160201   9   7 ## 2: 20160302  10  NA ## 3: 20160303  NA   8 ## 4: 20160403  NA   9 ## 5: 20160405   9  NA ## 6: 20160502   9  10 Although each month a test is conducted for each product, the dates may not exactly match with each other. This results in missing values if one product has a value on a day but the other has no corresponding value on exactly the same day. One way to fix this is to use year-month data instead of exact date. In the following code, we will create a new ym column that is the first 6 characters of toy_tests. For example, substr(20160101, 1, 6) will result in 201601. toy_tests[, ym := substr(toy_tests$date, 1, 6)] ##     id     date sample quality durability     ym ## 1: T01 20160201    100       9          9 201602 ## 2: T01 20160302    150      10          9 201603 ## 3: T01 20160405    180       9         10 201604 ## 4: T01 20160502    140       9          9 201605 ## 5: T02 20160201     70       7          9 201602 ## 6: T02 20160303     75       8          8 201603 ## 7: T02 20160403     90       9          8 201604 ## 8: T02 20160502     85      10          9 201605 toy_tests$ym ## [1] "201602" "201603" "201604" "201605" "201602" "201603" ## [7] "201604" "201605" This time, we will use ym for alignment instead of date: toy_quality <- dcast(toy_tests, ym ~ id, value.var = "quality") toy_quality ##        ym T01 T02 ## 1: 201602   9   7 ## 2: 201603  10   8 ## 3: 201604   9   9 ## 4: 201605   9  10 Now the missing values are gone, the quality scores of both products in each month are naturally presented. Sometimes, we will need to combine a number of columns into one that indicates the measure and another that stores the value. For example, the following code uses melt() to combine the two measures (quality and durability) of the original data into a column named measure and a column of the measured value. toy_tests2 <- melt(toy_tests, id.vars = c("id", "ym"),   measure.vars = c("quality", "durability"),   variable.name = "measure") toy_tests2 ##      id     ym    measure value ##  1: T01 201602    quality     9 ##  2: T01 201603    quality    10 ##  3: T01 201604    quality     9 ##  4: T01 201605    quality     9 ##  5: T02 201602    quality     7 ##  6: T02 201603    quality     8 ##  7: T02 201604    quality     9 ##  8: T02 201605    quality    10 ##  9: T01 201602 durability     9 ## 10: T01 201603 durability     9 ## 11: T01 201604 durability    10 ## 12: T01 201605 durability     9 ## 13: T02 201602 durability     9 ## 14: T02 201603 durability     8 ## 15: T02 201604 durability     8 ## 16: T02 201605 durability     9 The variable names are now contained in the data, which can be directly used by some packages. For example, we can use ggplot2 to plot data in such format. 
The following code is an example of a scatter plot with a facet grid of different combination of factors. library(ggplot2) ggplot(toy_tests2, aes(x = ym, y = value)) +   geom_point() +   facet_grid(id ~ measure) The graph generated is shown as follows: The plot can be easily manipulated because the grouping factor (measure) is contained as data rather than columns, which is easier to represent from the perspective of the ggplot2 package. ggplot(toy_tests2, aes(x = ym, y = value, color = id)) +   geom_point() +   facet_grid(. ~ measure) The graph generated is shown as follows: Summary In this article, we used both built-in functions and the data.table package to perform simple data manipulation tasks. Using built-in functions can be verbose while using data.table can be much easier and faster. However, the tasks in real-world data analysis can be much more complex than the examples we demonstrated, which also requires better R programming skills. It is helpful to have a good understanding on how nonstandard evaluation makes data.table so easy to work with, how environment works and scoping rules apply to make your code predictable, and so on. A universal and consistent understanding of how R basically works will certainly give you great confidence to write R code to work with data and enable you to learn packages very quickly. Resources for Article: Further resources on this subject: Supervised Machine Learning [article] Getting Started with Bootstrap [article] Basics of Classes and Objects [article]