
How-To Tutorials


Learn from Data

Packt
09 Mar 2017
6 min read
In this article by Rushdi Shams, the author of the book Java Data Science Cookbook, we will cover recipes that use machine learning techniques to learn patterns from data. These patterns are at the centre of attention for at least three key machine-learning tasks: classification, regression, and clustering. Classification is the task of predicting a value from a nominal class. In contrast to classification, regression models attempt to predict a value from a numeric class.

(For more resources related to this topic, see here.)

Generating linear regression models

Most linear regression modelling follows a general pattern: there are many independent variables that collectively produce a result, which is the dependent variable. For instance, we can generate a regression model to predict the price of a house based on different attributes/features of the house (mostly numeric, real values), such as its size in square feet, the number of bedrooms, the number of washrooms, the importance of its location, and so on. In this recipe, we will use Weka's Linear Regression classifier to generate a regression model.

Getting ready

In order to perform the recipes in this section, we will require the following:

To download Weka, go to http://www.cs.waikato.ac.nz/ml/weka/downloading.html, where you will find download options for Windows, Mac, and other operating systems such as Linux. Read through the options carefully and download the appropriate version. At the time of writing this article, 3.9.0 was the latest developer version; as the author already had a version 1.8 JVM installed on his 64-bit Windows machine, he chose to download the self-extracting executable for 64-bit Windows without a Java Virtual Machine (JVM).

After the download is complete, double-click on the executable file and follow the on-screen instructions. You need to install the full version of Weka. Once the installation is done, do not run the software. Instead, go to the directory where you installed it and find the Java archive file for Weka (weka.jar). Add this file to your Eclipse project as an external library. If you need to download older versions of Weka for some reason, all of them can be found at https://sourceforge.net/projects/weka/files/. Please note that many of the methods from old versions may be deprecated and therefore no longer supported.

How to do it…

In this recipe, the linear regression model we will create is based on the cpu.arff dataset, which can be found in the data directory of the Weka installation directory.

Our code will have two instance variables: the first will contain the data instances of the cpu.arff file and the second will be our linear regression classifier.

Instances cpu = null;
LinearRegression lReg;

Next, we will create a method to load the ARFF file and assign the last attribute of the ARFF file as its class attribute.

public void loadArff(String arffInput) {
    DataSource source = null;
    try {
        source = new DataSource(arffInput);
        cpu = source.getDataSet();
        cpu.setClassIndex(cpu.numAttributes() - 1);
    } catch (Exception e1) {
    }
}

We will then create a method to build the linear regression model. To do so, we simply need to call the buildClassifier() method of our linear regression variable. The model can be passed directly as a parameter to System.out.println().
public void buildRegression() {
    lReg = new LinearRegression();
    try {
        lReg.buildClassifier(cpu);
    } catch (Exception e) {
    }
    System.out.println(lReg);
}

The complete code for the recipe is as follows:

import weka.classifiers.functions.LinearRegression;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WekaLinearRegressionTest {
    Instances cpu = null;
    LinearRegression lReg;

    public void loadArff(String arffInput) {
        DataSource source = null;
        try {
            source = new DataSource(arffInput);
            cpu = source.getDataSet();
            cpu.setClassIndex(cpu.numAttributes() - 1);
        } catch (Exception e1) {
        }
    }

    public void buildRegression() {
        lReg = new LinearRegression();
        try {
            lReg.buildClassifier(cpu);
        } catch (Exception e) {
        }
        System.out.println(lReg);
    }

    public static void main(String[] args) throws Exception {
        WekaLinearRegressionTest test = new WekaLinearRegressionTest();
        test.loadArff("path to the cpu.arff file");
        test.buildRegression();
    }
}

The output of the code is as follows:

Linear Regression Model

class =
      0.0491 * MYCT +
      0.0152 * MMIN +
      0.0056 * MMAX +
      0.6298 * CACH +
      1.4599 * CHMAX +
     -56.075

Generating logistic regression models

Weka has a class named Logistic that can be used for building and using a multinomial logistic regression model with a ridge estimator. Although the original logistic regression algorithm does not deal with instance weights, the algorithm in Weka has been modified to handle them. In this recipe, we will use Weka to generate a logistic regression model on the iris dataset.

How to do it…

We will generate a logistic regression model from the iris dataset, which can be found in the data directory of the Weka installation folder.

Our code will have two instance variables: one will contain the data instances of the iris dataset and the other will be the logistic regression classifier.

Instances iris = null;
Logistic logReg;

We will use a method to load and read the dataset as well as assign its class attribute (the last attribute of the iris.arff file):

public void loadArff(String arffInput) {
    DataSource source = null;
    try {
        source = new DataSource(arffInput);
        iris = source.getDataSet();
        iris.setClassIndex(iris.numAttributes() - 1);
    } catch (Exception e1) {
    }
}

Next, we will create the most important method of our recipe, which builds a logistic regression classifier from the iris dataset:

public void buildRegression() {
    logReg = new Logistic();
    try {
        logReg.buildClassifier(iris);
    } catch (Exception e) {
    }
    System.out.println(logReg);
}

The complete executable code for the recipe is as follows:

import weka.classifiers.functions.Logistic;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WekaLogisticRegressionTest {
    Instances iris = null;
    Logistic logReg;

    public void loadArff(String arffInput) {
        DataSource source = null;
        try {
            source = new DataSource(arffInput);
            iris = source.getDataSet();
            iris.setClassIndex(iris.numAttributes() - 1);
        } catch (Exception e1) {
        }
    }

    public void buildRegression() {
        logReg = new Logistic();
        try {
            logReg.buildClassifier(iris);
        } catch (Exception e) {
        }
        System.out.println(logReg);
    }

    public static void main(String[] args) throws Exception {
        WekaLogisticRegressionTest test = new WekaLogisticRegressionTest();
        test.loadArff("path to the iris.arff file");
        test.buildRegression();
    }
}

The output of the code is as follows:

Logistic Regression with ridge parameter of 1.0E-8
Coefficients...
                         Class
Variable           Iris-setosa  Iris-versicolor
===============================================
sepallength            21.8065           2.4652
sepalwidth               4.5648          6.6809
petallength           -26.3083          -9.4293
petalwidth              -43.887        -18.2859
Intercept                8.1743          42.637

Odds Ratios...
                         Class
Variable           Iris-setosa  Iris-versicolor
===============================================
sepallength    2954196659.8892          11.7653
sepalwidth              96.0426        797.0304
petallength                   0           0.0001
petalwidth                    0                0

The interpretation of the results from the recipe is beyond the scope of this article. Interested readers are encouraged to see a Stack Overflow discussion here: http://stackoverflow.com/questions/19136213/how-to-interpret-weka-logistic-regression-output.

Summary

In this article, we have covered the recipes that use machine learning techniques to learn patterns from data. These patterns are at the centre of attention for at least three key machine-learning tasks: classification, regression, and clustering. Classification is the task of predicting a value from a nominal class.

Resources for Article:

Further resources on this subject:

The Data Science Venn Diagram [article]
Data Science with R [article]
Data visualization [article]
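Beyond printing the models, a reader may want a rough idea of how well they generalise. The following is a minimal, untested sketch (not part of the original recipes) that uses Weka's Evaluation class to run 10-fold cross-validation on the cpu data and the linear regression classifier built earlier; it could be called from the main method shown above, which already declares throws Exception. The fold count and random seed here are arbitrary choices.

import java.util.Random;
import weka.classifiers.Evaluation;

// assumes the cpu Instances and lReg LinearRegression fields from the recipe above
Evaluation eval = new Evaluation(cpu);
eval.crossValidateModel(lReg, cpu, 10, new Random(1));
// prints the correlation coefficient, mean absolute error, RMSE, and so on
System.out.println(eval.toSummaryString());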

Members Inheritance and Polymorphism

Packt
09 Mar 2017
16 min read
In this article by Gastón C. Hillar, the author of the book Java 9 with JShell, we will learn about one of the most exciting features of object-oriented programming in Java 9: polymorphism. We will code many classes and then we will work with their instances in JShell to understand how objects can take many different forms. We will: Create concrete classes that inherit from abstract superclasses Work with instances of subclasses Understand polymorphism Control whether subclasses can or cannot override members Control whether classes can be subclassed Use methods that perform operations with instances of different subclasses (For more resources related to this topic, see here.) Creating concrete classes that inherit from abstract superclasses We will consider the existence of an abstract base class named VirtualAnimal and the following three abstract subclasses: VirtualMammal, VirtualDomesticMammal, and VirtualHorse. Next, we will code the following three concrete classes. Each class represents a different horse breed and is a subclass of the VirtualHorse abstract class. AmericanQuarterHorse: This class represents a virtual horse that belongs to the American Quarter Horse breed. ShireHorse: This class represents a virtual horse that belongs to the Shire Horse breed. Thoroughbred: This class represents a virtual horse that belongs to the Thoroughbred breed. The three concrete classes will implement the following three abstract methods they inherited from abstract superclasses: String getAsciiArt(): This abstract method is inherited from the VirtualAnimal abstract class. String getBaby(): This abstract method is inherited from the VirtualAnimal abstract class. String getBreed(): This abstract method is inherited from the VirtualHorse abstract class. The following UML diagram shows the members for the three concrete classes that we will code: AmericanQuarterHorse, ShireHorse, and Thoroughbred. We don’t use bold text format for the three methods that each of these concrete classes will declare because they aren’t overriding the methods, they are implementing the abstract methods that the classes inherited. First, we will create the AmericanQuarterHorse concrete class. The following lines show the code for this class in Java 9. Notice that there is no abstract keyword before class, and therefore, our class must make sure that it implements all the inherited abstract methods. public class AmericanQuarterHorse extends VirtualHorse { public AmericanQuarterHorse( int age, boolean isPregnant, String name, String favoriteToy) { super(age, isPregnant, name, favoriteToy); System.out.println("AmericanQuarterHorse created."); } public AmericanQuarterHorse( int age, String name, String favoriteToy) { this(age, false, name, favoriteToy); } public String getBaby() { return "AQH baby "; } public String getBreed() { return "American Quarter Horse"; } public String getAsciiArt() { return " >>\.n" + " /* )`.n" + " // _)`^)`. _.---. _n" + " (_,' \ `^-)'' `.\n" + " | | \n" + " \ / |n" + " / \ /.___.'\ (\ (_n" + " < ,'|| \ |`. \`-'n" + " \\ () )| )/n" + " |_>|> /_] //n" + " /_] /_]n"; } } Now, we will create the ShireHorse concrete class. 
The following lines show the code for this class in Java 9: public class ShireHorse extends VirtualHorse { public ShireHorse( int age, boolean isPregnant, String name, String favoriteToy) { super(age, isPregnant, name, favoriteToy); System.out.println("ShireHorse created."); } public ShireHorse( int age, String name, String favoriteToy) { this(age, false, name, favoriteToy); } public String getBaby() { return "ShireHorse baby "; } public String getBreed() { return "Shire Horse"; } public String getAsciiArt() { return " ;;n" + " .;;'*\n" + " __ .;;' ' \n" + " /' '\.~~.~' \ /'\.)n" + " ,;( ) / |n" + " ,;' \ /-.,,( )n" + " ) /| ) /|n" + " ||(_\ ||(_\n" + " (_\ (_\n"; } } Finally, we will create the Thoroughbred concrete class. The following lines show the code for this class in Java 9: public class Thoroughbred extends VirtualHorse { public Thoroughbred( int age, boolean isPregnant, String name, String favoriteToy) { super(age, isPregnant, name, favoriteToy); System.out.println("Thoroughbred created."); } public Thoroughbred( int age, String name, String favoriteToy) { this(age, false, name, favoriteToy); } public String getBaby() { return "Thoroughbred baby "; } public String getBreed() { return "Thoroughbred"; } public String getAsciiArt() { return " })\-=--.n" + " // *._.-'n" + " _.-=-...-' /n" + " {{| , |n" + " {{\ | \ /_n" + " }} \ ,'---'\___\n" + " / )/\\ \\ >\n" + " // >\ >\`-n" + " `- `- `-n"; } } We have more than one constructor defined for the three concrete classes. The first constructor that requires four arguments uses the super keyword to call the constructor from the base class or superclass, that is, the constructor defined in the VirtualHorse class. After the constructor defined in the superclass finishes its execution, the code prints a message indicating that an instance of each specific concrete class has been created. The constructor defined in each class prints a different message. The second constructor uses the this keyword to call the previously explained constructor with the received arguments and with false as the value for the isPregnant argument. Each class returns a different String in the implementation of the getBaby and getBreed methods. In addition, each class returns a different ASCII art representation for a virtual horse in the implementation of the getAsciiArt method. Understanding polymorphism We can use the same method, that is, a method with the same name and arguments, to cause different things to happen according to the class on which we invoke the method. In object-oriented programming, this feature is known as polymorphism. Polymorphism is the ability of an object to take on many forms, and we will see it in action by working with instances of the previously coded concrete classes. The following lines create a new instance of the AmericanQuarterHorse class named american and use one of its constructors that doesn’t require the isPregnant argument: AmericanQuarterHorse american = new AmericanQuarterHorse( 8, "American", "Equi-Spirit Ball"); american.printBreed(); The following lines show the messages that the different constructors displayed in JShell after we enter the previous code: VirtualAnimal created. VirtualMammal created. VirtualDomesticMammal created. VirtualHorse created. AmericanQuarterHorse created. The constructor defined in the AmericanQuarterHorse calls the constructor from its superclass, that is, the VirtualHorse class. 
Remember that each constructor calls its superclass constructor and prints a message indicating that an instance of the class is created. We don’t have five different instances; we just have one instance that calls the chained constructors of five different classes to perform all the necessary initialization to create an instance of AmericanQuarterHorse. If we execute the following lines in JShell, all of them will display true as a result, because american belongs to the VirtualAnimal, VirtualMammal, VirtualDomesticMammal, VirtualHorse, and AmericanQuarterHorse classes. System.out.println(american instanceof VirtualAnimal); System.out.println(american instanceof VirtualMammal); System.out.println(american instanceof VirtualDomesticMammal); System.out.println(american instanceof VirtualHorse); System.out.println(american instanceof AmericanQuarterHorse); The results of the previous lines mean that the instance of the AmericanQuarterHorse class, whose reference is saved in the american variable of type AmericanQuarterHorse, can take on the form of an instance of any of the following classes: VirtualAnimal VirtualMammal VirtualDomesticMammal VirtualHorse AmericanQuarterHorse The following screenshot shows the results of executing the previous lines in JShell: We coded the printBreed method within the VirtualHorse class, and we didn’t override this method in any of the subclasses. The following is the code for the printBreed method: public void printBreed() { System.out.println(getBreed()); } The code prints the String returned by the getBreed method, declared in the same class as an abstract method. The three concrete classes that inherit from VirtualHorse implemented the getBreed method and each of them returns a different String. When we called the american.printBreed method, JShell displayed American Quarter Horse. The following lines create an instance of the ShireHorse class named zelda. Note that in this case, we use the constructor that requires the isPregnant argument. As happened when we created an instance of the AmericanQuarterHorse class, JShell will display a message for each constructor that is executed as a result of the chained constructors we coded. ShireHorse zelda = new ShireHorse(9, true, "Zelda", "Tennis Ball"); The next lines call the printAverageNumberOfBabies and printAsciiArt instance methods for american, the instance of AmericanQuarterHorse, and zelda, which is the instance of ShireHorse. american.printAverageNumberOfBabies(); american.printAsciiArt(); zelda.printAverageNumberOfBabies(); zelda.printAsciiArt(); We coded the printAverageNumberOfBabies and printAsciiArt methods in the VirtualAnimal class, and we didn’t override them in any of its subclasses. Hence, when we call these methods for either american or zelda, Java will execute the code defined in the VirtualAnimal class. The printAverageNumberOfBabies method uses the int value returned by the getAverageNumberOfBabies and the String returned by the getBaby method to generate a String that represents the average number of babies for a virtual animal. The VirtualHorse class implemented the inherited getAverageNumberOfBabies abstract method with code that returns 1. The AmericanQuarterHorse and ShireHorse classes implemented the inherited getBaby abstract method with code that returns a String that represents a baby for the virtual horse breed: "AQH baby" and "ShireHorse baby". 
Thus, our call to the printAverageNumberOfBabies method will produce different results for each instance because they belong to different classes.

The printAsciiArt method uses the String returned by the getAsciiArt method to print the ASCII art that represents a virtual horse. The AmericanQuarterHorse and ShireHorse classes implemented the inherited getAsciiArt abstract method with code that returns a String with the ASCII art that is appropriate for each virtual horse that the class represents. Thus, our call to the printAsciiArt method will also produce different results for each instance because they belong to different classes.

The following screenshot shows the results of executing the previous lines in JShell. Both instances run the same code for the two methods that were coded in the VirtualAnimal abstract class. However, each class provides a different implementation for the methods that end up being called to generate the result, which causes the differences in the output.

The following lines create an instance of the Thoroughbred class named willow, and then call its printAsciiArt method. As happened before, JShell will display a message for each constructor that is executed as a result of the chained constructors we coded.

Thoroughbred willow = new Thoroughbred(5, "Willow", "Jolly Ball");
willow.printAsciiArt();

The following screenshot shows the results of executing the previous lines in JShell. The new instance is from a class that provides a different implementation of the getAsciiArt method, and therefore, we will see different ASCII art than in the previous two calls to the same method for the other instances.

The following lines call the neigh method for the instance named willow with a different number of arguments. This way, we take advantage of the neigh method that we overloaded four times with different arguments. Remember that we coded the four neigh methods in the VirtualHorse class, and the Thoroughbred class inherits the overloaded methods from this superclass through its hierarchy tree.

willow.neigh();
willow.neigh(2);
willow.neigh(2, american);
willow.neigh(3, zelda, true);
american.nicker();
american.nicker(2);
american.nicker(2, willow);
american.nicker(3, willow, true);

The following screenshot shows the results of calling the neigh and nicker methods with the different arguments in JShell:

We called the four versions of the neigh method defined in the VirtualHorse class for the Thoroughbred instance named willow. The third and fourth lines that call the neigh method specify a value for the otherDomesticMammal argument of type VirtualDomesticMammal. The third line specifies american as the value for otherDomesticMammal and the fourth line specifies zelda as the value for the same argument. Both the AmericanQuarterHorse and ShireHorse concrete classes are subclasses of VirtualHorse, and VirtualHorse is a subclass of VirtualDomesticMammal. Hence, we can use american and zelda as arguments where a VirtualDomesticMammal instance is required.

Then, we called the four versions of the nicker method defined in the VirtualHorse class for the AmericanQuarterHorse instance named american. The third and fourth lines that call the nicker method specify willow as the value for the otherDomesticMammal argument of type VirtualDomesticMammal. The Thoroughbred concrete class is also a subclass of VirtualHorse, and VirtualHorse is a subclass of VirtualDomesticMammal. Hence, we can use willow as an argument where a VirtualDomesticMammal instance is required.
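As a small, hypothetical extension of the previous examples (these lines are not part of the book's code), we can also put the three instances into a single collection typed with their common VirtualHorse superclass and let dynamic dispatch pick each breed's own implementation. Running the following lines in JShell after the previous snippets should print three different breeds and three different pieces of ASCII art.

// american, zelda and willow were created in the previous snippets
List<VirtualHorse> stable = List.of(american, zelda, willow);
for (VirtualHorse horse : stable) {
    horse.printBreed();     // each call resolves to a different getBreed() implementation
    horse.printAsciiArt();  // and a different getAsciiArt() implementation
}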
Controlling overridability of members in subclasses We will code the VirtualDomesticCat abstract class and its concrete subclass: MaineCoon. Then, we will code the VirtualBird abstract class, its VirtualDomesticBird abstract subclass and the Cockatiel concrete subclass. Finally, we will code the VirtualDomesticRabbit concrete class. While coding these classes, we will use Java 9 features that allow us to decide whether the subclasses can or cannot override specific members. All the virtual domestic cats must be able to talk, and therefore, we will override the talk method inherited from VirtualDomesticMammal to print the word that represents a cat meowing: "Meow". We also want to provide a method to print "Meow" a specific number of times. Hence, at this point, we realize that we can take advantage of the printSoundInWords method we had declared in the VirtualHorse class. We cannot access this instance method in the VirtualDomesticCat abstract class because it doesn’t inherit from VirtualHorse. Thus, we will move this method from the VirtualHorse class to its superclass: VirtualDomesticMammal. We will use the final keyword before the return type for the methods that we don’t want to be overridden in subclasses. When a method is marked as a final method, the subclasses cannot override the method and the Java 9 compiler shows an error if they try to do so. Not all the birds are able to fly in real-life. However, all our virtual birds are able to fly, and therefore, we will implement the inherited isAbleToFly abstract method as a final method that returns true. This way, we make sure that all the classes that inherit from the VirtualBird abstract class will always run this code for the isAbleToFly method and that they won’t be able to override it. The following UML diagram shows the members for the new abstract and concrete classes that we will code. In addition, the diagram shows the printSoundInWords method moved from the VirtualHorse abstract class to the VirtualDomesticMammal abstract class. First, we will create a new version of the VirtualDomesticMammal abstract class. We will add the printSoundInWords method that we have in the VirtualHorse abstract class and we will use the final keyword to indicate that we don’t want to allow subclasses to override this method. The following lines show the new code for the VirtualDomesticMammal class. public abstract class VirtualDomesticMammal extends VirtualMammal { public final String name; public String favoriteToy; public VirtualDomesticMammal( int age, boolean isPregnant, String name, String favoriteToy) { super(age, isPregnant); this.name = name; this.favoriteToy = favoriteToy; System.out.println("VirtualDomesticMammal created."); } public VirtualDomesticMammal( int age, String name, String favoriteToy) { this(age, false, name, favoriteToy); } protected final void printSoundInWords( String soundInWords, int times, VirtualDomesticMammal otherDomesticMammal, boolean isAngry) { String message = String.format("%s%s: %s%s", name, otherDomesticMammal == null ? "" : String.format(" to %s ", otherDomesticMammal.name), isAngry ? 
"Angry " : "", new String(new char[times]).replace(" ", soundInWords)); System.out.println(message); } public void talk() { System.out.println( String.format("%s: says something", name)); } } After we enter the previous lines, JShell will display the following messages: | update replaced class VirtualHorse which cannot be referenced until this error is corrected: | printSoundInWords(java.lang.String,int,VirtualDomesticMammal,boolean) in VirtualHorse cannot override printSoundInWords(java.lang.String,int,VirtualDomesticMammal,boolean) in VirtualDomesticMammal | overridden method is final | protected void printSoundInWords(String soundInWords, int times, | ^---------------------------------------------------------------... | update replaced class AmericanQuarterHorse which cannot be referenced until class VirtualHorse is declared | update replaced class ShireHorse which cannot be referenced until class VirtualHorse is declared | update replaced class Thoroughbred which cannot be referenced until class VirtualHorse is declared | update replaced variable american which cannot be referenced until class AmericanQuarterHorse is declared | update replaced variable zelda which cannot be referenced until class ShireHorse is declared | update replaced variable willow which cannot be referenced until class Thoroughbred is declared | update overwrote class VirtualDomesticMammal JShell indicates us that the VirtualHorse class and its subclasses cannot be referenced until we correct an error for this class. The class declares the printSoundInWords method and overrides the recently added method with the same name and arguments in the VirtualDomesticMammal. We used the final keyword in the new declaration to make sure that any subclass cannot override it, and therefore, the Java compiler generates the error message that JShell displays. Now, we will create a new version of the VirtualHorse abstract class. The following lines show the new version that removes the printSoundInWords method and uses the final keyword to make sure that many methods cannot be overridden by any of the subclasses. The declarations that use the final keyword to avoid the methods to be overridden are highlighted in the next lines. 
public abstract class VirtualHorse extends VirtualDomesticMammal { public VirtualHorse( int age, boolean isPregnant, String name, String favoriteToy) { super(age, isPregnant, name, favoriteToy); System.out.println("VirtualHorse created."); } public VirtualHorse( int age, String name, String favoriteToy) { this(age, false, name, favoriteToy); } public final boolean isAbleToFly() { return false; } public final boolean isRideable() { return true; } public final boolean isHervibore() { return true; } public final boolean isCarnivore() { return false; } public int getAverageNumberOfBabies() { return 1; } public abstract String getBreed(); public final void printBreed() { System.out.println(getBreed()); } public final void printNeigh( int times, VirtualDomesticMammal otherDomesticMammal, boolean isAngry) { printSoundInWords("Neigh ", times, otherDomesticMammal, isAngry); } public final void neigh() { printNeigh(1, null, false); } public final void neigh(int times) { printNeigh(times, null, false); } public final void neigh(int times, VirtualDomesticMammal otherDomesticMammal) { printNeigh(times, otherDomesticMammal, false); } public final void neigh(int times, VirtualDomesticMammal otherDomesticMammal, boolean isAngry) { printNeigh(times, otherDomesticMammal, isAngry); } public final void printNicker(int times, VirtualDomesticMammal otherDomesticMammal, boolean isAngry) { printSoundInWords("Nicker ", times, otherDomesticMammal, isAngry); } public final void nicker() { printNicker(1, null, false); } public final void nicker(int times) { printNicker(times, null, false); } public final void nicker(int times, VirtualDomesticMammal otherDomesticMammal) { printNicker(times, otherDomesticMammal, false); } public final void nicker(int times, VirtualDomesticMammal otherDomesticMammal, boolean isAngry) { printNicker(times, otherDomesticMammal, isAngry); } @Override public final void talk() { nicker(); } } After we enter the previous lines, JShell will display the following messages: | update replaced class AmericanQuarterHorse | update replaced class ShireHorse | update replaced class Thoroughbred | update replaced variable american, reset to null | update replaced variable zelda, reset to null | update replaced variable willow, reset to null | update overwrote class VirtualHorse We could replace the definition for the VirtualHorse class and the subclasses were also updated. It is important to know that the variables we declared in JShell that held references to instances of subclasses of VirtualHorse were set to null. Summary In this article, we created many abstract and concrete classes. We learned to control whether subclasses can or cannot override members, and whether classes can be subclassed. We worked with instances of many subclasses and we understood that objects can take many forms. We worked with many instances and their methods in JShell to understand how the classes and the methods that we coded are executed. We used methods that performed operations with instances of different classes that had a common superclass. Resources for Article: Further resources on this subject: Getting Started with Sorting Algorithms in Java [article]  Introduction to JavaScript [article]  Using Spring JMX within Java Applications [article]
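Before moving on, here is a quick, hypothetical check of the rules described in this article (the MiniatureHorse class below is not part of the book's code). A new subclass of the final version of VirtualHorse still has to implement the three inherited abstract methods, but it can no longer override the methods we marked as final; uncommenting the last method would make the Java 9 compiler reject the class.

public class MiniatureHorse extends VirtualHorse {
    public MiniatureHorse(int age, String name, String favoriteToy) {
        super(age, name, favoriteToy);
    }

    public String getBaby() { return "Miniature baby "; }
    public String getBreed() { return "Miniature Horse"; }
    public String getAsciiArt() { return ""; }

    // @Override
    // public void printBreed() { }
    // error: printBreed() in MiniatureHorse cannot override printBreed() in VirtualHorse
    //        overridden method is final
}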

What is D3.js?

Packt
08 Mar 2017
13 min read
In this article by Ændrew H. Rininsland, the author of the book Learning D3.JS 4.x Data Visualization, we'll see what is new in D3 v4 and get started with Node and Git on the command line.

(For more resources related to this topic, see here.)

D3 (Data-Driven Documents), developed by Mike Bostock and the D3 community since 2011, is the successor to Bostock's earlier Protovis library. It allows pixel-perfect rendering of data by abstracting the calculation of things such as scales and axes into an easy-to-use domain-specific language (DSL), and uses idioms that should be immediately familiar to anyone with experience of using the popular jQuery JavaScript library. Much like jQuery, in D3 you operate on elements by selecting them and then manipulating them via a chain of modifier functions. Especially within the context of data visualization, this declarative approach makes using it easier and more enjoyable than a lot of other tools out there. The official website, https://d3js.org/, features many great examples that show off the power of D3, but understanding them is tricky to start with. After finishing this book, you should be able to understand D3 well enough to figure out the examples, tweaking them to fit your needs. If you want to follow the development of D3 more closely, check out the source code hosted on GitHub at https://github.com/d3.

The fine-grained control and its elegance make D3 one of the most powerful open source visualization libraries out there. This also means that it's not very suitable for simple jobs such as drawing a line chart or two; in that case you might want to use a library designed for charting. Many of them use D3 internally anyway. For a massive list, visit https://github.com/sorrycc/awesome-javascript#data-visualization.

D3 is ultimately based around functional programming principles, which are currently experiencing a renaissance in the JavaScript community. This book really isn't about functional programming, but a lot of what we'll be doing will seem really familiar if you've ever used functional programming principles before.

What happened to all the classes?!

The second edition of this book contained quite a number of examples using the class feature that is new in ES2015. The revised examples in this edition all use factory functions instead, and the class keyword never appears. Why is this, exactly? ES2015 classes are essentially just syntactic sugar for factory functions. By this I mean that they ultimately compile down to that anyway. While classes can provide a certain level of organization to a complex piece of code, they ultimately hide what is going on underneath it all. Not only that, using OO paradigms such as classes effectively avoids one of the most powerful and elegant aspects of JavaScript as a language, which is its focus on first-class functions and objects. Your code will be simpler and more elegant using functional paradigms than OO, and you'll find it less difficult to read examples in the D3 community, which almost never use classes. There are many, much more comprehensive arguments against using classes than I'm able to make here. For one of the best, please read Eric Elliott's excellent "The Two Pillars of JavaScript" pieces, at medium.com/javascript-scene/the-two-pillars-of-javascript-ee6f3281e7f3.

What's new in D3 v4?

One of the key changes to D3 since the last edition of this book is the release of version 4. Among its many changes, the most significant is a complete overhaul of the D3 namespace.
This means that none of the examples in this book will work with D3 3.x, and the examples from the last book will not work with D3 4.x. This is quite possibly the cruelest thing Mr. Bostock could ever do to educational authors such as myself (I am totally joking here). Kidding aside, it also means many of the "block" examples in the D3 community are out-of-date and may appear rather odd if this book is your first encounter with the library. For this reason, it is very important to note the version of D3 an example uses - if it uses 3.x, it might be worth searching for a 4.x example just to prevent this cognitive dissonance. Related to this is how D3 has been broken up from a single library into many smaller libraries. There are two approaches you can take: you can use D3 as a single library in much the same way as version 3, or you can selectively use individual components of D3 in your project. This book takes the latter route, even if it does take a bit more effort - the benefit is primarily in that you'll have a better idea of how D3 is organized as a library and it reduces the size of the final bundle people who view your graphics will have to download. What's ES2017? One of the main changes to this book since the first edition is the emphasis on modern JavaScript; in this case, ES2017. Formerly known as ES6 (Harmony), it pushes the JavaScript language's features forward significantly, allowing for new usage patterns that simplify code readability and increase expressiveness. If you've written JavaScript before and the examples in this article look pretty confusing, it means you're probably familiar with the older, more common ES5 syntax. But don't sweat! It really doesn't take too long to get the hang of the new syntax, and I will try to explain the new language features as we encounter them. Although it might seem a somewhat steep learning curve at the start, by the end, you'll have improved your ability to write code quite substantially and will be on the cutting edge of contemporary JavaScript development. For a really good rundown of all the new toys you have with ES2016, check out this nice guide by the folks at Babel.js, which we will use extensively throughout this book: https://babeljs.io/docs/learn-es2015/. Before I go any further, let me clear some confusion about what ES2017 actually is. Initially, the ECMAScript (or ES for short) standards were incremented by cardinal numbers, for instance, ES4, ES5, ES6, and ES7. However, with ES6, they changed this so that a new standard is released every year in order to keep pace with modern development trends, and thus we refer to the year (2017) now. The big release was ES2015, which more or less maps to ES6. ES2016 was ratified in June 2016, and builds on the previous year's standard, while adding a few fixes and two new features. ES2017 is currently in the draft stage, which means proposals for new features are being considered and developed until it is ratified sometime in 2017. As a result of this book being written while these features are in draft, they may not actually make it into ES2017 and thus need to wait until a later standard to be officially added to the language. You don't really need to worry about any of this, however, because we use Babel.js to transpile everything down to ES5 anyway, so it runs the same in Node.js and in the browser. 
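To give a flavour of the newer syntax before we meet it in real code, here is a small, self-contained contrast between ES5 style and the ES2015 style used throughout; it is only an illustrative sketch, not an example taken from the book, and it shows const, an arrow function, and a template literal.

// ES5 style
var labelsOld = [4, 8, 15].map(function (d, i) {
  return 'item ' + i + ': ' + d;
});

// ES2015+ style: const, an arrow function, and a template literal
const labelsNew = [4, 8, 15].map((d, i) => `item ${i}: ${d}`);

console.log(labelsOld); // ["item 0: 4", "item 1: 8", "item 2: 15"]
console.log(labelsNew); // same output, terser syntax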
I try to refer to the relevant spec where a feature is added when I introduce it, for the sake of accuracy (for instance, modules are an ES2015 feature), but when I refer to JavaScript, I mean all modern JavaScript, regardless of which ECMAScript spec it originated in.

Getting started with Node and Git on the command line

I will try not to be too opinionated in this book about which editor or operating system you should use to work through it (though I am using Atom on Mac OS X), but you are going to need a few prerequisites to start. The first is Node.js. Node is widely used for web development nowadays, and it's actually just JavaScript that can be run on the command line. Later on in this book, I'll show you how to write a server application in Node, but for now, let's just concentrate on getting it and npm (the brilliant and amazing package manager that Node uses) installed.

If you're on Windows or Mac OS X without Homebrew, use the installer at https://nodejs.org/en/. If you're on Mac OS X and are using Homebrew, I would recommend installing "n" instead, which allows you to easily switch between versions of Node:

$ brew install n
$ n latest

Regardless of how you do it, once you finish, verify by running the following lines:

$ node --version
$ npm --version

If this displays the versions of node and npm, it means you're good to go. I'm using 6.5.0 and 3.10.3, respectively, though yours might be slightly different; the key is making sure node is at least version 6.0.0. If it says something similar to Command not found, double-check whether you've installed everything correctly, and verify that Node.js is in your $PATH environment variable.

In the last edition of this book, we did a bunch of annoying stuff with Webpack and Babel, and it was a bit too configuration-heavy to adequately explain. This time around we're using the lovely jspm for everything, which handles all the finicky annoying stuff for us. Install it now, using npm:

$ npm install -g jspm@beta jspm-server

This installs the most up-to-date beta version of jspm and the jspm development server. We don't need Webpack this time around because Rollup (which is used to bundle D3 itself) is used to bundle our projects, and jspm handles our Babel config for us. How helpful!

Next, you'll want to clone the book's repository from GitHub. Change to your project directory and type this:

$ git clone https://github.com/aendrew/learning-d3-v4
$ cd learning-d3-v4

This will clone the development environment and all the samples into the learning-d3-v4/ directory, as well as switch you into it. Another option is to fork the repository on GitHub and then clone your fork instead of mine as was just shown. This will allow you to easily publish your work on the cloud, enabling you to more easily seek support, display finished projects on GitHub Pages, and even submit suggestions and amendments to the parent project. This will help us improve this book for future editions. To do this, fork aendrew/learning-d3-v4 by clicking the "fork" button on GitHub, and replace aendrew in the preceding code snippet with your GitHub username.

To switch between them, type the following command:

$ git checkout <folder name>

Replace <folder name> with the appropriate name of your folder. Stay at master for now though. To get back to it, type this line:

$ git stash save && git checkout master

The master branch is where you'll do a lot of your coding as you work through this book.
It includes a prebuilt config.js file (used by jspm to manage dependencies), which we'll use to aid our development over the course of this book. We still need to install our dependencies, so let's do that now:

$ npm install

All of the source code that you'll be working on is in the lib/ folder. You'll notice it contains just a main.js file; almost always, we'll be working in main.js, as index.html is just a minimal container to display our work in. This is it in its entirety, and it's the last time we'll look at any HTML in this book:

<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <title>Learning D3</title>
</head>
<body>
  <script src="jspm_packages/system.js"></script>
  <script src="config.js"></script>
  <script>
    System.import('lib/main.js');
  </script>
</body>
</html>

There's also an empty stylesheet in styles/index.css, which we'll add to in a bit. To get things rolling, start the development server by typing the following line:

$ npm start

This starts up the jspm development server, which will transform our new-fangled ES2017 JavaScript into backwards-compatible ES5 that can easily be loaded by most browsers. Instead of loading in a compiled bundle, we use SystemJS directly and load in main.js. When we're ready for production, we'll use jspm bundle to create an optimized JS payload.

Now point Chrome (or whatever, I'm not fussy - so long as it's not Internet Explorer!) to localhost:8080 and fire up the developer console (Ctrl + Shift + J for Linux and Windows, and option + command + J for Mac). You should see a blank website and a blank JavaScript console with a command prompt waiting for some code:

A quick Chrome Developer Tools primer

Chrome Developer Tools are indispensable to web development. Most modern browsers have something similar, but to keep this book shorter, we'll stick to just Chrome here for the sake of simplicity. Feel free to use a different browser. Firefox's Developer Edition is particularly nice, and - yeah yeah, I hear you guys at the back; Opera is good too!
For a good overview, visit https://developers.google.com/web/tools/chrome-devtools/debug/breakpoints/step-code?hl=en Most of what you'll do with Developer Tools, however, is look at the CSS inspector at the right-hand side of the Elements tab. It can tell you what CSS rules are impacting the styling of an element, which is very good for hunting rogue rules that are messing things up. You can also edit the CSS and immediately see the results, as follows: Summary In this article, you learned what D3 is and took a glance at the core philosophy behind how it works. You also set up your computer for prototyping of ideas and to play with visualizations. Resources for Article: Further resources on this subject: Learning D3.js Mapping [article] Integrating a D3.js visualization into a simple AngularJS application [article] Simple graphs with d3.js [article]
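If you'd like to type something into that console right away, here is a tiny, hypothetical taste of the selection-and-chain idiom described at the start of this article. It assumes D3 v4 has been loaded on the page and is available as a global d3, which is not something our project has set up yet, so treat it purely as a sketch of the style.

// Select the body, bind some data, and append one list item per datum.
// Each modifier returns a selection, so the calls chain together.
const items = d3.select('body')
  .append('ul')
  .selectAll('li')
  .data([4, 8, 15, 16, 23, 42])
  .enter()
  .append('li')
  .text(d => `value: ${d}`);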

Toy Bin

Packt
08 Mar 2017
8 min read
In this article by Steffen Damtoft Sommer and Jim Campagno, the authors of the book Swift 3 Programming for Kids, we will walk you through what an array is. Arrays are considered collection types in Swift and are very powerful.

(For more resources related to this topic, see here.)

Array

An array stores values of the same type in an ordered list. The following is an example of an array:

let list = ["Legos", "Dungeons and Dragons", "Gameboy", "Monopoly", "Rubix Cube"]

This is an array (which you can think of as a list). Arrays are ordered collections of values. We've created a constant called list of the [String] type and assigned it a value that represents our list of toys that we want to take with us. When describing arrays, you surround the type of the values being stored in the array with square brackets, [String]. Following is another array, called numbers, which contains the four values 5, 2, 9, and 22:

let numbers = [5, 2, 9, 22]

You would describe numbers as an array that contains Int values, which can be written as [Int]. We can confirm this by holding Alt and selecting the numbers constant to see what its type is in a playground file:

What if we want to go back and confirm that list is an array of String values? Let's Alt-click that constant to make sure:

Similar to how we created instances of String and Int without providing any type information, we're doing the same thing here when we create list and numbers. Both list and numbers are created taking advantage of type inference. In creating our two arrays, we weren't explicit in providing any type information; we just created the array and Swift was able to figure out the type of the array for us. If we want to, though, we can provide type information, as follows:

let colors: [String] = ["Red", "Orange", "Yellow"]

colors is a constant of the [String] type. Now that we know how to create an array in Swift, which can be compared to a list in real life, how can we actually use it? Can we access various items from the array? If so, how? Also, can we add new items to the list in case we forgot to include any items? Yes to all of these questions.

Every element (or item) in an array is indexed. What does that mean? Well, you can think of being indexed as being numbered. Except that there's one big difference between how we humans number things and how arrays number things. Humans start from 1 when they create a list (just like we did when we created our preceding list). An array starts from 0. So, the first element in an array is considered to be at index 0.

Always remember that the first item in any array begins at 0. If we want to grab the first item from an array, we do so using what is referred to as subscript syntax:

let firstItem = list[0]

That 0 enclosed in two square brackets is what is known as subscript syntax. We are looking to access a certain element in the array at a certain index. In order to do that, we need to use a subscript, including the index of the item we want within square brackets. In doing so, it will return the value at that index. The value at the index in our preceding example is Legos. The = sign is also referred to as the assignment operator. So, we are assigning the Legos value to a new constant, called firstItem. If we were to print out firstItem, Legos should print to the console:

print(firstItem) // Prints "Legos"

If we want to grab the last item in this array, how do we do it? Well, there are five items in the array, so the last item should be at index 5, right? Wrong!
What if we wrote the following code (which would be incorrect!): let lastItem = list[5] This would crash our application, which would be bad. When working with arrays, you need to ensure that you don't attempt to grab an item at a certain index which doesn't exist. There is no item in our array at index 5, which would make our application crash. When you run your app, you will receive the fatal error: Index out of range error. This is shown in the screenshot below: Let's correctly grab the last item in the array: let lastItem = list[4] print("I'm not as good as my sister, but I love solving the (lastItem)") // Prints "I'm not as good as my sister, but I love solving the Rubix Cube" Comments in code are made by writing text after //. None of this text will be considered code and will not be executed; it's a way for you to leave notes in your code. All of a sudden, you've now decided that you don't want to take the rubix cube as it's too difficult to play with. You were never able to solve it on Earth, so you start wondering why bringing it to the moon would help solve that problem. Bringing crayons is a much better idea. Let's swap out the rubix cube for crayons, but how do we do that? Using subscript syntax, we should be able to assign a new value to the array. Let's give it a shot: list[4] = "Crayons" This will not work! But why, can you take a guess? It's telling us that we cannot assign through subscript because list is a constant (we declared it using the let keyword). Ah! That's exactly how String and Int work. We decide whether or not we can change (mutate) the array based upon the let or var keyword just like every other type in Swift. Let's change the list array to a variable using the var keyword: var list = ["Legos", "Dungeons and Dragons", "Gameboy", "Monopoly", "Rubix Cube"] After doing so, we should be able to run this code without any problem: list[4] = "Crayons" If we decide to print the entire array, we will see the following print to console: ["Legos", "Dungeons and Dragons", "Gameboy", "Monopoly", "Crayons"] Note how Rubix Cube is no longer the value at index 4 (our last index); it has been changed to Crayons.  That's how we can mutate (or change) elements at certain indexes in our array. What if we want to add a new item to the array, how do we do that? We've just saw that trying to use subscript syntax with an index that doesn't exist in our array crashes our application, so we know we can't use that to add new items to our array. Apple (having created Swift) has created hundreds, if not thousands, of functions that are available in all the different types (like String, Int, and array). You can consider yourself an instance of a person (person being the name of the type). Being an instance of a person, you can run, eat, sleep, study, and exercise (among other things). These things are considered functions (or methods) that are available to you. Your pet rock doesn't have these functions available to it, why? This is because it's an instance of a rock and not an instance of a person. An instance of a rock doesn't have the same functions available to it that an instance of a person has. All that being said, an array can do things that a String and Int can't do. No, arrays can't run or eat, but they can append (or add) new items to themselves. An array can do this by calling the append(_:) method available to it. This method can be called on an instance of an array (like the preceding list) using what is known as dot syntax. 
In dot syntax, you write the name of the method immediately after the instance name, separated by a period (.), without any space: list.append("Play-Doh") Just as if we were to tell a person to run, we are telling the list to append. However, we can't just tell it to append, we have to pass an argument to the append function so that it can add it to the list. Our list array now looks like this: ["Legos", "Dungeons and Dragons", "Gameboy", "Monopoly", "Crayons", "Play-Doh"] Summary We have covered a lot of material important to understanding Swift and writing iOS apps here. Feel free to reread what you've read so far as well as write code in a playground file. Create your own arrays, add whatever items you want to it, and change values at certain indexes. Get used to the syntax of working with creating an arrays as well as appending new items. If you can feel comfortable up to this point with how arrays work, that's awesome, keep up the great work! Resources for Article:  Further resources on this subject: Introducing the Swift Programming Language [article] The Swift Programming Language [article] Functions in Swift [article]
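One last aside while you're experimenting in a playground (a short sketch that isn't from the original article): rather than hard-coding the last index, you can compute it from the array's count, or use the built-in last property. Both approaches avoid the Index out of range crash we saw earlier.

// Safer ways to reach the final element of `list`
let lastIndex = list.count - 1      // six items after appending "Play-Doh", so lastIndex is 5
print(list[lastIndex])

if let lastToy = list.last {        // `last` is nil when the array is empty
    print("The last toy is \(lastToy)")
}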

Getting Started with Kotlin

Brent Watson
08 Mar 2017
6 min read
Kotlin has been gaining more and more attention recently. With its 1.1 release, the language has proved both stable and usable. In this article we're going to cover a few different things: what Kotlin is and why it's become popular, how to get started, where it's most effective, and where to go next if you want to learn more.

What is Kotlin?

Kotlin is a programming language, much like Scala or Groovy. It is a language that targets the JVM. It is developed by JetBrains, who make the IDEs that you know and love so much. The same diligence and finesse that is put into IntelliJ, PyCharm, ReSharper, and their many other tools also shines through with Kotlin.

There are two secret ingredients that make Kotlin such a joy to use. First, since Kotlin compiles to Java bytecode, it is 100% interoperable with your existing Java code. Second, since JetBrains has control over both language and IDE, the tooling support is beyond excellent.

Here's a quick example of some interoperability between Java and Kotlin:

Person.kt

data class Person(
    val title: String?, // String type ends with "?", so it is nullable.
    val name: String,   // val's are immutable. "String" field, so non-null.
    var age: Int        // var's are mutable.
)

PersonDemo.java

Person person = new Person("Mr.", "John Doe", 23);
person.getAge();    // data classes provide getters and setters automatically.
person.toString();  // ... in addition to toString, equals, hashCode, and copy methods.

The above example shows a "data class" (think Value Object / Data Transfer Object) in Kotlin being used by a Java class. Not only does the code work seamlessly, but the JetBrains IDE also allows you to navigate, auto-complete, debug, and refactor these together without skipping a beat. Continuing with the above example, we'll show how you might filter a list of Person objects using Kotlin.

Filters.kt

fun demoFilter(people: List<Person>) : List<String> {
    return people
        .filter { it.age > 35 }
        .map { it.name }
}

FiltersApplied.java

List<String> names = new Filters().demoFilter(people);

The simple addition of higher-order functions (such as map, filter, reduce, sum, zip, and so on) in Kotlin greatly reduces the boilerplate code you usually have to write in Java programs when iterating through collections. The above filtering code in Java would require you to create a temporary list of results, iterate through people, perform an if check on age, add name to the list, and then finally return the list. The above Kotlin version is not only one line, but it can actually be reduced even further, since Kotlin supports Single Expression Functions:

fun demoFilter(people: List<Person>) = people.filter { it.age > 35 }.map { it.name } // return type and return keyword replaced with "=".

Another Kotlin feature that greatly reduces the boilerplate found in Java code comes from its more advanced type system, which treats nullable types differently from non-null types. The Java version of:

if (people != null && !people.isEmpty() && people.get(0).getAddress() != null && people.get(0).getAddress().getStreet() != null) {
    return people.get(0).getAddress().getStreet();
}

Using Kotlin's "?" operator, which checks for null before continuing to evaluate an expression, we can simplify this statement drastically:

return people?.firstOrNull()?.address?.street

Not only does this reduce the verbosity inherent in Java, but it also helps to eliminate the "billion dollar mistake" in Java: NullPointerExceptions. The ability to mark a type as either nullable or not-null (Address?
vs Address) means that the compiler can ensure null checks are properly done at compile time, not at runtime in production. These are just a couple of examples of how Kotlin helps to both reduce the number of lines in your code and also reduce the often unneeded complexity. The more you use Kotlin, the more of these idioms you will find hard to start living without.

Android

More than any other industry, Kotlin has gained the most ground with Android developers. Since Kotlin compiles to Java 6 bytecode, it can be used to build Android applications. Since the interoperability between Kotlin and Java is so simple, it can be slowly added into a larger Java Android project over time (or any Java project, for that matter). Given that Android developers do not yet have access to the nice features of Java 8, Kotlin provides these and gives Android developers many of the new language features they otherwise can only read about. The Kotlin team realized this early on and has provided many libraries and tools targeted at Android devs. Here are a couple of short examples of how Kotlin can simplify working with the Android SDK:

context.runOnUiThread { ... }
button?.setOnClickListener { Toast.makeText(...) }

Android developers will immediately understand the reduced boilerplate here. If you are not an Android developer, trust me, this is much better. If you are an Android developer, I would suggest you take a look at both Kotlin Android Extensions and Anko.

Extension Functions

The last feature of Kotlin we will look at today is one of the most useful. This is the ability to write Extension Functions. Think of these as the ability to add your own method to any existing class. Here is a quick example of extending Java's String class to add a prepend method:

fun String.prepend(str: String) = str + this

Once imported, you can use this from any Kotlin code as though it were a method on String. Consider the ability to extend any system or framework class. Suddenly all of your utility methods become extension methods, and your code starts to look intentionally designed instead of patched together. Maybe you'd like to add a dpToPx() method on your Android Context class. Or maybe you'd like to add a subscribeOnNewObserveOnMain() method on your RxJava Observable class. Well, now you can.

Next Steps

If you're interested in trying Kotlin, grab a copy of the IntelliJ IDEA IDE or Android Studio and install the Kotlin plugin to get started. There is also a very well-built online IDE maintained by JetBrains, along with a series of exercises called Kotlin Koans. These can be found at http://try.kotlinlang.org/. For more information on Kotlin, check out https://kotlinlang.org/.

About the author

Brent Watson is an Android engineer in NYC. He is a developer, entrepreneur, author, TEDx speaker, and Kotlin advocate. He can be found at http://brentwatson.com/.


Data Pipelines

Packt
03 Mar 2017
17 min read
In this article by Andrew Morgan, Antoine Amend, Matthew Hallett, and David George, the authors of the book Mastering Spark for Data Science, readers will learn how to construct a content register and use it to track all input loaded to the system, and to deliver metrics on ingestion pipelines, so that these flows can be reliably run as an automated, lights-out process. In this article we will cover the following topics:

Welcome to the GDELT Dataset
Data Pipelines
Universal Ingestion Framework
Real-time monitoring for new data
Receiving Streaming Data via Kafka
Registering new content and vaulting for tracking purposes
Visualization of content metrics in Kibana - to monitor ingestion processes & data health

(For more resources related to this topic, see here.)

Data Pipelines

Even with the most basic of analytics, we always require some data. In fact, finding the right data is probably among the hardest problems to solve in data science (but that's a whole topic for another book!). We have already seen that the way in which we obtain our data can be as simple or complicated as is needed. In practice, we can break this decision into two distinct areas: ad-hoc and scheduled. Ad-hoc data acquisition is the most common method during prototyping and small-scale analytics, as it usually doesn't require any additional software to implement - the user requires some data and simply downloads it from source as and when required. This method is often a matter of clicking on a web link and storing the data somewhere convenient, although the data may still need to be versioned and secure. Scheduled data acquisition is used in more controlled environments for large-scale and production analytics; there is also an excellent case for ingesting a dataset into a data lake for possible future use. With the Internet of Things (IoT) on the increase, huge volumes of data are being produced; in many cases, if the data is not ingested now, it is lost forever. Much of this data may not have an immediate or apparent use today, but could do in the future; so the mind-set is to gather all of the data in case it is needed and delete it later when sure it is not. It's clear we need a flexible approach to data acquisition that supports a variety of procurement options.

Universal Ingestion Framework

There are many ways to approach data acquisition, ranging from home-grown bash scripts through to high-end commercial tools. The aim of this section is to introduce a highly flexible framework that we can use for small-scale data ingest, and then grow as our requirements change - all the way through to a full corporately managed workflow if needed - that framework will be built using Apache NiFi. NiFi enables us to build large-scale integrated data pipelines that move data around the planet. In addition, it's also incredibly flexible and easy to build simple pipelines - usually quicker even than using Bash or any other traditional scripting method. If an ad-hoc approach is taken to source the same dataset on a number of occasions, then some serious thought should be given as to whether it falls into the scheduled category, or at least whether a more robust storage and versioning setup should be introduced.
We have chosen to use Apache NiFi as it offers a solution that provides the ability to create many pipelines of varied complexity that can be scaled to truly Big Data and IoT levels, and it also provides a great drag & drop interface (using what's known as flow-based programming: https://en.wikipedia.org/wiki/Flow-based_programming). With patterns, templates, and modules for workflow production, it automatically takes care of many of the complex features that traditionally plague developers, such as multi-threading, connection management, and scalable processing. For our purposes it will enable us to quickly build simple pipelines for prototyping, and scale these to full production where required. It's pretty well documented, easy to get running (https://nifi.apache.org/download.html), and it runs in a browser. We leave the installation of NiFi as an exercise for the reader - which we would encourage you to do - as we will be using it in the following section.

Introducing the GDELT News Stream

Hopefully, we have NiFi up and running now and can start to ingest some data. So let's start with some global news media data from GDELT. Here's our brief, taken from the GDELT website http://blog.gdeltproject.org/gdelt-2-0-our-global-world-in-realtime/:

"Within 15 minutes of GDELT monitoring a news report breaking anywhere in the world, it has translated it, processed it to identify all events, counts, quotes, people, organizations, locations, themes, emotions, relevant imagery, video, and embedded social media posts, placed it into global context, and made all of this available via a live open metadata firehose enabling open research on the planet itself. [As] the single largest deployment in the world of sentiment analysis, we hope that by bringing together so many emotional and thematic dimensions crossing so many languages and disciplines, and applying all of it in realtime to breaking news from across the planet, that this will spur an entirely new era in how we think about emotion and the ways in which it can help us better understand how we contextualize, interpret, respond to, and understand global events."

In order to start consuming this open data, we'll need to hook into that metadata firehose and ingest the news streams onto our platform. How do we do this? Let's start by finding out what data is available.

Discover GDELT Real-time

GDELT publishes a list of the latest files on its website - this list is updated every 15 minutes. In NiFi, we can set up a dataflow that will poll the GDELT website, source a file from this list, and save it to HDFS so we can use it later. Inside the NiFi dataflow designer, create an HTTP connector by dragging a processor onto the canvas and selecting GetHTTP. To configure this processor, you'll need to enter the URL of the file list as: http://data.gdeltproject.org/gdeltv2/lastupdate.txt and also provide a temporary filename for the file list you will download. In our example, we've used NiFi's expression language to generate a universally unique key so that files are not overwritten (UUID()). It's worth noting that with this type of processor (GetHTTP), NiFi supports a number of scheduling and timing options for the polling and retrieval. For now, we're just going to use the default options and let NiFi manage the polling intervals for us. Next, we will parse the URL of the GKG news stream so that we can fetch it in a moment; a rough Python sketch of these two steps is shown below, before we build them in NiFi.
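To make the intent of the polling and parsing steps concrete, here is an illustrative Python sketch of roughly what they do. It is not part of the NiFi flow: the local filename scheme and the exact regular expression are assumptions for illustration only, and the NiFi configuration described next is what we will actually use.

import re
import uuid
import urllib.request

LASTUPDATE_URL = "http://data.gdeltproject.org/gdeltv2/lastupdate.txt"

def poll_gdelt_file_list():
    """Fetch the latest GDELT file list and pull out the GKG stream URL."""
    with urllib.request.urlopen(LASTUPDATE_URL) as response:
        listing = response.read().decode("utf-8")
    # Save the listing under a unique name, mirroring NiFi's UUID() expression.
    local_name = "lastupdate-{}.txt".format(uuid.uuid4())
    with open(local_name, "w") as f:
        f.write(listing)
    # Extract the GKG file URL with a pattern similar to the one we configure next.
    match = re.search(r"[^ ]*gkg\.csv\.zip", listing)
    return match.group(0) if match else None

if __name__ == "__main__":
    print(poll_gdelt_file_list())

Running this by hand simply prints the URL of the latest GKG archive; the NiFi flow automates the same idea on a schedule.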
Create a Regular Expression parser by dragging a processor onto the canvas and selecting ExtractText. Now position the new processor underneath the existing one and drag a line from the top processor to the bottom one. Finish by selecting the success relationship in the connection dialog that pops up. Next, let's configure the ExtractText processor to use a regular expression that matches only the relevant text of the file list, for example: ([^ ]*gkg.csv.*) From this regular expression, NiFi will create a new property (in this case, called url) associated with the flow design, which will take on a new value as each particular instance goes through the flow. It can even be configured to support multiple threads. It's worth noting here that while this is a fairly specific example, the technique is deliberately general purpose and can be used in many situations.

Our First GDELT Feed

Now that we have the URL of the GKG feed, we fetch it by configuring an InvokeHTTP processor to use the url property we previously created as its remote endpoint, and dragging the line as before. All that remains is to decompress the zipped content with an UnpackContent processor (using the basic zip format) and save to HDFS using a PutHDFS processor.

Improving with Publish and Subscribe

So far, this flow looks very "point-to-point", meaning that if we were to introduce a new consumer of data, for example, a Spark Streaming job, the flow must be changed. If we add yet another, the flow must change again. In fact, each time we add a new consumer, the flow gets a little more complicated, particularly when all the error handling is added. This is clearly not always desirable, as introducing or removing consumers (or producers) of data might be something we want to do often, even frequently. Plus, it's also a good idea to try to keep your flows as simple and reusable as possible. Therefore, for a more flexible pattern, instead of writing directly to HDFS, we can publish to Apache Kafka. This gives us the ability to add and remove consumers at any time without changing the data ingestion pipeline. We can also still write to HDFS from Kafka if needed, possibly even by designing a separate NiFi flow, or connect directly to Kafka using Spark Streaming. To do this, we create a Kafka writer by dragging a processor onto the canvas and selecting PutKafka. We now have a simple flow that continuously polls for an available file list, routinely retrieving the latest copy of a new stream over the web as it becomes available, decompressing the content and streaming it record-by-record into Kafka, a durable, fault-tolerant, distributed message queue, for processing by Spark Streaming or storage in HDFS. And what's more, without writing a single line of bash!

Content Registry

We have seen in this article that data ingestion is an area that is often overlooked, and that its importance cannot be underestimated. At this point we have a pipeline that enables us to ingest data from a source, schedule that ingest, and direct the data to our repository of choice. But the story does not end there. Now that we have the data, we need to fulfil our data management responsibilities. Enter the content registry. We're going to build an index of metadata related to the data we have ingested.
The data itself will still be directed to storage (HDFS, in our example) but, in addition, we will store metadata about the data, so that we can track what we’ve received and understand basic information about it, such as, when we received it, where it came from, how big it is, what type it is, etc. Choices and More Choices The choice of which technology we use to store this metadata is, as we have seen, one based upon knowledge and experience. For metadata indexing, we will require at least the following attributes: Easily searchable Scalable Parallel write ability Redundancy There are many ways to meet these requirements, for example we could write the metadata to Parquet, store in HDFS and search using Spark SQL. However, here we will use Elasticsearch as it meets the requirements a little better, most notably because it facilitates low latency queries of our metadata over a REST API - very useful for creating dashboards. In fact, Elasticsearch has the advantage of integrating directly with Kibana, meaning it can quickly produce rich visualizations of our content registry. For this reason, we will proceed with Elasticsearch in mind. Going with the Flow Using our current NiFi pipeline flow, let’s fork the output from “Fetch GKG files from URL” to add an additional set of steps to allow us to capture and store this metadata in Elasticsearch. These are: Replace the flow content with our metadata model Capture the metadata Store directly in Elasticsearch Here’s what this looks like in NiFi: Metadata Model So, the first step here is to define our metadata model. And there are many areas we could consider, but let’s select a set that helps tackle a few key points from earlier discussions. This will provide a good basis upon which further data can be added in the future, if required. So, let’s keep it simple and use the following three attributes: File size Date ingested File name These will provide basic registration of received files. Next, inside the NiFi flow, we’ll need to replace the actual data content with this new metadata model. An easy way to do this, is to create a JSON template file from our model. We’ll save it to local disk and use it inside a FetchFile processor to replace the flow’s content with this skeleton object. This template will look something like: { "FileSize": SIZE, "FileName": "FILENAME", "IngestedDate": "DATE" } Note the use of placeholder names (SIZE, FILENAME, DATE) in place of the attribute values. These will be substituted, one-by-one, by a sequence of ReplaceText processors, that swap the placeholder names for an appropriate flow attribute using regular expressions provided by the NiFi Expression Language, for example DATE becomes ${now()}. The last step is to output the new metadata payload to Elasticsearch. Once again, NiFi comes ready with a processor for this; the PutElasticsearch processor. An example metadata entry in Elasticsearch: { "_index": "gkg", "_type": "files", "_id": "AVZHCvGIV6x-JwdgvCzW", "_score": 1, "source": { "FileSize": 11279827, "FileName": "20150218233000.gkg.csv.zip", "IngestedDate": "2016-08-01T17:43:00+01:00" } } Now that we have added the ability to collect and interrogate metadata, we now have access to more statistics that can be used for analysis. This includes: Time based analysis e.g. file sizes over time Loss of data, for example are there data “holes” in the timeline? If there is a particular analytic that is required, the NIFI metadata component can be adjusted to provide the relevant data points. 
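As a rough illustration of the kind of time-based analysis this metadata enables, the following Python sketch queries the index for the total bytes ingested per day. The index and field names come from the model above; the Elasticsearch endpoint, the use of the elasticsearch Python client, and the exact aggregation syntax (which varies between Elasticsearch versions) are assumptions, not part of the original pipeline.

# Illustrative sketch only; adjust the endpoint and aggregation syntax to your
# Elasticsearch version before relying on it.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # assumed endpoint

query = {
    "size": 0,
    "aggs": {
        "per_day": {
            "date_histogram": {"field": "IngestedDate", "interval": "day"},
            "aggs": {"bytes_ingested": {"sum": {"field": "FileSize"}}},
        }
    },
}

result = es.search(index="gkg", body=query)
for bucket in result["aggregations"]["per_day"]["buckets"]:
    # Days with zero or unusually small totals may indicate "holes" in the timeline.
    print(bucket["key_as_string"], bucket["bytes_ingested"]["value"])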
Indeed, an analytic could be built to look at historical data and update the index accordingly if the metadata does not exist in current data. Kibana Dashboard We have mentioned Kibana a number of times in this article, now that we have an index of metadata in Elasticsearch, we can use the tool to visualize some analytics. The purpose of this brief section is to demonstrate that we can immediately start to model and visualize our data. In this simple example we have completed the following steps: Added the Elasticsearch index for our GDELT metadata to the “Settings” tab Selected “file size” under the “Discover” tab Selected Visualize for “file size” Changed the Aggregation field to “Range” Entered values for the ranges The resultant graph displays the file size distribution: From here we are free to create new visualizations or even a fully featured dashboard that can be used to monitor the status of our file ingest. By increasing the variety of metadata written to Elasticsearch from NiFi, we can make more fields available in Kibana and even start our data science journey right here with some ingest based actionable insights. Now that we have a fully-functioning data pipeline delivering us real-time feeds of data, how do we ensure data quality of the payload we are receiving?  Let’s take a look at the options. Quality Assurance With an initial data ingestion capability implemented, and data streaming onto your platform, you will need to decide how much quality assurance is required at the front door. It’s perfectly viable to start with no initial quality controls and build them up over time (retrospectively scanning historical data as time and resources allow). However, it may be prudent to install a basic level of verification to begin with. For example, basic checks such as file integrity, parity checking, completeness, checksums, type checking, field counting, overdue files, security field pre-population, denormalization, etc. You should take care that your up-front checks do not take too long. Depending on the intensity of your examinations and the size of your data, it’s not uncommon to encounter a situation where there is not enough time to perform all processing before the next dataset arrives. You will always need to monitor your cluster resources and calculate the most efficient use of time. Here are some examples of the type of rough capacity planning calculation you can perform: Example 1: Basic Quality Checking, No Contending Users Data is ingested every 15 minutes and takes 1 minute to pull from the source Quality checking (integrity, field count, field pre-population) takes 4 minutes There are no other users on the compute cluster There are 10 minutes of resources available for other tasks. As there are no other users on the cluster, this is satisfactory - no action needs to be taken. Example 2: Advanced Quality Checking, No Contending Users Data is ingested every 15 minutes and takes 1 minute to pull from the source Quality checking (integrity, field count, field pre-population, denormalization, sub dataset building) takes 13 minutes There are no other users on the compute cluster There is only 1 minute of resource available for other tasks. 
We probably need to consider either:

Configuring a resource scheduling policy
Reducing the amount of data ingested
Reducing the amount of processing we undertake
Adding additional compute resources to the cluster

Example 3: Basic Quality Checking, 50% Utility Due to Contending Users

Data is ingested every 15 minutes and takes 1 minute to pull from the source
Quality checking (integrity, field count, field pre-population) takes 4 minutes (100% utility)
There are other users on the compute cluster
There are 6 minutes of resources available for other tasks (15 - 1 - (4 * (100 / 50))).

Since there are other users, there is a danger that, at least some of the time, we will not be able to complete our processing and a backlog of jobs will occur. When you run into timing issues, you have a number of options available to you in order to circumvent any backlog:

Negotiating sole use of the resources at certain times
Configuring a resource scheduling policy, including:
YARN Fair Scheduler: allows you to define queues with differing priorities and target your Spark jobs by setting the spark.yarn.queue property on start-up so your job always takes precedence
Dynamic Resource Allocation: allows concurrently running jobs to automatically scale to match their utilization
Spark Scheduler Pool: allows you to define queues when sharing a SparkContext using a multithreading model, and target your Spark job by setting the spark.scheduler.pool property per execution thread so your thread takes precedence
Running processing jobs overnight when the cluster is quiet

In any case, you will eventually get a good idea of how the various parts of your jobs perform and will then be in a position to calculate what changes could be made to improve efficiency. There's always the option of throwing more resources at the problem, especially when using a cloud provider, but we would certainly encourage the intelligent use of existing resources - this is far more scalable, cheaper, and builds data expertise.

Summary

In this article we walked through the full setup of an Apache NiFi GDELT ingest pipeline, complete with metadata forks and a brief introduction to visualizing the resultant data. This section is particularly important as GDELT is used extensively throughout the book, and the NiFi method is a highly effective way to source data in a scalable and modular way.

Resources for Article: Further resources on this subject: Integration with Continuous Delivery [article] Amazon Web Services [article] AWS Fundamentals [article]

The NumPy array object

Packt
03 Mar 2017
18 min read
In this article by Armando Fandango, author of the book Python Data Analysis - Second Edition, we discuss how NumPy provides a multidimensional array object called ndarray. NumPy arrays are typed arrays of fixed size. Python lists are heterogeneous, and thus elements of a list may contain any object type, while NumPy arrays are homogeneous and can contain objects of only one type. An ndarray consists of two parts, which are as follows:

The actual data that is stored in a contiguous block of memory
The metadata describing the actual data

Since the actual data is stored in a contiguous block of memory, loading a large dataset as an ndarray depends on the availability of a large enough contiguous block of memory. Most of the array methods and functions in NumPy leave the actual data unaffected and only modify the metadata. Actually, we made a one-dimensional array that held a set of numbers. The ndarray can have more than a single dimension. (For more resources related to this topic, see here.)

Advantages of NumPy arrays

The NumPy array is, in general, homogeneous (there is a particular record array type that is heterogeneous) - the items in the array have to be of the same type. The advantage is that if we know that the items in an array are of the same type, it is easy to ascertain the storage size needed for the array. NumPy arrays can execute vectorized operations, processing a complete array, in contrast to Python lists, where you usually have to loop through the list and execute the operation on each element. NumPy arrays are indexed from 0, just like lists in Python. NumPy utilizes an optimized C API to make the array operations particularly quick. We will make an array with the arange() subroutine again. You will see snippets from Jupyter Notebook sessions where NumPy is already imported with the instruction import numpy as np. Here's how to get the data type of an array:

In: a = np.arange(5)
In: a.dtype
Out: dtype('int64')

The data type of the array a is int64 (at least on my computer), but you may get int32 as the output if you are using 32-bit Python. In both cases, we are dealing with integers (64 bit or 32 bit). Besides the data type of an array, it is crucial to know its shape. A vector is commonly used in mathematics, but most of the time we need higher-dimensional objects. Let's find out the shape of the vector we produced a few minutes ago:

In: a
Out: array([0, 1, 2, 3, 4])
In: a.shape
Out: (5,)

As you can see, the vector has five components with values ranging from 0 to 4. The shape property of the array is a tuple; in this instance, a tuple of 1 element, which holds the length in each dimension.

Creating a multidimensional array

Now that we know how to create a vector, we are set to create a multidimensional NumPy array. After we produce the matrix, we will again need to show its shape, as demonstrated in the following code snippets. Create a multidimensional array as follows:

In: m = np.array([np.arange(2), np.arange(2)])
In: m
Out: array([[0, 1], [0, 1]])

We can show the array shape as follows:

In: m.shape
Out: (2, 2)

We made a 2 x 2 array with the arange() subroutine. The array() function creates an array from an object that you pass to it. The object has to be array-like, for example, a Python list. In the previous example, we passed a list of arrays. The object is the only required parameter of the array() function. NumPy functions tend to have a heap of optional arguments with predefined default options.
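To make that last point concrete, here is a small illustrative snippet (not from the original text) showing two of the optional arguments that array() accepts alongside its one required object argument; as noted above, the exact dtype reported will vary with your platform.

import numpy as np

# The object is the only required parameter; dtype and ndmin are optional.
m = np.array([np.arange(2), np.arange(2)])               # dtype inferred from the data
f = np.array([np.arange(2), np.arange(2)], dtype=float)  # force a floating-point type
d = np.array([0, 1], ndmin=2)                            # guarantee at least 2 dimensions

print(m.dtype)   # int64 on most 64-bit setups (int32 on 32-bit Python)
print(f.dtype)   # float64
print(d.shape)   # (1, 2)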
Selecting NumPy array elements

From time to time, we will wish to select a specific constituent of an array. We will take a look at how to do this, but to kick off, let's make a 2 x 2 matrix again:

In: a = np.array([[1,2],[3,4]])
In: a
Out: array([[1, 2], [3, 4]])

The matrix was made this time by giving the array() function a list of lists. We will now choose each item of the matrix one at a time, as shown in the following code snippet. Recall that the index numbers begin from 0:

In: a[0,0]
Out: 1
In: a[0,1]
Out: 2
In: a[1,0]
Out: 3
In: a[1,1]
Out: 4

As you can see, choosing elements of an array is fairly simple. For the array a, we just employ the notation a[m,n], where m and n are the indices of the item in the array.

NumPy numerical types

Python has an integer type, a float type, and a complex type; nonetheless, this is not sufficient for scientific calculations. In practice, we still demand more data types with varying precisions and, consequently, different storage sizes of the type. For this reason, NumPy has many more data types. The bulk of the NumPy mathematical types ends with a number. This number designates the count of bits related to the type. The following list (adapted from the NumPy user guide) presents an overview of NumPy numerical types, as type - description pairs:

bool - Boolean (True or False) stored as a bit
inti - Platform integer (normally either int32 or int64)
int8 - Byte (-128 to 127)
int16 - Integer (-32768 to 32767)
int32 - Integer (-2 ** 31 to 2 ** 31 - 1)
int64 - Integer (-2 ** 63 to 2 ** 63 - 1)
uint8 - Unsigned integer (0 to 255)
uint16 - Unsigned integer (0 to 65535)
uint32 - Unsigned integer (0 to 2 ** 32 - 1)
uint64 - Unsigned integer (0 to 2 ** 64 - 1)
float16 - Half precision float: sign bit, 5 bits exponent, and 10 bits mantissa
float32 - Single precision float: sign bit, 8 bits exponent, and 23 bits mantissa
float64 or float - Double precision float: sign bit, 11 bits exponent, and 52 bits mantissa
complex64 - Complex number, represented by two 32-bit floats (real and imaginary components)
complex128 or complex - Complex number, represented by two 64-bit floats (real and imaginary components)

For each data type, there exists a matching conversion function:

In: np.float64(42)
Out: 42.0
In: np.int8(42.0)
Out: 42
In: np.bool(42)
Out: True
In: np.bool(0)
Out: False
In: np.bool(42.0)
Out: True
In: np.float(True)
Out: 1.0
In: np.float(False)
Out: 0.0

Many functions have a data type argument, which is frequently optional:

In: np.arange(7, dtype=np.uint16)
Out: array([0, 1, 2, 3, 4, 5, 6], dtype=uint16)

It is important to be aware that you are not allowed to change a complex number into an integer. Attempting to do that sparks off a TypeError:

In: np.int(42.0 + 1.j)
Traceback (most recent call last):
<ipython-input-24-5c1cd108488d> in <module>()
----> 1 np.int(42.0 + 1.j)
TypeError: can't convert complex to int

The same goes for conversion of a complex number into a floating-point number. By the way, the j component is the imaginary coefficient of a complex number. Even so, you can convert a floating-point number to a complex number, for example, complex(1.0). The real and imaginary pieces of a complex number can be pulled out with the real() and imag() functions, respectively.

Data type objects

Data type objects are instances of the numpy.dtype class. Once again, arrays have a data type. To be exact, each element in a NumPy array has the same data type. The data type object can tell you the size of the data in bytes.
The size in bytes is given by the itemsize property of the dtype class : In: a.dtype.itemsize Out: 8 Character codes Character codes are included for backward compatibility with Numeric. Numeric is the predecessor of NumPy. Its use is not recommended, but the code is supplied here because it pops up in various locations. You should use the dtype object instead. The following table lists several different data types and character codes related to them: Type Character code integer i Unsigned integer u Single precision float f Double precision float d bool b complex D string S unicode U Void V Take a look at the following code to produce an array of single precision floats: In: arange(7, dtype='f') Out: array([ 0., 1., 2., 3., 4., 5., 6.], dtype=float32) Likewise, the following code creates an array of complex numbers: In: arange(7, dtype='D') In: arange(7, dtype='D') Out: array([ 0.+0.j, 1.+0.j, 2.+0.j, 3.+0.j, 4.+0.j, 5.+0.j, 6.+0.j]) The dtype constructors We have a variety of means to create data types. Take the case of floating-point data (have a look at dtypeconstructors.py in this book's code bundle): We can use the general Python float, as shown in the following lines of code: In: np.dtype(float) Out: dtype('float64') We can specify a single precision float with a character code: In: np.dtype('f') Out: dtype('float32') We can use a double precision float with a character code: In: np.dtype('d') Out: dtype('float64') We can pass the dtype constructor a two-character code. The first character stands for the type; the second character is a number specifying the number of bytes in the type (the numbers 2, 4, and 8 correspond to floats of 16, 32, and 64 bits, respectively): In: np.dtype('f8') Out: dtype('float64') A (truncated) list of all the full data type codes can be found by applying sctypeDict.keys(): In: np.sctypeDict.keys() In: np.sctypeDict.keys() Out: dict_keys(['?', 0, 'byte', 'b', 1, 'ubyte', 'B', 2, 'short', 'h', 3, 'ushort', 'H', 4, 'i', 5, 'uint', 'I', 6, 'intp', 'p', 7, 'uintp', 'P', 8, 'long', 'l', 'L', 'longlong', 'q', 9, 'ulonglong', 'Q', 10, 'half', 'e', 23, 'f', 11, 'double', 'd', 12, 'longdouble', 'g', 13, 'cfloat', 'F', 14, 'cdouble', 'D', 15, 'clongdouble', 'G', 16, 'O', 17, 'S', 18, 'unicode', 'U', 19, 'void', 'V', 20, 'M', 21, 'm', 22, 'bool8', 'Bool', 'b1', 'float16', 'Float16', 'f2', 'float32', 'Float32', 'f4', 'float64', ' Float64', 'f8', 'float128', 'Float128', 'f16', 'complex64', 'Complex32', 'c8', 'complex128', 'Complex64', 'c16', 'complex256', 'Complex128', 'c32', 'object0', 'Object0', 'bytes0', 'Bytes0', 'str0', 'Str0', 'void0', 'Void0', 'datetime64', 'Datetime64', 'M8', 'timedelta64', 'Timedelta64', 'm8', 'int64', 'uint64', 'Int64', 'UInt64', 'i8', 'u8', 'int32', 'uint32', 'Int32', 'UInt32', 'i4', 'u4', 'int16', 'uint16', 'Int16', 'UInt16', 'i2', 'u2', 'int8', 'uint8', 'Int8', 'UInt8', 'i1', 'u1', 'complex_', 'int0', 'uint0', 'single', 'csingle', 'singlecomplex', 'float_', 'intc', 'uintc', 'int_', 'longfloat', 'clongfloat', 'longcomplex', 'bool_', 'unicode_', 'object_', 'bytes_', 'str_', 'string_', 'int', 'float', 'complex', 'bool', 'object', 'str', 'bytes', 'a']) The dtype attributes The dtype class has a number of useful properties. 
For instance, we can get information about the character code of a data type through the properties of dtype: In: t = np.dtype('Float64') In: t.char Out: 'd' The type attribute corresponds to the type of object of the array elements: In: t.type Out: numpy.float64 The str attribute of dtype gives a string representation of a data type. It begins with a character representing endianness, if appropriate, then a character code, succeeded by a number corresponding to the number of bytes that each array item needs. Endianness, here, entails the way bytes are ordered inside a 32- or 64-bit word. In the big-endian order, the most significant byte is stored first, indicated by >. In the little-endian order, the least significant byte is stored first, indicated by <, as exemplified in the following lines of code: In: t.str Out: '<f8' One-dimensional slicing and indexing Slicing of one-dimensional NumPy arrays works just like the slicing of standard Python lists. Let's define an array containing the numbers 0, 1, 2, and so on up to and including 8. We can select a part of the array from indexes 3 to 7, which extracts the elements of the arrays 3 through 6: In: a = np.arange(9) In: a[3:7] Out: array([3, 4, 5, 6]) We can choose elements from indexes the 0 to 7 with an increment of 2: In: a[:7:2] Out: array([0, 2, 4, 6]) Just as in Python, we can use negative indices and reverse the array: In: a[::-1] Out: array([8, 7, 6, 5, 4, 3, 2, 1, 0]) Manipulating array shapes We have already learned about the reshape() function. Another repeating chore is the flattening of arrays. Flattening in this setting entails transforming a multidimensional array into a one-dimensional array. Let us create an array b that we shall use for practicing the further examples: In: b = np.arange(24).reshape(2,3,4) In: print(b) Out: [[[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11]], [[12, 13, 14, 15], [16, 17, 18, 19], [20, 21, 22, 23]]]) We can manipulate array shapes using the following functions: Ravel: We can accomplish this with the ravel() function as follows: In: b Out: array([[[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11]], [[12, 13, 14, 15], [16, 17, 18, 19], [20, 21, 22, 23]]]) In: b.ravel() Out: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]) Flatten: The appropriately named function, flatten(), does the same as ravel(). However, flatten() always allocates new memory, whereas ravel gives back a view of the array. This means that we can directly manipulate the array as follows: In: b.flatten() Out: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]) Setting the shape with a tuple: Besides the reshape() function, we can also define the shape straightaway with a tuple, which is exhibited as follows: In: b.shape = (6,4) In: b Out: array([[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11], [12, 13, 14, 15], [16, 17, 18, 19], [20, 21, 22, 23]]) As you can understand, the preceding code alters the array immediately. Now, we have a 6 x 4 array. Transpose: In linear algebra, it is common to transpose matrices. Transposing is a way to transform data. For a two-dimensional table, transposing means that rows become columns and columns become rows. 
We can do this too by using the following code: In: b.transpose() Out: array([[ 0, 4, 8, 12, 16, 20], [ 1, 5, 9, 13, 17, 21], [ 2, 6, 10, 14, 18, 22], [ 3, 7, 11, 15, 19, 23]]) Resize: The resize() method works just like the reshape() method, In: b.resize((2,12)) In: b Out: array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]]) Stacking arrays Arrays can be stacked horizontally, depth wise, or vertically. We can use, for this goal, the vstack(), dstack(), hstack(), column_stack(), row_stack(), and concatenate() functions. To start with, let's set up some arrays: In: a = np.arange(9).reshape(3,3) In: a Out: array([[0, 1, 2], [3, 4, 5], [6, 7, 8]]) In: b = 2 * a In: b Out: array([[ 0, 2, 4], [ 6, 8, 10], [12, 14, 16]]) As mentioned previously, we can stack arrays using the following techniques: Horizontal stacking: Beginning with horizontal stacking, we will shape a tuple of ndarrays and hand it to the hstack() function to stack the arrays. This is shown as follows: In: np.hstack((a, b)) Out: array([[ 0, 1, 2, 0, 2, 4], [ 3, 4, 5, 6, 8, 10], [ 6, 7, 8, 12, 14, 16]]) We can attain the same thing with the concatenate() function, which is shown as follows: In: np.concatenate((a, b), axis=1) Out: array([[ 0, 1, 2, 0, 2, 4], [ 3, 4, 5, 6, 8, 10], [ 6, 7, 8, 12, 14, 16]]) The following diagram depicts horizontal stacking: Vertical stacking: With vertical stacking, a tuple is formed again. This time it is given to the vstack() function to stack the arrays. This can be seen as follows: In: np.vstack((a, b)) Out: array([[ 0, 1, 2], [ 3, 4, 5], [ 6, 7, 8], [ 0, 2, 4], [ 6, 8, 10], [12, 14, 16]]) The concatenate() function gives the same outcome with the axis parameter fixed to 0. This is the default value for the axis parameter, as portrayed in the following code: In: np.concatenate((a, b), axis=0) Out: array([[ 0, 1, 2], [ 3, 4, 5], [ 6, 7, 8], [ 0, 2, 4], [ 6, 8, 10], [12, 14, 16]]) Refer to the following figure for vertical stacking: Depth stacking: To boot, there is the depth-wise stacking employing dstack() and a tuple, of course. This entails stacking a list of arrays along the third axis (depth). For example, we could stack 2D arrays of image data on top of each other as follows: In: np.dstack((a, b)) Out: array([[[ 0, 0], [ 1, 2], [ 2, 4]], [[ 3, 6], [ 4, 8], [ 5, 10]], [[ 6, 12], [ 7, 14], [ 8, 16]]]) Column stacking: The column_stack() function stacks 1D arrays column-wise. This is shown as follows: In: oned = np.arange(2) In: oned Out: array([0, 1]) In: twice_oned = 2 * oned In: twice_oned Out: array([0, 2]) In: np.column_stack((oned, twice_oned)) Out: array([[0, 0], [1, 2]]) 2D arrays are stacked the way the hstack() function stacks them, as demonstrated in the following lines of code: In: np.column_stack((a, b)) Out: array([[ 0, 1, 2, 0, 2, 4], [ 3, 4, 5, 6, 8, 10], [ 6, 7, 8, 12, 14, 16]]) In: np.column_stack((a, b)) == np.hstack((a, b)) Out: array([[ True, True, True, True, True, True], [ True, True, True, True, True, True], [ True, True, True, True, True, True]], dtype=bool) Yes, you guessed it right! We compared two arrays with the == operator. Row stacking: NumPy, naturally, also has a function that does row-wise stacking. 
It is named row_stack() and for 1D arrays, it just stacks the arrays in rows into a 2D array: In: np.row_stack((oned, twice_oned)) Out: array([[0, 1], [0, 2]]) The row_stack() function results for 2D arrays are equal to the vstack() function results: In: np.row_stack((a, b)) Out: array([[ 0, 1, 2], [ 3, 4, 5], [ 6, 7, 8], [ 0, 2, 4], [ 6, 8, 10], [12, 14, 16]]) In: np.row_stack((a,b)) == np.vstack((a, b)) Out: array([[ True, True, True], [ True, True, True], [ True, True, True], [ True, True, True], [ True, True, True], [ True, True, True]], dtype=bool) Splitting NumPy arrays Arrays can be split vertically, horizontally, or depth wise. The functions involved are hsplit(), vsplit(), dsplit(), and split(). We can split arrays either into arrays of the same shape or indicate the location after which the split should happen. Let's look at each of the functions in detail: Horizontal splitting: The following code splits a 3 x 3 array on its horizontal axis into three parts of the same size and shape (see splitting.py in this book's code bundle): In: a Out: array([[0, 1, 2], [3, 4, 5], [6, 7, 8]]) In: np.hsplit(a, 3) Out: [array([[0], [3], [6]]), array([[1], [4], [7]]), array([[2], [5], [8]])] Liken it with a call of the split() function, with an additional argument, axis=1: In: np.split(a, 3, axis=1) Out: [array([[0], [3], [6]]), array([[1], [4], [7]]), array([[2], [5], [8]])] Vertical splitting: vsplit() splits along the vertical axis: In: np.vsplit(a, 3) Out: [array([[0, 1, 2]]), array([[3, 4, 5]]), array([[6, 7, 8]])] The split() function, with axis=0, also splits along the vertical axis: In: np.split(a, 3, axis=0) Out: [array([[0, 1, 2]]), array([[3, 4, 5]]), array([[6, 7, 8]])] Depth-wise splitting: The dsplit() function, unsurprisingly, splits depth-wise. We will require an array of rank 3 to begin with: In: c = np.arange(27).reshape(3, 3, 3) In: c Out: array([[[ 0, 1, 2], [ 3, 4, 5], [ 6, 7, 8]], [[ 9, 10, 11], [12, 13, 14], [15, 16, 17]], [[18, 19, 20], [21, 22, 23], [24, 25, 26]]]) In: np.dsplit(c, 3) Out: [array([[[ 0], [ 3], [ 6]], [[ 9], [12], [15]], [[18], [21], [24]]]), array([[[ 1], [ 4], [ 7]], [[10], [13], [16]], [[19], [22], [25]]]), array([[[ 2], [ 5], [ 8]], [[11], [14], [17]], [[20], [23], [26]]])] NumPy array attributes Let's learn more about the NumPy array attributes with the help of an example. Let us create an array b that we shall use for practicing the further examples: In: b = np.arange(24).reshape(2, 12) In: b Out: array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]]) Besides the shape and dtype attributes, ndarray has a number of other properties, as shown in the following list: ndim gives the number of dimensions, as shown in the following code snippet: In: b.ndim Out: 2 size holds the count of elements. This is shown as follows: In: b.size Out: 24 itemsize returns the count of bytes for each element in the array, as shown in the following code snippet: In: b.itemsize Out: 8 If you require the full count of bytes the array needs, you can have a look at nbytes. 
This is just a product of the itemsize and size properties: In: b.nbytes Out: 192 In: b.size * b.itemsize Out: 192 The T property has the same result as the transpose() function, which is shown as follows: In: b.resize(6,4) In: b Out: array([[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11], [12, 13, 14, 15], [16, 17, 18, 19], [20, 21, 22, 23]]) In: b.T Out: array([[ 0, 4, 8, 12, 16, 20], [ 1, 5, 9, 13, 17, 21], [ 2, 6, 10, 14, 18, 22], [ 3, 7, 11, 15, 19, 23]]) If the array has a rank of less than 2, we will just get a view of the array: In: b.ndim Out: 1 In: b.T Out: array([0, 1, 2, 3, 4]) Complex numbers in NumPy are represented by j. For instance, we can produce an array with complex numbers as follows: In: b = np.array([1.j + 1, 2.j + 3]) In: b Out: array([ 1.+1.j, 3.+2.j]) The real property returns to us the real part of the array, or the array itself if it only holds real numbers: In: b.real Out: array([ 1., 3.]) The imag property holds the imaginary part of the array: In: b.imag Out: array([ 1., 2.]) If the array holds complex numbers, then the data type will automatically be complex as well: In: b.dtype Out: dtype('complex128') In: b.dtype.str Out: '<c16' The flat property gives back a numpy.flatiter object. This is the only means to get a flatiter object; we do not have access to a flatiter constructor. The flat iterator enables us to loop through an array as if it were a flat array, as shown in the following code snippet: In: b = np.arange(4).reshape(2,2) In: b Out: array([[0, 1], [2, 3]]) In: f = b.flat In: f Out: <numpy.flatiter object at 0x103013e00> In: for item in f: print(item) Out: 0 1 2 3 It is possible to straightaway obtain an element with the flatiter object: In: b.flat[2] Out: 2 Also, you can obtain multiple elements as follows: In: b.flat[[1,3]] Out: array([1, 3]) The flat property can be set. Setting the value of the flat property leads to overwriting the values of the entire array: In: b.flat = 7 In: b Out: array([[7, 7], [7, 7]]) We can also obtain selected elements as follows: In: b.flat[[1,3]] = 1 In: b Out: array([[7, 1], [7, 1]]) The next diagram illustrates various properties of ndarray: Converting arrays We can convert a NumPy array to a Python list with the tolist() function . The following is a brief explanation: Convert to a list: In: b Out: array([ 1.+1.j, 3.+2.j]) In: b.tolist() Out: [(1+1j), (3+2j)] The astype() function transforms the array to an array of the specified data type: In: b Out: array([ 1.+1.j, 3.+2.j]) In: b.astype(int) /usr/local/lib/python3.5/site-packages/ipykernel/__main__.py:1: ComplexWarning: Casting complex values to real discards the imaginary part … Out: array([1, 3]) In: b.astype('complex') Out: array([ 1.+1.j, 3.+2.j]) We are dropping off the imaginary part when casting from the complex type to int. The astype() function takes the name of a data type as a string too. The preceding code won't display a warning this time because we used the right data type. Summary In this article, we found out a heap about the NumPy basics: data types and arrays. Arrays have various properties that describe them. You learned that one of these properties is the data type, which, in NumPy, is represented by a full-fledged object. NumPy arrays can be sliced and indexed in an effective way, compared to standard Python lists. NumPy arrays have the extra ability to work with multiple dimensions. The shape of an array can be modified in multiple ways, such as stacking, resizing, reshaping, and splitting. 
Resources for Article: Further resources on this subject: Big Data Analytics [article] Python Data Science Up and Running [article] R and its Diverse Possibilities [article]


Getting Started with Salesforce Lightning Experience

Packt
02 Mar 2017
8 min read
In this article by Rakesh Gupta, author of the book Mastering Salesforce CRM Administration, we will start with an overview of the Salesforce Lightning Experience and its benefits, which takes the discussion forward to the various business use cases where it can boost the sales representatives' productivity. We will also discuss the different Sales Cloud and Service Cloud editions offered by Salesforce. (For more resources related to this topic, see here.)

Getting started with Lightning Experience

Lightning Experience is a new generation productive user interface designed to help your sales team close more deals and sell quicker and smarter. The upswing in mobile usage is influencing the way people work. Sales representatives are now using mobile to research potential customers, get the details of nearby customer offices, socially connect with their customers, and even more. That's why Salesforce synced the desktop Lightning Experience with mobile Salesforce1.

Salesforce Lightning Editions

With its Summer'16 release, Salesforce announced the Lightning Editions of Sales Cloud and Service Cloud. The Lightning Editions are a completely reimagined packaging of Sales Cloud and Service Cloud, which offer additional functionality to their customers and increased productivity with a relatively small increase in cost.

Sales Cloud Lightning Editions

Sales Cloud is a product designed to automate your sales process. By implementing this, an organization can boost its sales process. It includes Campaign, Lead, Account, Contact, Opportunity, Report, Dashboard, and many other features as well. Salesforce offers various Sales Cloud editions, and as per business needs, an organization can buy any of these different editions. Let's take a closer look at the three Sales Cloud Lightning Editions:

Lightning Professional: This edition is for small and medium enterprises (SMEs). It is designed for business needs where full-featured CRM functionality is required. It provides the CRM functionality for marketing, sales, and service automation. Professional Edition is a perfect fit for small- to mid-sized businesses. After the Summer'16 release, in this edition, you can create a limited number of processes, record types, roles, profiles, and permission sets. For each Professional Edition license, organizations have to pay USD 75 per month.

Lightning Enterprise: This edition is for businesses with large and complex business requirements. It includes all the features available in the Professional Edition, plus it provides advanced customization capabilities to automate business processes and web service API access for integration with other systems. Enterprise Edition also includes processes, workflow, approval process, profile, page layout, and custom app development. In addition, organizations also get the Salesforce Identity feature with this edition. For each Enterprise Edition license, organizations have to pay USD 150 per month.

Lightning Unlimited: This edition includes all Salesforce.com features for an entire enterprise. It provides all the features of Enterprise Edition and a new level of platform flexibility for managing and sharing all of their information on demand. The key features of Salesforce.com Unlimited Edition (in addition to Enterprise features) are premier support, full mobile access, and increased storage limits. It also includes Work.com, Service Cloud, knowledge base, live agent chat, multiple sandboxes, and unlimited custom app development.
While purchasing Salesforce.com licenses, organizations have to negotiate with Salesforce to get the maximum number of sandboxes. To know more about these license types, please visit the Salesforce website at https://www.salesforce.com/sales-cloud/pricing/. Service Cloud Lightning Editions Service Cloud helps your organization to streamline the customer service process. Users can access it anytime, anywhere, and from any device. It will help your organization to close a case faster. Service agents can connect with customers through the agent console, meaning agents can interact with customers through multiple channels. Service Cloud includes case management, computer telephony integration (CTI), Service Cloud console, knowledge base, Salesforce communities, Salesforce Private AppExchange, premier+ success plan, report, and dashboards, with many other analytics features. The various Service Cloud Lightning Editions are shown in the following image: Let’s take a closer look at the three Service Cloud Lightning Edition: Lightning Professional: This edition is for SMEs. It provides CRM functionality for customer support through various channels. It is a perfect fit for small- to mid-sized businesses. It includes features, such as case management, CTI integration, mobile access, solution management, content library, reports, and analytics, along with Sales features such as opportunity management and forecasting. After the Summer'16 release, in this edition, you can create a limited number of processes, record types, roles, profiles, and permission sets. For each Professional Edition license, organizations have to pay USD 75 per month. Lightning Enterprise: This edition is for businesses with large and complex business requirements. It includes all the features available in the Professional edition, plus it provides advanced customization capabilities to automate business processes and web service API access for integration with other systems. It also includes Service console, Service contract and entitlement management, workflow and approval process, web chat, offline access, and knowledge base. Organizations get Salesforce Identity feature with this edition. For each Enterprise Edition license, organizations have to pay USD 150 per month. Lightning Unlimited: This edition includes all Salesforce.com features for an entire enterprise. It provides all the features of Enterprise Edition and a new level of platform flexibility for managing and sharing all of their information on demand. The key features of Salesforce.com Unlimited edition (in addition to the Enterprise features) are premier support, full mobile access, unlimited custom apps, and increased storage limits. It also includes Work.com, Service Cloud, knowledge base, live agent chat, multiple sandboxes, and unlimited custom app development. While purchasing the licenses, organizations have to negotiate with Salesforce to get the maximum number of sandboxes. To know more about these license types, please visit the Salesforce website at https://www.salesforce.com/service-cloud/pricing/. Creating a Salesforce developer account To get started with the given topics in this, it is recommended to use a Salesforce developer account. Using Salesforce production instance is not essential for practicing. If you currently do not have your developer account, you can create a new Salesforce developer account. 
The Salesforce developer account is completely free and can be used to practice newly learned concepts, but you cannot use it for commercial purposes. To create a Salesforce developer account, follow these steps:

Visit the website http://developer.force.com/.
Click on the Sign Up button. It will open a sign-up page; fill it out to create an account for yourself.

Once you register for the developer account, Salesforce.com will send the login details to the e-mail ID you provided during registration. By following the instructions in the e-mail, you are ready to get started with Salesforce.

Enabling the Lightning Experience for Users

Once you are ready to roll out the Lightning Experience for your users, navigate to the Lightning Setup page, which is available in Setup, by clicking Lightning Experience. The slider button at the bottom of the Lightning Setup page enables Lightning Experience for your organization. Flip that switch, and Lightning Experience will be enabled for your Salesforce organization. The Lightning Experience is now enabled for all standard profiles by default.

Granting permission to users through Profile

Depending on the number of users for a rollout, you have to decide how to enable the Lightning Experience for them. If you are planning to do a mass rollout, it is better to update Profiles. Business scenario: Helina Jolly is working as a system administrator in Universal Container. She has received a requirement to enable Lightning Experience for a custom profile, Training User. First of all, create a custom profile for the license type, Salesforce, and give it the name Training User. To enable the Lightning Experience for a custom profile, follow these instructions: In the Lightning Experience user interface, click on the page-level action menu | ADMINISTRATION | Users | Profiles, and then select the Training User profile. Then, navigate to the System Permission section, and select the Lightning Experience User checkbox.

Granting permission to users through permission sets

If you want to enable the Lightning Experience for a small group of users, or if you are not sure whether you will keep the Lightning Experience on for a group of users, consider using permission sets. Permission sets are mainly a collection of settings and permissions that give users access to numerous tools and functions within Salesforce. By creating a permission set, you can grant the Lightning Experience User permission to the users in your organization.

Switching between Lightning Experience and Salesforce Classic

If you have enabled Lightning Experience for your users, they can use the switcher to switch back and forth between Lightning Experience and Salesforce Classic. The switcher is very smart. Every time a user switches, it remembers that user experience as their new default preference. So, if a user switches to Lightning Experience, it is now their default user experience until they switch back to Salesforce Classic. If you want to restrict your users from switching back to Salesforce Classic, you have to develop an Apex trigger or a process with flow that redirects the user to the Lightning Experience interface when the UserPreferencesLightningExperiencePreferred field on the user object is true.

Summary

In this article, we covered the overview of Salesforce Lightning Experience. We also covered the various Salesforce editions available in the market.
We also went through standard and custom objects. Resources for Article: Further resources on this subject: Configuration in Salesforce CRM [article] Salesforce CRM Functions [article] Introduction to vtiger CRM [article]

Functions with Arduino

Packt
02 Mar 2017
13 min read
In this article by Syed Omar Faruk Towaha, the author of the book Learning C for Arduino, we will learn about functions and file handling with Arduino. We learned about loops and conditions. Let’s begin out journey into Functions with Arduino. (For more resources related to this topic, see here.) Functions Do you know how to make instant coffee? Don’t worry; I know. You will need some water, instant coffee, sugar, and milk or creamer. Remember, we want to drink coffee, but we are doing something that makes coffee. This procedure can be defined as a function of coffee making. Let’s finish making coffee now. The steps can be written as follows: Boil some water. Put some coffee inside a mug. Add some sugar. Pour in boiled water. Add some milk. And finally, your coffee is ready! Let’s write a pseudo code for this function: function coffee (water, coffee, sugar, milk) { add coffee beans; add sugar; pour boiled water; add milk; return coffee; } In our pseudo code we have four items (we would call them parameters) to make coffee. We did something with our ingredients, and finally, we got our coffee, right? Now, if anybody wants to get coffee, he/she will have to do the same function (in programming, we will call it calling a function) again. Let’s move into the types of functions. Types of functions A function returns something, or a value, which is called the return value. The return values can be of several types, some of which are listed here: Integers Float Double Character Void Boolean Another term, argument, refers to something that is passed to the function and calculated or used inside the function. In our previous example, the ingredients passed to our coffee-making process can be called arguments (sugar, milk, and so on), and we finally got the coffee, which is the return value of the function. By definition, there are two types of functions. They are, a system-defined function and a user-defined function. In our Arduino code, we have often seen the following structure: void setup() { } void loop() { } setup() and loop() are also functions. The return type of these functions is void. Don’t worry, we will discuss the type of function soon. The setup() and loop() functions are system-defined functions. There are a number of system-defined functions. The user-defined functions cannot be named after them. Before going deeper into function types, let’s learn the syntax of a function. Functions can be as follows: void functionName() { //statements } Or like void functionName(arg1, arg2, arg3) { //statements } So, what’s the difference? Well, the first function has no arguments, but the second function does. There are four types of function, depending on the return type and arguments. They are as follows: A function with no arguments and no return value A function with no arguments and a return value A function with arguments and no return value A function with arguments and a return value Now, the question is, can the arguments be of any type? Yes, they can be of any type, depending on the function. They can be Boolean, integers, floats, or characters. They can be a mixture of data types too. We will look at some examples later. Now, let’s define and look at examples of the four types of function we just defined. Function with no arguments and no return value, these functions do not accept arguments. The return type of these functions is void, which means the function returns nothing. Let me clear this up. As we learned earlier, a function must be named by something. 
The naming of a function will follow the rule for the variable naming. If we have a name for a function, we need to define its type also. It’s the basic rule for defining a function. So, if we are not sure of our function’s type (what type of job it will do), then it is safe to use the void keyword in front of our function, where void means no data type, as in the following function: void myFunction(){ //statements } Inside the function, we may do all the things we need. Say we want to print I love Arduino! ten times if the function is called. So, our function must have a loop that continues for ten times and then stops. So, our function can be written as follows: void myFunction() { int i; for (i = 0; i < 10; i++) { Serial.println(“I love Arduino!“); } } The preceding function does not have a return value. But if we call the function from our main function (from the setup() function; we may also call it from the loop() function, unless we do not want an infinite loop), the function will print I love Arduino! ten times. No matter how many times we call, it will print ten times for each call. Let’s write the full code and look at the output. The full code is as follows: void myFunction() { int i; for (i = 0; i < 10; i++) { Serial.println(“I love Arduino!“); } } void setup() { Serial.begin(9600); myFunction(); // We called our function Serial.println(“................“); //This will print some dots myFunction(); // We called our function again } void loop() { // put your main code here, to run repeatedly: } In the code, we placed our function (myFunction) after the loop() function. It is a good practice to declare the custom function before the setup() loop. Inside our setup() function, we called the function, then printed a few dots, and finally, we called our function again. You can guess what will happen. Yes, I love Arduino! will be printed ten times, then a few dots will be printed, and finally, I love Arduino! will be printed ten times. Let’s look at the output on the serial monitor: Yes. Your assumption is correct! Function with no arguments and a return value In this type of function, no arguments are passed, but they return a value. You need to remember that the return value depends on the type of the function. If you declare a function as an integer function, the return value’s type will have to have be an integer also. If you declare a function as a character, the return type must be a character. This is true for all other data types as well. Let’s look at an example. We will declare an integer function, where we will define a few integers. We will add them and store them to another integer, and finally, return the addition. The function may look as follows: int addNum() { int a = 3, b = 5, c = 6, addition; addition = a + b + c; return addition; } The preceding function should return 14. Let’s store the function’s return value to another integer type of variable in the setup() function and print in on the serial monitor. The full code will be as follows: void setup() { Serial.begin(9600); int fromFunction = addNum(); // added values to an integer Serial.println(fromFunction); // printed the integer } void loop() { } int addNum() { int a = 3, b = 5, c = 6, addition; //declared some integers addition = a + b + c; // added them and stored into another integers return addition; // Returned the addition. 
} The output will look as follows: Function with arguments and no return value This type of function processes some arguments inside the function, but does not return anything directly. We can do the calculations inside the function, or print something, but there will be no return value. Say we need find out the sum of two integers. We may define a number of variables to store them, and then print the sum. But with the help of a function, we can just pass two integers through a function; then, inside the function, all we need to do is sum them and store them in another variable. Then we will print the value. Every time we call the function and pass our values through it, we will get the sum of the integers we pass. Let’s define a function that will show the sum of the two integers passed through the function. We will call the function sumOfTwo(), and since there is no return value, we will define the function as void. The function should look as follows: void sumOfTwo(int a, int b) { int sum = a + b; Serial.print(“The sum is “ ); Serial.println(sum); } Whenever we call this function with proper arguments, the function will print the sum of the number we pass through the function. Let’s look at the output first; then we will discuss the code: We pass the arguments to a function, separating them with commas. The sequence of the arguments must not be messed up while we call the function. Because the arguments of a function may be of different types, if we mess up while calling, the program may not compile and will not execute correctly: Say a function looks as follows: void myInitialAndAge(int age, char initial) { Serial.print(“My age is “); Serial.println(age); Serial.print(“And my initial is “); Serial.print(initial); } Now, we must call the function like so: myInitialAndAge(6,’T’); , where 6 is my age and T is my initial. We should not do it as follows: myInitialAndAge(‘T’, 6);. We called the function and passed two values through it (12 and 24). We got the output as The sum is 36. Isn’t it amazing? Let’s go a little bit deeper. In our function, we declared our two arguments (a and b) as integers. Inside the whole function, the values (12 and 24) we passed through the function are as follows: a = 12 and b =24; If we called the function this sumOfTwo(24, 12), the values of the variables would be as follows: a = 24 and b = 12; I hope you can now understand the sequence of arguments of a function. How about an experiment? Call the sumOfTwo() function five times in the setup() function, with different values of a and b, and compare the outputs. Function with arguments and a return value This type of function will have both the arguments and the return value. Inside the function, there will be some processing or calculations using the arguments, and later, there would be an outcome, which we want as a return value. Since this type of function will return a value, the function must have a type. Let‘s look at an example. We will write a function that will check if a number is prime or not. From your math class, you may remember that a prime number is a natural number greater than 1 that has no positive divisors other than 1 and itself. The basic logic behind checking whether a number is prime or not is to check all the numbers starting from 2 to the number before the number itself by dividing the number. Not clear? Ok, let’s check if 9 is a prime number. No, it is not a prime number. Why? Because it can be divided by 3. 
And according to the definition, the prime number cannot be divisible by any number other than 1 and the number itself. So, we will check if 9 is divisible by 2. No, it is not. Then we will divide by 3 and yes, it is divisible. So, 9 is not a prime number, according to our logic. Let’s check if 13 is a prime number. We will check if the number is divisible by 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, and 12. No, the number is not divisible by any of those numbers. We may also shorten our checking by only checking the number that is half of the number we are checking. Look at the following code: int primeChecker(int n) // n is our number which will be checked { int i; // Driver variable for the loop for (i = 2; i <= n / 2; i++) // Continued the loop till n/2 { if (n % i == 0) // If no reminder return 1; // It is not prime } return 0; // else it is prime } The code is quite simple. If the remainder is equal to zero, our number is fully divisible by a number other than 1 and itself, so it is not a prime number. If there is any remainder, the number is a prime number. Let’s write the full code and look at the output for the following numbers: 23 65 235 4,543 4,241 The full source code to check if the numbers are prime or not is as follows: void setup() { Serial.begin(9600); primeChecker(23); // We called our function passing our number to test. primeChecker(65); primeChecker(235); primeChecker(4543); primeChecker(4241); } void loop() { } int primeChecker(int n) { int i; //driver variable for the loop for (i = 2; i <= n / 2; ++i) //loop continued until the half of the numebr { if (n % i == 0) return Serial.println(“Not a prime number“); // returned the number status } return Serial.println(“A prime number“); } This is a very simple code. We just called our primeChecker() function and passed our numbers. Inside our primeChecker() function, we wrote the logic to check our number. Now let’s look at the output: From the output, we can see that, other than 23 and 4,241, none of the numbers are prime. Let’s look at an example, where we will write four functions: add(), sub(), mul(), and divi(). Into these functions, we will pass two numbers, and print the value on the serial monitor. The four functions can be defined as follows: float sum(float a, float b) { float sum = a + b; return sum; } float sub(float a, float b) { float sub = a - b; return sub; } float mul(float a, float b) { float mul = a * b; return mul; } float divi(float a, float b) { float divi = a / b; return divi; } Now write the rest of the code, which will give the following outputs: Usages of functions You may wonder which type of function we should use. The answer is simple. The usages of the functions are dependent on the operations of the programs. But whatever the function is, I would suggest to do only a single task with a function. Do not do multiple tasks inside a function. This will usually speed up your processing time and the calculation time of the code. You may also want to know why we even need to use functions. Well, there a number of uses of functions, as follows: Functions help programmers write more organized code Functions help to reduce errors by simplifying code Functions make the whole code smaller Functions create an opportunity to use the code multiple times Exercise To extend your knowledge of functions, you may want to do the following exercise: Write a program to check if a number is even or odd. (Hint: you may remember the % operator). Write a function that will find the largest number among four numbers. 
The numbers will be passed as arguments through the function. (Hint: use if-else conditions.) Suppose you work in a garment factory and they need to know the area of a piece of cloth. The area can be a float value. They will provide you with the length and height of the cloth. Now write a program that uses a function to find the area of the cloth. (Use basic calculation in the user-defined function.) Summary This article gave us an overview of how to define and use functions in Arduino programs. With this information, we can write more capable and better-organized Arduino programs. Resources for Article: Further resources on this subject: Connecting Arduino to the Web [article] Getting Started with Arduino [article] Arduino Development [article]

Preparing the Initial Two Nodes

Packt
02 Mar 2017
8 min read
In this article by Carlos R. Morrison the authors of the book Build Supercomputers with Raspberry Pi 3, we will learn following topics: Preparing master node Transferring the code Preparing slave node (For more resources related to this topic, see here.) Preparing master node During boot-up, select the US or (whatever country you reside in) keyboard option located at the middle-bottom of the screen. After boot-up completes, start with the following figure. Click on Menu, then Preferences. Select Raspberry Pi Configuration, refer the following screenshot: Menu options The System tab appears (see following screenshot). Type in a suitable Hostname for your master Pi. Auto login is already checked: System tab Go ahead and change the password to your preference. Refer the following screenshot: Change password Select the Interfaces tab (see following screenshot). Click Enable on all options, especially Secure Shell (SSH), as you will be remotely logging into the Pi2 or Pi3 from your main PC. The other options are enabled for convenience, as you may be using the Pi2 or Pi3 in other projects outside of supercomputing: Interface options The next important step is to boost the processing speed from 900 MHz to 1000 MHz or 1 GHz (this option is only available on the Pi2). Click on the Performance tab (see following screenshot) and select High (1000MHz). You are indeed building a supercomputer, so you need to muster all the available computing muscle. Leave the GPU Memory as is (default). The author used a 32-GB SD card in the master Pi2, hence the 128 Mb default setting: Performance, overclock option Changing the processor clock speed requires reboot of the Pi2. Go ahead and click the Yes button. After reboot, the next step is to update and upgrade the Pi2 or Pi3 software. Go ahead and click on the Terminal icon located at the top left of the Pi2 or Pi3 monitor. Refer the following screenshot: Terminal icon After the terminal window appears, enter sudo apt-get update at the “$” prompt. Refer the following screenshot: Terminal screen This update process takes several minutes. After update concludes, enter sudo apt-get upgrade; again, this upgrade process takes several minutes. At the completion of the upgrade, enter at the $ prompt each of the following commands: sudo apt-get install build-essential sudo apt-get install manpages-dev sudo apt-get install gfortran sudo apt-get install nfs-common sudo apt-get install nfs-kernel-server sudo apt-get install vim sudo apt-get install openmpi-bin sudo apt-get install libopenmpi-dev sudo apt-get install openmpi-doc sudo apt-get install keychain sudo apt-get install nmap These updates and installs will allow you to edit (vim), and run Fortran, C, MPI codes, and allow you to manipulate and further configure your master Pi2 or Pi3. We will now transfer codes from the main PC to the master Pi2 or Pi3. Transferring the code IP address The next step is to transfer your codes from the main PC to the master Pi2 or Pi3, after which you will again compile and run the codes, same as you did earlier. But before we proceed, you need to ascertain the IP address of the master Pi2 or Pi3. Go ahead and enter the command ifconfig, in the Pi terminal window: The author’s Pi2 IP address is 192.168.0.9 (see second line of the displayed text). The MAC address is b8:27:eb:81:e5:7d (see first line of displayed text), and the net mask address is 255.255.255.0. Write down these numbers, as they will be needed later when configuring the Pis and switch. 
Your IP address may be similar, possibly except for the last two highlighted numbers depicted previously. Return to your main PC, and list the contents of the code folder located in the Desktop directory; that is, type and enter ls -la. The list of the folder's contents is displayed. You can now Secure File Transfer Protocol (SFTP) your codes to your master Pi. Note the example shown in the following screenshot. The author's processor name is gamma in the Ubuntu/Linux environment: If you are not in the correct code folder, change directory by typing cd Desktop, followed by the path to your code files. The author's files are stored in the Book_b folder. The requisite files are highlighted in red. At the $ prompt, enter sftp pi@192.168.0.9, using, of course, your own Pi IP address following the @ character. You will be prompted for a password. Enter the password for your Pi. At the sftp> prompt, enter put MPI_08_b.c, again replacing the author's file name with your own, if so desired. Go ahead and sftp the other files also. Next, enter exit. You should now be back at your code folder. Now, here comes the fun part. You will, from here onward, communicate remotely with your Pi from your main PC, much the same way hackers do with remote computers. So, go ahead now and log into your master Pi. Enter ssh pi@192.168.0.9 at the $ prompt, and enter your password. Now do a listing of the files in the home directory; enter ls -la, and refer to the following screenshot: You should see the files (the files shown in the previous figure) in your home directory that you recently sftp'd over from your main PC. You can now roam around freely inside your Pi, and see what secret data can be stolen – never mind, I'm getting a little carried away here. Go ahead and compile the requisite files, call-procs.c, C_PI.c, and MPI_08_b.c, by entering the command mpicc [file name.c] -o [file name] -lm in each case. The extension -lm is required when compiling C files containing the math header file <math.h>. Execute the file call-procs, or the file name you created, using the command mpiexec -n 1 call-procs. After the first execution, use a different process number in each subsequent execution. Your run should approximate the ones depicted in the following screenshot. Note that, because of the multithreaded nature of the program, the processes execute in random order, and there is clearly no dependence on one another: Note that, from here onward, the run data was from a Pi2 computer. The Pi3 computer will give a faster run time (its processor clock speed is 1.2 GHz, as compared to the Pi2, which has an overclocked speed of 1 GHz). Execute the serial Pi code C_PI as depicted in the following screenshot: Execute the MPI code MPI_08_b, as depicted in the following screenshot: Note the difference in execution time between the four cores on gamma (0m59.076s) and the four cores on Mst0 (16m35.140s). Each core in gamma is running at 4 GHz, while each core in Mst0 is running at 1 GHz. Core clock speed matters. 
Follow the same procedure outlined previously for installing, updating, and upgrading the Raspbian OS on the master Pi, this time naming the slave node Slv1, or any other name you desire. Use the same password for all the Pi2s or Pi3s in the cluster, as it simplifies the configuration procedure. At the completion of the update, upgrade, and installs, acquire the IP address of the slave1 Pi, as you will be remotely logging into it. Use the command ifconfig to obtain its IP address. Return to your main PC, and use sftp to transfer the requisite files from your main computer – same as you did for the master Pi2. Compile the transferred .c files. Test or run the codes as discussed earlier. You are now ready to configure the master Pi and slave Pi for communication between themselves and the main PC. We now discuss configuring the static IP address for the Pis, and the switch they are connected to. Summary In this article we have learned how to prepare master node and slave node, and also how to transfer the code for supercomputer. Resources for Article: Further resources on this subject: Bug Tracking [article] API with MongoDB and Node.js [article] Basic Website using Node.js and MySQL database [article]

Functional Building Blocks

Packt
02 Mar 2017
3 min read
In this article by Atul Khot, author of the book, Learning Functional Data Structures and Algorithms provides refresher on some fundamentals concepts. How fast could an algorithm run? How does it fare when you have ten input elements versus a million? To answer such questions, we need to be aware of the notion of algorithmic complexity, which is expressed using the Big O notation. An O(1) algorithm is faster than O(logn), for example. What is this notation? It talks about measuring the efficiency of an algorithm, which is proportional to the number of data items, N, being processed. (For more resources related to this topic, see here.) The Big O notation In simple words, this notation is used to describe how fast an algorithm will run. It describes the growth of the algorithm's running time versus the size of input data. Here is a simple example. Consider the following Scala snippet, reversing a linked list: scala> def revList(list: List[Int]): List[Int] = list match { | case x :: xs => revList(xs) ++ List(x) | case Nil => Nil | } revList: (list: List[Int])List[Int] scala> revList(List(1,2,3,4,5,6)) res0: List[Int] = List(6, 5, 4, 3, 2, 1) A quick question for you: how many times the first case clause, namely case x :: xs => revList(xs) ++ List(x), matches a list of six elements? Note that the clause matches when the list is non-empty. When it matches, we reduce the list by one element and recursively call the method. It is easy to see the clause matches six times. As a result, the list method, ++, also gets invoked four times. The ++ method takes time and is directly proportional to the length of the list on left-hand side. Here is a plot of the number of iterations against time: To reverse a list of nine elements, we iterate over the elements 55 times (9+8+7+6+5+4+3+2+1). For a list with 20 elements, the number of iterations are 210. Here is a table showing some more example values: The number of iterations are proportional to n2/2. It gives us an idea of how the algorithm runtime grows for a given number of elements. The moment we go from 100 to 1,000 elements, the algorithm needs to do 500 times more work. Another example of a quadratic runtime algorithm is a selection sorting. It keeps finding the minimum and increases the sorted sublist. The algorithm keeps scanning the list for the next minimum and hence always has O(n2) complexity; this is because it does O(n2) comparisons. Refer to http://www.studytonight.com/data-structures/selection-sorting for more information. Binary search is a very popular search algorithm with a complexity of O(logn). The succeeding figure shows the growth table for O(logn). When the number of input elements jumps from 256 to 10,48,576, the growth is from 8 to 20. Binary search is a blazing fast algorithm as the runtime grows marginally when the number of input elements jump from a couple hundreds to hundred thousand. Refer to https://rob-bell.net/2009/06/a-beginners-guide-to-big-o-notation/ for an excellent description of the O notation. Refer to the following link that has a graphical representation of the various growth functions: https://therecyclebin.files.wordpress.com/2008/05/time-complexity.png Summary This article was a whirlwind tour of the basics. We started with a look at the Big O notation, which is used to reason about how fast an algorithm could run FP programs use collections heavily. We looked at some common collection operations and their complexities. 
Resources for Article: Further resources on this subject: Algorithm Analysis [article] Introducing Algorithm Design Paradigms [article] Getting Started with Sorting Algorithms in Java [article]

Developer Workflow

Packt
02 Mar 2017
7 min read
In this article by Chaz Chumley and William Hurley, the author of the book Mastering Drupal 8, we will to decide on a local AMP stack and the role of a Composer. (For more resources related to this topic, see here.) Deciding on a local AMP stack Any developer workflow begins with having an AMP (Apache, MySQL, PHP) stack installed and configured on a Windows, OSX, or *nix based machine. Depending on the operating system, there are a lot of different methods that one can take to setup an ideal environment. However, when it comes down to choices there are really only three: Native AMP stack: This option refers to systems that generally either come preconfigured with Apache, MySQL, and PHP or have a generally easy install path to download and configure these three requirements. There are plenty of great tutorials on how to achieve this workflow but this does require familiarity with the operating system. Packaged AMP stacks: This option refers to third-party solutions such as MAMP—https://www.mamp.info/en/, WAMP—http://www.wampserver.com/en/, or Acquia Dev Desktop—https://dev.acquia.com/downloads. These solutions come with an installer that generally works on Windows and OSX and is a self-contained AMP stack allowing for general web server development.  Out of these three only Acquia Dev Desktop is Drupal specific. Virtual machine: This option is often the best solution as it closely represents the actual development, staging, and production web servers. However, this can also be the most complex to initially setup and requires some knowledge of how to configure specific parts of the AMP stack. That being said, there are a few really well documented VM’s available that can help reduce the experience needed. Two great virtual machines to look at are Drupal VM—https://www.drupalvm.com/ and Vagrant Drupal Development (VDD)—https://www.drupal.org/project/vdd. In the end, my recommendation is to choose an environment that is flexible enough to quickly install, setup, and configure Drupal instances.  The above choices are all good to start with, and by no means is any single solution a bad choice. If you are a single person developer, then a packaged AMP stack such as MAMP may be the perfect choice. However, if you are in a team environment I would strongly recommend one of the VM options above or look into creating your own VM environment that can be distributed to your team. We will discuss virtualized environments in more detail, but before we do, we need to have a basic understanding of how to work with three very important command line interfaces. Composer, Drush, and Drupal Console. The role of Composer Drupal 8 and each minor version introduces new features and functionality. Everything from moving the most commonly used 3rd party modules into its core to the introduction of an object oriented PHP framework. These improvements also introduced the Symfony framework which brings along the ability to use a dependency management tool called Composer. Composer (https://getcomposer.org/) is a dependency manager for PHP that allows us to perform a multitude of tasks. Everything from creating a Drupal project to declaring libraries and even installing contributed modules just to name a few.  The advantage to using Composer is that it allows us to quickly install and update dependencies by simply running a few commands. These configurations are then stored within a composer.json file that can be shared with other developers to quickly setup identical Drupal instances. 
If you are new to Composer then let’s take a moment to discuss how to go about installing Composer for the first time within a local environment. Installing Composer locally Composer can be installed on Windows, Linux, Unix, and OSX. For this example, we will be following the install found at https://getcomposer.org/download/. Make sure to take a look at the Getting Started documentation that corresponds with your operating system. Begin by opening a new terminal window. By default, our terminal window should place us in the user directory. We can then continue by executing the following four commands: Download Composer installer to local directory: php -r "copy('https://getcomposer.org/installer', 'composer-setup.php');" Verify the installer php -r "if (hash_file('SHA384', 'composer-setup.php') === 'e115a8dc7871f15d853148a7fbac7da27d6c0030b848d9b3dc09e2a0388afed865e6a3d6b3c0fad45c48e2b5fc1196ae') { echo 'Installer verified'; } else { echo 'Installer corrupt'; unlink('composer-setup.php'); } echo PHP_EOL;" Since Composer versions are often updated it is important to refer back to these directions to ensure the hash file above is the most current one. Run the installer: php composer-setup.php Remove the installer: php -r "unlink('composer-setup.php');" Composer is now installed locally and we can verify this by executing the following command within a terminal window: php composer.phar Composer should now present us with a list of available commands: The challenge with having Composer installed locally is that it restricts us from using it outside the current user directory. In most cases, we will be creating projects outside of our user directory, so having the ability to globally use Composer quickly becomes a necessity. Installing Composer globally Moving the composer.phar file from its current location to a global directory can be achieved by executing the following command within a terminal window: mv composer.phar /usr/local/bin/composer We can now execute Composer commands globally by typing composer in the terminal window. Using Composer to create a Drupal project One of the most common uses for Composer is the ability to create a PHP project. The create-project command takes several arguments including the type of PHP project we want to build, the location of where we want to install the project, and optionally, the package version. Using this command, we no longer need to manually download Drupal and extract the contents into an install directory. We can speed up the entire process by using one simple command. Begin by opening a terminal window and navigating to a folder where we want to install Drupal. Next we can use Composer to execute the following command: composer create-project drupal/drupal mastering The create-project command tells Composer that we want to create a new Drupal project within a folder called mastering. Once the command is executed, Composer locates the current version of Drupal and installs the project along with any additional dependencies that it needs During the install process Composer will prompt us to remove any existing version history that the Drupal repository stores.  It is generally a good idea to choose yes to remove these files as we will be wanting to manage our own repository with our own version history. Once Composer has completed creating our Drupal project, we can navigate to the mastering folder and review the composer.json file the Composer creates to store project specific configurations. 
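To give a sense of what you will find there, the following is an illustrative sketch of the kind of entries a Drupal 8 project's composer.json contains. The exact package names and version constraints differ between Drupal releases, so treat the values below as placeholders rather than the file Composer will actually generate:

{
    "name": "drupal/drupal",
    "description": "Drupal is an open source content management platform.",
    "type": "project",
    "require": {
        "composer/installers": "^1.0",
        "drupal/core": "^8.2"
    },
    "require-dev": {
        "phpunit/phpunit": ">=4.8"
    },
    "minimum-stability": "dev",
    "prefer-stable": true
}

Composer reads the require section to work out which packages (and which versions) to download into the project, while require-dev lists packages that are needed only during development and testing.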
As soon as the composer.json file is created our Drupal project can be referred to as a package. We can version the file, distribute it to a team, and they can run composer install to generate an identical Drupal 8 code base. With Composer installed globally we can take a look at another command line tool that will assist us with making Drupal development much easier. Summary In this article, you learned about how to decide on a local AMP stack and how to install a composer both locally and globally. Also we saw a bit about how to use Composer to create a Drupal project. Resources for Article: Further resources on this subject: Drupal 6 Performance Optimization Using Throttle and Devel Module [article] Product Cross-selling and Layout using Panels with Drupal and Ubercart 2.x [article] Setting up an online shopping cart with Drupal and Ubercart [article]

Lightweight messaging with MQTT 3.1.1 and Mosquitto

Packt
01 Mar 2017
10 min read
In this article by Gastón C. Hillar, author of the book, MQTT Essentials, we will start our journey towards the usage of the preferred IoT publish-subscribe lightweight messaging protocol in diverse IoT solutions combined with mobile apps and Web applications. We will learn how MQTT and its lightweight messaging system work. We will learn MQTT basics, the specific vocabulary for MQTT, and its working modes. We will use different utilities and diagrams to understand the most important concepts related to MQTT. We will: Understand convenient scenarios for the MQTT protocol Work with the publish-subscribe pattern Work with message filtering (For more resources related to this topic, see here.) Understanding convenient scenarios for the MQTT protocol Imagine that we have dozens of different devices that must exchange data between them. These devices have to request data to other devices and the devices that receive the requests must respond with the demanded data. The devices that requested the data must process the data received from the devices that responded with the demanded data. The devices are IoT (Internet of Things) boards that have dozens of sensors wired to them. We have the following IoT boards with different processing powers: Raspberry Pi 3, Raspberry Pi Model B, Intel Edison, and Intel Joule 570x. Each of these boards has to be able to send and receive data. In addition, we want many mobile devices to be able to send and receive data, some of them running iOS and others Android. We have to work with many programming languages. We want to send and receive data in near real time through the Internet and we might face some network problems, that is, our wireless networks are somewhat unreliable and we have some high-latency environments. Some devices have low power, many of them are powered by batteries and their resources are scarce. In addition, we must be careful with the network bandwidth usage because some of the devices use metered connections. A metered connection is a network connection in which we have a limited amount of data usage per month. If we go over this amount of data, we get billed extra charges. We can use HTTP requests and a build a publish-subscribe model to exchange data between different devices. However, there is a protocol that has been specifically designed to be lighter than the HTTP 1.1 protocol and work better when unreliable networks are involved and connectivity is intermittent. The MQTT (short for MQ Telemetry Transport) is better suited for this scenario in which many devices have to exchange data between themselves in near real time through the Internet and we need to consume the least possible network bandwidth. The MQTT protocol is an M2M (Machine-to-Machine) and IoT connectivity protocol. MQTT is a lightweight messaging protocol that works with a broker-based publish-subscribe mechanism and runs on top of TCP/IP (Transmission Control Protocol/Internet Protocol). The following diagram shows the MQTT protocol on top of the TCP/IP stack: The most popular versions of MQTT are 3.1 and 3.1.1. In this article, we will work with MQTT 3.1.1. Whenever we reference MQTT, we are talking about MQTT 3.1.1, that is, the newest version of the protocol. The MQTT 3.1.1 specification has been standardised by the OASIS consortium. In addition, MQTT 3.1.1 became an ISO standard (ISO/IEC 20922) in 2016. 
MQTT is lighter than the HTTP 1.1 protocol, and therefore, it is a very interesting option whenever we have to send and receive data in near real time with a publish-subscribe model while requiring the lowest possible footprint. MQTT is very popular in IoT, M2M, and embedded projects, but it is also gaining presence in web applications and mobile apps that require assured messaging and an efficient message distribution. As a summary, MQTT is suitable for the following application domains in which data exchange is required: Asset tracking and management Automotive telematics Chemical detection Environment and traffic monitoring Field force automation Fire and gas testing Home automation IVI (In-Vehicle Infotainment) Medical POS (Point of Sale) kiosks Railway RFID (Radio-Frequency Identification) SCADA (Supervisory Control and Data Acquisition) Slot machines As a summary, MQTT was designed to be suitable to support the following typical challenges in IoT, M2M, embedded, and mobile applications: Be lightweight to make it possible to transmit high volumes of data without huge overheads Distribute minimal packets of data in huge volumes Support an event-oriented paradigm with asynchronous bidirectional low latency push delivery of messages Easily emit data from one client to many clients Make it possible to listen for events whenever they happen (event-oriented architecture) Support always-connected and sometimes-connected models Publish information over unreliable networks and provide reliable deliveries over fragile connections Work very well with battery-powered devices or require low power consumption Provide responsiveness to make it possible to achieve near real-time delivery of information Offer security and privacy for all the data Be able to provide the necessary scalability to distribute data to hundreds of thousands of clients Working with the publish-subscribe pattern Before we dive deep into MQTT, we must understand the publish-subscribe pattern, also known as the pub-sub pattern. In the publish-subscribe pattern, a client that publishes a message is decoupled from the other client or clients that receive the message. The clients don’t know about the existence of the other clients. A client can publish messages of a specific type and only the clients that are interested in specific types of messages will receive the published messages. The publish-subscribe pattern requires a broker, also known as server. All the clients establish a connection with the broker. The client that sends a message through the broker is known as the publisher. The broker filters the incoming messages and distributes them to the clients that are interested in the type of received messages. The clients that register to the broker as interested in specific types of messages are known as subscribers. Hence, both publishers and subscribers establish a connection with the broker. It is easy to understand how things work with a simple diagram. The following diagram shows one publisher and two subscribers connected to a broker: A Raspberry Pi 3 board with an altitude sensor wired to it is a publisher that establishes a connection with the broker. An iOS smartphone and an Android tablet are two subscribers that establish a connection with the broker. The iOS smartphone indicates the broker that it wants to subscribe to all the messages that belong to the sensor1/altitude topic. The Android tablet indicates the same to the broker. 
Hence, both the iOS smartphone and the Android tablet are subscribed to the sensor1/altitude topic. A topic is a named logical channel and it is also referred to as a channel or subject. The broker will send publishers only the messages published to topics to which they are subscribed. The Raspberry Pi 3 board publishes a message with 100 feet as the payload and sensor1/altitude as the topic. The board, that is, the publisher, sends the publish request to the broker. The data for a message is known as payload. A message includes the topic to which it belongs and the payload. The broker distributes the message to the two clients that are subscribed to the sensor1/altitude topic: the iOS smartphone and the Android tablet. Publishers and subscribers are decoupled in space because they don’t know each other. Publishers and subscribers don’t have to run at the same time. The publisher can publish a message and the subscriber can receive it later. In addition, the publish operation isn’t synchronized with the receive operation. A publisher requests the broker to publish a message and the different clients that have subscribed to the appropriate topic can receive the message at different times. The publisher can perform asynchronous requests to avoid being blocked until the clients receive the messages. However, it is also possible to perform a synchronous request to the broker and to continue the execution only after the request was successful. In most cases, we will want to take advantage of asynchronous requests. A publisher that requires sending a message to hundreds of clients can do it with a single publish operation to a broker. The broker is responsible of sending the published message to all the clients that have subscribed to the appropriate topic. Because publishers and subscribers are decoupled, the publisher doesn’t know whether there is any subscriber that is going to listen to the messages it is going to send. Hence, sometimes it is necessary to make the subscriber become a publisher too and to publish a message indicating that it has received and processed a message. The specific requirements depend on the kind of solution we are building. MQTT offers many features that make our lives easier in many of the scenarios we have been analyzing. Working with message filtering The broker has to make sure that subscribers only receive the messages they are interested in. It is possible to filter messages based on different criteria in a publish-subscribe pattern. We will focus on analyzing topic-based filtering, also known as subject-based filtering. Consider that each message belongs to a topic. When a publisher requests the broker to publish a message, it must specify both the topic and the message. The broker receives the message and delivers it to all the subscribers that have subscribed to the topic to which the message belongs. The broker doesn’t need to check the payload for the message to deliver it to the corresponding subscribers, it just needs to check the topic for each message that has arrived and needs to be filtered before publishing it to the corresponding subscribers. A subscriber can subscribe to more than one topic. In this case, the broker has to make sure that the subscriber receives the messages that belong to all the topics to which it has subscribed. It is easy to understand how things work with another simple diagram. 
The following diagram shows two future publishers that haven’t published any message yet, a broker and two subscribers connected to the broker: A Raspberry Pi 3 board with an altitude sensor wired to it and an Intel Edison board with a temperature sensor wired to it will be two publishers. An iOS smartphone and an Android tablet are two subscribers that establish a connection to the broker. The iOS smartphone indicates the broker that it wants to subscribe to all the messages that belong to the sensor1/altitude topic. The Android tablet indicates the broker that it wants to subscribe to all the messages that belong to any of the following two topics: sensor1/altitude and sensor42/temperature. Hence, the Android tablet is subscribed to two topics while the iOS smartphone is subscribed to just one topic. The following diagram shows what happens after the two publishers connect and publish messages to different topics through the broker: The Raspberry Pi 3 board publishes a message with 120 feet as the payload and sensor1/altitude as the topic. The board, that is, the publisher, sends the publish request to the broker. The broker distributes the message to the two clients that are subscribed to the sensor1/altitude topic: the iOS smartphone and the Android tablet. The Intel Edison board publishes a message with 75 F as the payload and sensor42/temperature as the topic. The board, that is, the publisher, sends the publish request to the broker. The broker distributes the message to the only client that is subscribed to the sensor42/temperature topic: the Android Tablet. Thus, the Android tablet receives two messages. Summary In this article, we started our journey towards the MQTT protocol. We understood convenient scenarios for this protocol, the details of the publish-subscribe pattern and message filtering. We learned basic concepts related to MQTT and understood the different components: clients, brokers, and connections. Resources for Article: Further resources on this subject: All About the Protocol [article] Analyzing Transport Layer Protocols [article] IoT and Decision Science [article]

Understanding Spark RDD

Packt
01 Mar 2017
17 min read
In this article by Asif Abbasi author of the book Learning Apache Spark 2.0, we will understand Spark RDD along with that we will learn, how to construct RDDs, Operations on RDDs, Passing functions to Spark in Scala, Java, and Python and Transformations such as map, filter, flatMap, and sample. (For more resources related to this topic, see here.) What is an RDD? What’s in a name might be true for a rose, but perhaps not for an Resilient Distributed Datasets (RDD), and in essence describes what an RDD is. They are basically datasets, which are distributed across a cluster (remember Spark framework is inherently based on an MPP architecture), and provide resilience (automatic failover) by nature. Before we go into any further detail, let’s try to understand this a little bit, and again we are trying to be as abstract as possible. Let us assume that you have a sensor data from aircraft sensors and you want to analyze the data irrespective of its size and locality. For example, an Airbus A350 has roughly 6000 sensors across the entire plane and generates 2.5 TB data per day, while the newer model expected to launch in 2020 will generate roughly 7.5 TB per day. From a data engineering point of view, it might be important to understand the data pipeline, but from an analyst and a data scientist point of view, your major concern is to analyze the data irrespective of the size and number of nodes across which it has been stored. This is where the neatness of the RDD concept comes into play, where the sensor data can be encapsulated as an RDD concept, and any transformation/action that you perform on the RDD applies across the entire dataset. Six month's worth of dataset for an A350 would be approximately 450 TBs of data, and would need to sit across multiple machines. For the sake of discussion, we assume that you are working on a cluster of four worker machines. Your data would be partitioned across the workers as follows: Figure 2-1: RDD split across a cluster The figure basically explains that an RDD is a distributed collection of the data, and the framework distributes the data across the cluster. Data distribution across a set of machines brings its own set of nuisances including recovering from node failures. RDD’s are resilient as they can be recomputed from the RDD lineage graph, which is basically a graph of the entire parent RDDs of the RDD. In addition to the resilience, distribution, and representing a data set, an RDD has various other distinguishing qualities: In Memory: An RDD is a memory resident collection of objects. We’ll look at options where an RDD can be stored in memory, on disk, or both. However, the execution speed of Spark stems from the fact that the data is in memory, and is not fetched from disk for each operation. Partitioned: A partition is a division of a logical dataset or constituent elements into independent parts. Partitioning is a defacto performance optimization technique in distributed systems to achieve minimal network traffic, a killer for high performance workloads. The objective of partitioning in key-value oriented data is to collocate similar range of keys and in effect minimize shuffling. Data inside RDD is split into partitions and across various nodes of the cluster. Typed: Data in an RDD is strongly typed. When you create an RDD, all the elements are typed depending on the data type. Lazy evaluation: The transformations in Spark are lazy, which means data inside RDD is not available until you perform an action. 
You can, however, make the data available at any time using a count() action on the RDD. We’ll discuss this later and the benefits associated with it. Immutable: An RDD once created cannot be changed. It can, however, be transformed into a new RDD by performing a set of transformations on it. Parallel: An RDD is operated on in parallel. Since the data is spread across a cluster in various partitions, each partition is operated on in parallel. Cacheable: Since RDD’s are lazily evaluated, any action on an RDD will cause the RDD to revaluate all transformations that led to the creation of RDD. This is generally not a desirable behavior on large datasets, and hence Spark allows the option to persist the data on memory or disk. A typical Spark program flow with an RDD includes: Creation of an RDD from a data source. A set of transformations, for example, filter, map, join, and so on. Persisting the RDD to avoid re-execution. Calling actions on the RDD to start performing parallel operations across the cluster. This is depicted in the following figure: Figure 2-2: Typical Spark RDD flow Operations on RDD Two major operation types can be performed on an RDD. They are called: Transformations Actions Transformations Transformations are operations that create a new dataset, as RDDs are immutable. They are used to transform data from one to another, which could result in amplification of the data, reduction of the data, or a totally different shape altogether. These operations do not return any value back to the driver program, and hence lazily evaluated, which is one of the main benefits of Spark. An example of a transformation would be a map function that will pass through each element of the RDD and return a totally new RDD representing the results of application of the function on the original dataset. Actions Actions are operations that return a value to the driver program. As previously discussed, all transformations in Spark are lazy, which essentially means that Spark remembers all the transformations carried out on an RDD, and applies them in the most optimal fashion when an action is called. For example, you might have a 1 TB dataset, which you pass through a set of map functions by applying various transformations. Finally, you apply the reduce action on the dataset. Apache Spark will return only a final dataset, which might be few MBs rather than the entire 1 TB dataset of mapped intermediate result. You should, however, remember to persist intermediate results otherwise Spark will recompute the entire RDD graph each time an Action is called. The persist() method on an RDD should help you avoid recomputation and saving intermediate results. We’ll look at this in more detail later. Let’s illustrate the work of transformations and actions by a simple example. In this specific example, we’ll be using flatmap() transformations and a count action. We’ll use the README.md file from the local filesystem as an example. We’ll give a line-by-line explanation of the Scala example, and then provide code for Python and Java. As always, you must try this example with your own piece of text and investigate the results: //Loading the README.md file val dataFile = sc.textFile(“README.md”) Now that the data has been loaded, we’ll need to run a transformation. 
Since we know that each line of the text is loaded as a separate element, we’ll need to run a flatMap transformation and separate out individual words as separate elements, for which we’ll use the split function and use space as a delimiter: //Separate out a list of words from individual RDD elements val words = dataFile.flatMap(line => line.split(“ “)) Remember that until this point, while you seem to have applied a transformation function, nothing has been executed and all the transformations have been added to the logical plan. Also note that the transformation function returns a new RDD. We can then call the count() action on the words RDD, to perform the computation, which then results in fetching of data from the file to create an RDD, before applying the transformation function specified. You might note that we have actually passed a function to Spark: //Separate out a list of words from individual RDD elements Words.count() Upon calling the count() action the RDD is evaluated, and the results are sent back to the driver program. This is very neat and especially useful during big data applications. If you are Python savvy, you may want to run the following code in PySpark. You should note that lambda functions are passed to the Spark framework: //Loading data file, applying transformations and action dataFile = sc.textFile("README.md") words = dataFile.flatMap(lambda line: line.split(" ")) words.count() Programming the same functionality in Java is also quite straight forward and will look pretty similar to the program in Scala: JavaRDD<String> lines = sc.textFile("README.md"); JavaRDD<String> words = lines.map(line -> line.split(“ “)); int wordCount = words.count(); This might look like a simple program, but behind the scenes it is taking the line.split(“ ”) function and applying it to all the partitions in the cluster in parallel. The framework provides this simplicity and does all the background work of coordination to schedule it across the cluster, and get the results back. Passing functions to Spark (Scala) As you have seen in the previous example, passing functions is a critical functionality provided by Spark. From a user’s point of view you would pass the function in your driver program, and Spark would figure out the location of the data partitions across the cluster memory, running it in parallel. The exact syntax of passing functions differs by the programming language. Since Spark has been written in Scala, we’ll discuss Scala first. In Scala, the recommended ways to pass functions to the Spark framework are as follows: Anonymous functions Static singleton methods Anonymous functions Anonymous functions are used for short pieces of code. They are also referred to as lambda expressions, and are a cool and elegant feature of the programming language. The reason they are called anonymous functions is because you can give any name to the input argument and the result would be the same. For example, the following code examples would produce the same output: val words = dataFile.map(line => line.split(“ “)) val words = dataFile.map(anyline => anyline.split(“ “)) val words = dataFile.map(_.split(“ “)) Figure 2-11: Passing anonymous functions to Spark in Scala Static singleton functions While anonymous functions are really helpful for short snippets of code, they are not very helpful when you want to request the framework for a complex data manipulation. Static singleton functions come to the rescue with their own nuances, which we will discuss in this section. 
In software engineering, the Singleton pattern is a design pattern that restricts instantiation of a class to one object. This is useful when exactly one object is needed to coordinate actions across the system. Static methods belong to the class and not to an instance of it. They usually take input from the parameters, perform actions on it, and return a result.

Figure 2-12: Passing static singleton functions to Spark in Scala

A static singleton is the preferred way to pass functions. Technically, you can also create a class and call a method on an instance of that class. For example:

import org.apache.spark.rdd.RDD

class UtilFunctions {
  def split(inputParam: String): Array[String] = { inputParam.split(" ") }
  def operate(rdd: RDD[String]): RDD[String] = { rdd.flatMap(split) }
}

You can pass a method of a class instance like this, but it has performance implications, as the entire object is sent to the cluster along with the method.

Passing functions to Spark (Java)

In Java, to create a function you have to implement the interfaces available in the org.apache.spark.api.java.function package. There are two popular ways to create such functions:

 Implement the interface in your own class, and pass an instance to Spark.
 Starting with Java 8, you can use lambda expressions to pass the functions to the Spark framework.

Let's reimplement the preceding word count examples in Java:

Figure 2-13: Code example of Java implementation of word count (inline functions)

If you belong to the group of programmers who feel that writing inline functions makes the code complex and unreadable (a lot of people do agree with that assertion), you may want to create separate functions and call them as follows:

Figure 2-14: Code example of Java implementation of word count

Passing functions to Spark (Python)

Python provides a simple way to pass functions to Spark. The Spark programming guide available at spark.apache.org suggests there are three recommended ways to do this:

 Lambda expressions: the ideal way for short functions that can be written inside a single expression
 Local defs inside the function calling into Spark, for longer code
 Top-level functions in a module

While we have already looked at lambda functions in some of the previous examples, let's look at local definitions of the functions. Our example stays the same: we are trying to count the total number of words in a text file in Spark:

def splitter(lineOfText):
    words = lineOfText.split(" ")
    return len(words)

def aggregate(numWordsLine1, numWordsLineNext):
    totalWords = numWordsLine1 + numWordsLineNext
    return totalWords

Let's see the working code example (a minimal sketch of it also appears at the end of this section):

Figure 2-15: Code example of Python word count (local definition of functions)

Here's another way to implement this, by defining the functions as part of a UtilFunctions class and referencing them within your map and reduce functions:

Figure 2-16: Code example of Python word count (Utility class)

You may want to be a bit cheeky here and try to add a countWords() method to UtilFunctions, so that it takes an RDD as input and returns the total number of words. This approach has potential performance implications, as the whole object will need to be sent to the cluster. Let's see how this can be implemented and the results in the following screenshot:

Figure 2-17: Code example of Python word count (Utility class - 2)

This can be avoided by making a copy of the referenced data field in a local object, rather than accessing it externally.
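Since the screenshots referenced above aren't reproduced in the text, here is a minimal sketch of what the local-definitions version (Figure 2-15) might look like. It assumes the splitter() and aggregate() functions defined above, an existing SparkContext sc, and README.md as an example input file:

# Count the words on each line, then sum the per-line counts
dataFile = sc.textFile("README.md")
totalWords = dataFile.map(splitter).reduce(aggregate)
print(totalWords)

Here, map(splitter) is a transformation that produces an RDD of per-line word counts, and reduce(aggregate) is the action that triggers the computation and returns the total to the driver program.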
Now that we have had a look at how to pass functions to Spark, and have already seen some of the transformations and actions in the previous examples, including map, flatMap, and reduce, let's look at the most common transformations and actions used in Spark. The list is not exhaustive, and you can find more examples in the Apache Spark programming guide. If you would like a comprehensive list of all the available functions, you might want to check the following API docs:

         RDD                     PairRDD
Scala    http://bit.ly/2bfyoTo   http://bit.ly/2bfzgah
Python   http://bit.ly/2bfyURl   N/A
Java     http://bit.ly/2bfyRov   http://bit.ly/2bfyOsH
R        http://bit.ly/2bfyrOZ   N/A

Table 2.1 – RDD and PairRDD API references

Transformations

The following table shows the most common transformations:

map(func)                                 coalesce(numPartitions)
filter(func)                              repartition(numPartitions)
flatMap(func)                             repartitionAndSortWithinPartitions(partitioner)
mapPartitions(func)                       join(otherDataset, [numTasks])
mapPartitionsWithIndex(func)              cogroup(otherDataset, [numTasks])
sample(withReplacement, fraction, seed)   cartesian(otherDataset)

map(func)

The map transformation is the most commonly used and the simplest of the transformations on an RDD. It applies the function passed in its argument to each of the elements of the source RDD. In the previous examples, we have seen the usage of the map() transformation, where we passed the split() function to the input RDD.

Figure 2-18: Operation of a map() function

We'll not give further examples of map(), as we have already seen plenty of them previously.

filter(func)

filter, as the name implies, filters the input RDD and creates a new dataset containing only the elements that satisfy the predicate passed as an argument.

Example 2-1: Scala filtering example:

val dataFile = sc.textFile("README.md")
val linesWithApache = dataFile.filter(line => line.contains("Apache"))

Example 2-2: Python filtering example:

dataFile = sc.textFile("README.md")
linesWithApache = dataFile.filter(lambda line: "Apache" in line)

Example 2-3: Java filtering example:

JavaRDD<String> dataFile = sc.textFile("README.md");
JavaRDD<String> linesWithApache = dataFile.filter(line -> line.contains("Apache"));

flatMap(func)

The flatMap transformation is similar to map, but it offers a bit more flexibility. Like map, it operates on all the elements of the RDD, but the flexibility stems from its ability to handle functions that return a sequence rather than a single item. As you saw in the preceding examples, we used flatMap to flatten the result of the split(" ") function, which gives a flattened structure of words rather than an RDD of string arrays.

Figure 2-19: Operational details of the flatMap() transformation

Let's look at the flatMap example in Scala.

Example 2-4: The flatMap() example in Scala:

val movies = sc.parallelize(List("Pulp Fiction", "Requiem for a dream", "A clockwork Orange"))
movies.flatMap(movieTitle => movieTitle.split(" ")).collect()

A flatMap in the Python API produces similar results.

Example 2-5: The flatMap() example in Python:

movies = sc.parallelize(["Pulp Fiction", "Requiem for a dream", "A clockwork Orange"])
movies.flatMap(lambda movieTitle: movieTitle.split(" ")).collect()

The flatMap example in Java is a bit long-winded, but it essentially produces the same results.
Example 2-6: The flatMap() example in Java:

JavaRDD<String> movies = sc.parallelize(
    Arrays.asList("Pulp Fiction", "Requiem for a dream", "A clockwork Orange"));
JavaRDD<String> movieName = movies.flatMap(
    new FlatMapFunction<String, String>() {
        public Iterator<String> call(String movie) {
            return Arrays.asList(movie.split(" ")).iterator();
        }
    });

sample(withReplacement, fraction, seed)

Sampling is an important component of any data analysis, and it can have a significant impact on the quality of your results/findings. Spark provides an easy way to sample RDDs for your calculations, if you would prefer to quickly test your hypothesis on a subset of data before running it on the full dataset. Here is a quick overview of the parameters that are passed to the method:

 withReplacement: A Boolean (true/false) that indicates whether elements can be sampled multiple times (replaced when sampled out). Sampling with replacement means that the two sample values are independent; in practical terms, what we get on the first draw doesn't affect what we get on the second draw, and hence the covariance between the two samples is zero. If we sample without replacement, the two samples aren't independent; what we got on the first draw affects what we get on the second, and hence the covariance between them isn't zero.
 fraction: Indicates the expected size of the sample as a fraction of the RDD's size. The fraction must be between 0 and 1. For example, if you want to draw a 5% sample, you can choose 0.05 as the fraction.
 seed: The seed used for the random number generator.

Let's look at the sampling example in Scala.

Example 2-7: The sample() example in Scala:

val data = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20))
data.sample(true, 0.1, 12345).collect()

The sampling example in Python looks similar to the one in Scala.

Example 2-8: The sample() example in Python:

data = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20])
data.sample(True, 0.1, 12345).collect()

In Java, our sampling example returns an RDD of integers.

Example 2-9: The sample() example in Java:

JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(
    1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20));
nums.sample(true, 0.1, 12345).collect();

References

https://spark.apache.org/docs/latest/programming-guide.html
http://www.purplemath.com/modules/numbprop.htm

Summary

We have gone from the concept of creating an RDD to manipulating data within it. We've looked at the transformations and actions available on an RDD, and walked through various code examples to explain the differences between transformations and actions.

Resources for Article:

Further resources on this subject:

 Getting Started with Apache Spark [article]
 Getting Started with Apache Spark DataFrames [article]
 Sabermetrics with Apache Spark [article]

RxSwift Part 2: Querying Subjects

Darren Karl
23 Feb 2017
6 min read
How many times should queries really be performed? Let's discuss what the right amount of querying is. To keep things simple initially, the constraints that we want to have are the following:

 It must display the queue information when we view the view controller.
 It needs to refresh the queue when we press a button on the view controller.

Because we are assuming that the Observables we are working with are wrapped with the Observable.create() method, as defined in the first part of this series, they are inherently cold observables, and cold observables only perform the observable code once, which is on subscription. We are also defining and wrapping the network call into a reliable Observable<[Queue]>, but what we need is to have this network call performed when the button is tapped.

Introducing Subjects

The desired behavior of the UITableView is to not simply "observe" the present state of the queue data (that is, the behavior of a cold observable, which is to have only one emission, usually containing one element; in our case, the current state of the queue); we want it to be reactive. We want it to change whenever we press a button and the data changes on the server. We can do this by using hot observables in the form of subjects. Take this time to browse briefly through the ReactiveX documentation; a brief snippet of its description is shown below:

"Because a Subject subscribes to an Observable, it will trigger that Observable to begin emitting items (if that Observable is "cold", that is, if it waits for a subscription before it begins to emit items). This can have the effect of making the resulting Subject a "hot" Observable variant of the original "cold" Observable."

Because a subject is both an observer and an observable, we can use it to subscribe to an observable. We can come up with this implementation:

private var queues = BehaviorSubject<[Queue]>(value: [])
private var disposeBag = DisposeBag()

func onButtonPress() {
    getQueue()
        .subscribeOn(ConcurrentDispatchQueueScheduler(queue: networkQueue))
        .observeOn(MainScheduler.instance)
        .subscribe(queues.asObserver())
        .addDisposableTo(disposeBag)
}

The onButtonPress() method can even be called in viewDidLoad to load the first set of data. Now hook that up to the button with an IBAction; pressing the button now queries the data from the server!

Great! Are we done? Not quite. A problem that might arise is that your users may become impatient while you load your data. What happens when they click the button repeatedly? By doing so, the user will have created as many subscriptions as the number of times they impatiently clicked the button. This is not how we want our code to work.

You can look into operators such as flatMapLatest, which ignores the previous subscription when there is a new subscription. You might also get pointed to the throttle or debounce operators, which limit how often items are emitted; debounce, for example, only emits an item from an Observable if a particular timespan has passed without it emitting another item. We don't expect our average users to tap like crazy, but this is where you begin to explore the other operators.

A quick UI fix might be to present an alert telling the users that the data is being loaded (and praying that the data is loaded faster than the user can close the alert and click the button again), but personally, I believe it doesn't provide a great user experience. I personally prefer a balance: flatMapLatest so that old subscriptions are ignored, combined with a small UI tweak described next.
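As an illustration, here is a minimal sketch of how the flatMapLatest approach could be wired up. It assumes RxCocoa is available, that refreshButton is an IBOutlet for the button, and that getQueue(), queues, networkQueue, and disposeBag are defined as above; treat it as a sketch rather than the final implementation:

refreshButton.rx.tap
    .startWith(())  // trigger an initial load, similar to calling onButtonPress() in viewDidLoad
    .flatMapLatest { [unowned self] _ in
        self.getQueue()
            .subscribeOn(ConcurrentDispatchQueueScheduler(queue: self.networkQueue))
    }
    .observeOn(MainScheduler.instance)
    .subscribe(queues.asObserver())
    .addDisposableTo(disposeBag)

Because flatMapLatest always switches to the newest inner Observable, a tap that arrives while a previous request is still in flight simply replaces that request instead of piling up subscriptions.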
The UI tweak is to disable the user interface by overlaying a UIView, so that the user is informed that the data is loading and so that the button temporarily doesn't recognize taps.

Pushing further: more reactive

If you want your app to be even more reactive, you can replace the trigger (the button press) with something else! You want it to poll data every 5 minutes? Use a timer! You can even hook up your BehaviorSubject to some kind of synchronization library, or use WebSocket technology, so that when your server informs you that data has changed, your app responds in real time! Tread carefully, because you need to think about synchronization issues and thread safety here.

But why do we use a BehaviorSubject? Why not a different kind of Subject, or a Variable? ReactiveX describes BehaviorSubject as follows:

"When an observer subscribes to a BehaviorSubject, it begins by emitting the item most recently emitted by the source Observable (or a seed/default value if none has yet been emitted), and then continues to emit any other items emitted later by the source Observable(s)."

In the same way, we want our UITableView to receive something as soon as it subscribes. We can seed the subject initially with an empty array of Queue, and then later load the data from the database or from a network query.

Dealing with the abstract

Hopefully, this two-part series has enlightened and encouraged you in your professional journey into the benefits of the functional reactive programming that RxSwift provides. My personal take is that the Reactive Extensions are quite difficult to grasp because they are abstract. The abstraction allows more experienced programmers to use the appropriate tools when deemed fit. I hope that this series has brought the beautiful abstract definitions into a more concrete example, and that it has been a gentle introduction to what hot and cold observables are, what subjects are, and how to use them. I framed them with simple assumptions that are quite common in real-life situations within the context of application development, to make it much more relatable and concrete.

About the Author

Darren Karl Sapalo is a software developer, an advocate for UX, and a student taking up his Master's degree in Computer Science. He enjoyed developing games in his free time when he was twelve. Having finished his undergraduate thesis on computer vision, he took up industry work with Apollo Technologies Inc., developing for both Android and iOS platforms.