

Proxmox VE Fundamentals

Packt
04 Apr 2016
12 min read
In this article, written by Rik Goldman, author of the book Learning Proxmox VE, we introduce Proxmox Virtual Environment (PVE): a mature, complete, well-supported, enterprise-class virtualization environment for servers. It is an open source tool, based on the Debian GNU/Linux distribution, that manages containers, virtual machines, storage, virtualized networks, and high-availability clustering through a well-designed web-based interface or via the command-line interface.

Developers provided the first stable release of Proxmox VE in 2008; four years and eight point releases later, ZDNet's Ken Hess boldly, but quite sensibly, declared it the ultimate hypervisor in his article Proxmox: The Ultimate Hypervisor (http://www.zdnet.com/article/proxmox-the-ultimate-hypervisor/). Four years after that, PVE is on version 4.1, in use on at least 90,000 hosts, with more than 500 commercial customers in 140 countries; the web-based administrative interface itself is translated into nineteen languages.

This article explores the fundamental technologies underlying PVE's hypervisor features: LXC, KVM, and QEMU. To do so, we will develop a working understanding of virtual machines, containers, and their appropriate use. We will cover the following topics:

- Proxmox VE in brief
- Virtualization and containerization with PVE
- Proxmox VE virtual machines, KVM, and QEMU
- Containerization with PVE and LXC

Proxmox VE in brief

With Proxmox VE, Proxmox Server Solutions GmbH (https://www.proxmox.com/en/about) provides us with an enterprise-ready, open source type II hypervisor. Below are some of the features that make Proxmox VE such a strong enterprise candidate.

The license for Proxmox VE is very deliberately the GNU Affero General Public License (V3) (https://www.gnu.org/licenses/agpl-3.0.html). From among the many free and open source compatible licenses available, this is a significant choice because it is "specifically designed to ensure cooperation with the community in the case of network server software."

PVE is primarily administered from an integrated web interface, or from the command line locally or via SSH. Consequently, there is no need for a separate management server and the associated expenditure. In this way, Proxmox VE contrasts significantly with alternative enterprise virtualization solutions by vendors such as VMware.

Proxmox VE instances/nodes can be incorporated into PVE clusters and centrally administered from a unified web interface.

Proxmox VE provides for live migration: the movement of a virtual machine or container from one cluster node to another without any disruption of services. This is a rather unique feature of PVE and not common to competing products.

Feature | Proxmox VE | VMware vSphere
Hardware requirements | Flexible | Strict compliance with HCL
Integrated management interface | Web- and shell-based (browser and SSH) | No; requires dedicated management server at additional cost
Simple subscription structure | Yes; based on number of premium support tickets per year and CPU socket count | No
High availability | Yes | Yes
VM live migration | Yes | Yes
Supports containers | Yes | No
Virtual machine OS support | Windows and Linux | Windows, Linux, and Unix
Community support | Yes | No
Live VM snapshots | Yes | Yes

Contrasting Proxmox VE and VMware vSphere features

For a complete catalog of features, see the Proxmox VE datasheet at https://www.proxmox.com/images/download/pve/docs/Proxmox-VE-Datasheet.pdf.
Like its competitors, PVE is a hypervisor: software that creates, runs, configures, and manages virtual machines based on an administrator's or engineer's choices. PVE is known as a type II hypervisor because the virtualization layer is built upon an operating system. By contrast, a type I hypervisor (such as VMware's ESXi) runs directly on bare metal, without the mediation of an operating system; it has no additional function beyond managing virtualization and the physical hardware.

As a type II hypervisor, Proxmox VE is built on the Debian project. Debian is a GNU/Linux distribution renowned for its reliability, its commitment to security, and its thriving, dedicated community of contributing developers. In Proxmox VE's case, the operating system is Debian; since the release of PVE 4.0, the underlying operating system has been Debian "Jessie."

Debian-based GNU/Linux distributions are arguably the most popular GNU/Linux distributions for the desktop. One characteristic that distinguishes Debian from competing distributions is its release policy: Debian releases only when its development community can stand behind it for its stability, security, and usability. Debian does not distinguish between long-term support releases and regular releases as some other distributions do. Instead, all Debian releases receive strong support and critical updates through the first year following the next release. (Since 2007, a major release of Debian has been made about every two years; Debian 8, Jessie, was released just about on schedule in 2015.) Proxmox VE's reliance on Debian is thus a testament to its commitment to these values: stability, security, and usability over scheduled releases that favor cutting-edge features.

PVE provides its virtualization functionality through three open technologies, and through the efficiency with which they are integrated by its administrative web interface:

- LXC
- KVM
- QEMU

To understand how this foundation serves Proxmox VE, we must first clearly understand the relationship between virtualization (or, specifically, hardware virtualization) and containerization (OS virtualization). As we proceed, their respective use cases should become clear.

Virtualization and containerization with Proxmox VE

It is correct to ultimately understand containerization as a type of virtualization. However, here we will first conceptually distinguish a virtual machine from a container by focusing on contrasting characteristics.

Simply put, virtualization is a technique through which we provide fully functional computing resources without a demand for the resources' physical organization, location, or relative proximity. Briefly put, virtualization technology allows you to share and allocate the resources of a physical computer into multiple execution environments. Without context, virtualization is a vague term, encapsulating the abstraction of such resources as storage, networks, servers, desktop environments, and even applications from their concrete hardware requirements through software implementation solutions called hypervisors.
Virtualization thus affords us more flexibility, more functionality, and a significant positive impact on our budgets, often realized with merely the resources we have at hand.

In terms of PVE, virtualization most commonly refers to the abstraction of all aspects of a discrete computing system from its hardware. In this context, virtualization is the creation, in other words, of a virtual machine or VM, with its own operating system and applications. A VM may be initially understood as a computer that has the same functionality as a physical machine. Likewise, it may be incorporated into, and communicated with via, a network exactly as a machine with physical hardware would be. Put yet another way, from inside a VM we will experience no difference by which we can distinguish it from a physical computer.

The virtual machine, moreover, does not have the physical footprint of its physical counterparts. The hardware it relies on is, in fact, provided by software that borrows hardware resources from a host installed on a physical machine (or bare metal). Nevertheless, the software components of the virtual machine, from the applications to the operating system, are distinctly separated from those of the host machine. This advantage is realized when it comes to allocating physical space for resources.

For example, we may have a PVE server running a web server, database server, firewall, and log management system, all as discrete virtual machines. Rather than consuming the physical space, resources, and labor of maintaining four physical machines, we simply make physical room for the single Proxmox VE server and configure an appropriate virtual LAN as necessary.

In a white paper entitled Putting Server Virtualization to Work, AMD articulates well the benefits of virtualization to businesses and developers (https://www.amd.com/Documents/32951B_Virtual_WP.pdf):

Top 5 business benefits of virtualization:
- Increases server utilization
- Improves service levels
- Streamlines manageability and security
- Decreases hardware costs
- Reduces facility costs

The benefits of virtualization for a development and test environment:
- Lowers capital and space requirements
- Lowers power and cooling costs
- Increases efficiency through shorter test cycles
- Faster time-to-market

To these benefits, let's add portability and encapsulation: the unique ability to migrate a live VM from one PVE host to another without suffering a service outage.

Proxmox VE makes the creation and control of virtual machines possible through the combined use of two free and open source technologies: Kernel-based Virtual Machine (KVM) and Quick Emulator (QEMU). Used together, we refer to this integration of tools as KVM-QEMU.

KVM

KVM has been an integral part of the Linux kernel since February 2007. This kernel module allows GNU/Linux users and administrators to take advantage of an architecture's hardware virtualization extensions; for our purposes, these extensions are AMD's AMD-V and Intel's VT-x for the x86_64 architecture. To really make the most of Proxmox VE's feature set, you'll therefore very much want to install on an x86_64 machine whose CPU has integrated virtualization extensions. For a full list of AMD and Intel processors supported by KVM, visit Intel at http://ark.intel.com/Products/VirtualizationTechnology or AMD at http://support.amd.com/en-us/kb-articles/Pages/GPU120AMDRVICPUsHyperVWin8.aspx.
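As a practical aside, on a Linux host you can confirm that the CPU exposes these extensions by looking for the vmx (Intel VT-x) or svm (AMD-V) flag in /proc/cpuinfo. The following minimal Python sketch is an illustration of that check only; it is not code from the book, and it assumes a Linux host:

# check_virt_extensions.py - minimal sketch; assumes a Linux host with /proc/cpuinfo
def has_virtualization_extensions(cpuinfo_path="/proc/cpuinfo"):
    """Return 'vmx' (Intel VT-x), 'svm' (AMD-V), or None if neither flag is present."""
    with open(cpuinfo_path) as f:
        for line in f:
            if line.startswith("flags"):
                flags = line.split(":", 1)[1].split()
                if "vmx" in flags:
                    return "vmx"
                if "svm" in flags:
                    return "svm"
    return None

if __name__ == "__main__":
    ext = has_virtualization_extensions()
    if ext:
        print("Hardware virtualization extensions found: %s" % ext)
    else:
        print("No vmx/svm flag found; KVM acceleration will not be available.")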
QEMU

QEMU provides an emulation and virtualization interface that can be scripted or otherwise controlled by a user.

Visualizing the relationship between KVM and QEMU

Without Proxmox VE, we could essentially define the hardware, create a virtual disk, and start and stop a virtualized server from the command line using QEMU. Alternatively, we could rely on any one of an array of GUI frontends for QEMU (a list of GUIs available for various platforms can be found at http://wiki.qemu.org/Links#GUI_Front_Ends).

Of course, working with these solutions is productive only if you're interested in what goes on behind the scenes in PVE when virtual machines are defined. Proxmox VE's management of virtual machines is itself managing QEMU through its API.

Managing QEMU from the command line can be tedious. The following is a line from a script that launched Raspbian, a Debian remix intended for the architecture of the Raspberry Pi, on an x86 Intel machine running Ubuntu. When we see how easy it is to manage VMs from Proxmox VE's administrative interfaces, we'll sincerely appreciate that relative simplicity:

qemu-system-arm -kernel kernel-qemu -cpu arm1176 -m 256 -M versatilepb -no-reboot -serial stdio -append "root=/dev/sda2 panic=1" -hda ./$raspbian_img -hdb swap

If you're familiar with QEMU's emulation features, it's perhaps important to note that we can't manage emulation through the tools and features Proxmox VE provides, despite its reliance on QEMU. It is possible from a bash shell provided by Debian, but the emulation can't be controlled through PVE's administration and management interfaces.

Containerization with Proxmox VE

Containers are a class of virtual machines (as containerization has enjoyed a renaissance since 2005, the term OS virtualization has become synonymous with containerization and is often used for clarity). However, by way of contrast with VMs, containers share operating system components, such as libraries and binaries, with the host operating system; a virtual machine does not.

Visually contrasting virtual machines with containers

The container advantage

This arrangement potentially allows a container to run leaner and with fewer hardware resources borrowed from the host. For many authors, pundits, and users, containers also offer a demonstrable advantage in terms of speed and efficiency. (However, it should be noted here that as resources such as RAM and more powerful CPUs become cheaper, this advantage will diminish.)

The Proxmox VE container is made possible through LXC as of version 4.0 (it was made possible through OpenVZ in previous PVE versions). LXC is the third fundamental technology serving Proxmox VE's ultimate interest. Like KVM and QEMU, LXC (or Linux Containers) is an open source technology. It allows a host to run, and an administrator to manage, multiple operating system instances as isolated containers on a single physical host.

Conceptually, then, a container very clearly represents a class of virtualization rather than an opposing concept. Nevertheless, it's helpful to maintain a clear distinction between a virtual machine and a container as we come to terms with PVE. The ideal implementation of a Proxmox VE guest is contingent on our distinguishing and choosing between a virtual-machine solution and a container solution. Since Proxmox VE containers share components of the host operating system and can offer advantages in terms of efficiency, this text will guide you through the creation of containers whenever the intended guest can be fully realized with Debian Jessie as our hypervisor's operating system without sacrificing features.
When our intent is a guest running a Microsoft Windows operating system, for example, a Proxmox VE container ceases to be a solution. In such a case, we turn instead to creating a virtual machine. We must rely on a VM precisely because the operating system components that Debian can share with a Linux container are not components a Microsoft Windows operating system can make use of.

Summary

In this article, we have come to terms with the three open source technologies that provide Proxmox VE's foundational features: containerization and virtualization with LXC, KVM, and QEMU. Along the way, we've come to understand that containers, while being a type of virtualization, have characteristics that distinguish them from virtual machines. These differences will be crucial as we determine which technology to rely on for a virtual server solution with Proxmox VE.

Morphology – Getting Our Feet Wet

Packt
04 Apr 2016
20 min read
In this article by Deepti Chopra, Nisheeth Joshi, and Iti Mathur, authors of the book Mastering Natural Language Processing with Python, morphology is defined as the study of the composition of words using morphemes. A morpheme is the smallest unit of a language that has a meaning. In this article, we will discuss stemming and lemmatizing, creating a stemmer and lemmatizer for non-English languages, developing a morphological analyzer and a morphological generator using machine learning tools, creating a search engine, and many other concepts. In brief, this article will cover the following topics:

- Introducing morphology
- Creating a stemmer and lemmatizer
- Developing a stemmer for non-English languages
- Creating a morphological analyzer
- Creating a morphological generator
- Creating a search engine

Introducing morphology

Morphology may be defined as the study of the production of tokens with the help of morphemes. A morpheme is the basic unit of language that carries a meaning. There are two types of morphemes: stems and affixes (suffixes, prefixes, infixes, and circumfixes). Stems are also referred to as free morphemes, since they can exist even without adding affixes. Affixes are referred to as bound morphemes, since they cannot exist in a free form; they always exist along with free morphemes. Consider the word "unbelievable". Here, "believe" is a stem or free morpheme; it can exist on its own. The morphemes "un" and "able" are affixes or bound morphemes; they cannot exist in a free form, but exist together with a stem.

There are three kinds of languages, namely isolating languages, agglutinative languages, and inflecting languages. Morphology has a different character in each of them. Isolating languages are those in which words are merely free morphemes and do not carry any tense (past, present, or future) or number (singular or plural) information. Mandarin Chinese is an example of an isolating language. Agglutinative languages are those in which small words combine together to convey compound information. Turkish is an example of an agglutinative language. Inflecting languages are those in which words can be broken down into simpler units, but all these simpler units exhibit different meanings. Latin is an example of an inflecting language.

There are morphological processes such as inflection, derivation, semi-affixes, combining forms, and cliticization. Inflection refers to transforming a word into a form that represents person, number, tense, gender, case, aspect, or mood; here, the syntactic category of the token remains the same. In derivation, the syntactic category of the word is also changed. Semi-affixes are bound morphemes that exhibit a word-like quality, for example, noteworthy, antisocial, anticlockwise, and so on.

Understanding stemmers

Stemming may be defined as the process of obtaining a stem from a word by eliminating the affixes from it. For example, for the word "raining", a stemmer would return the root word or stem "rain" by removing the affix "ing". In order to increase the accuracy of information retrieval, search engines mostly use stemming to obtain a stem and store it as an index word. Treating words that share a stem as if they had the same meaning is a kind of query expansion known as conflation. Martin Porter designed a well-known stemming algorithm known as the Porter stemming algorithm.
This algorithm is basically designed to replace and eliminate some well-known suffixes present in English words. To perform stemming in NLTK, we simply instantiate the PorterStemmer class and then perform stemming by calling the stem() method. Let's take a look at the code for stemming using the PorterStemmer class in NLTK:

>>> import nltk
>>> from nltk.stem import PorterStemmer
>>> stemmerporter = PorterStemmer()
>>> stemmerporter.stem('working')
'work'
>>> stemmerporter.stem('happiness')
'happi'

The PorterStemmer class is trained and has knowledge of many stems and word forms in the English language. The process of stemming takes place in a series of steps and transforms a word into a shorter word that may have a similar meaning to the root word. The StemmerI interface defines the stem() method, and all stemmers inherit from this interface. The inheritance diagram is depicted here:

Another stemming algorithm, known as the Lancaster stemming algorithm, was introduced at Lancaster University. Similar to the PorterStemmer class, the LancasterStemmer class is used in NLTK to implement Lancaster stemming. Let's consider the following code, which depicts Lancaster stemming in NLTK:

>>> import nltk
>>> from nltk.stem import LancasterStemmer
>>> stemmerlan = LancasterStemmer()
>>> stemmerlan.stem('working')
'work'
>>> stemmerlan.stem('happiness')
'happy'

We can also build our own stemmer in NLTK using RegexpStemmer. It works by accepting a regular expression and removing any matching prefix or suffix from a word when a match is found. Let's consider an example of stemming using RegexpStemmer in NLTK:

>>> import nltk
>>> from nltk.stem import RegexpStemmer
>>> stemmerregexp = RegexpStemmer('ing')
>>> stemmerregexp.stem('working')
'work'
>>> stemmerregexp.stem('happiness')
'happiness'
>>> stemmerregexp.stem('pairing')
'pair'

We can use RegexpStemmer in cases where stemming cannot be performed using PorterStemmer or LancasterStemmer.

The SnowballStemmer class is used to perform stemming in 13 languages other than English. In order to perform stemming using SnowballStemmer, first an instance is created for the language in which stemming needs to be performed, and then stemming is performed using the stem() method. Consider the following example, which performs stemming in Spanish and French in NLTK using SnowballStemmer:

>>> import nltk
>>> from nltk.stem import SnowballStemmer
>>> SnowballStemmer.languages
('danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')
>>> spanishstemmer = SnowballStemmer('spanish')
>>> spanishstemmer.stem('comiendo')
'com'
>>> frenchstemmer = SnowballStemmer('french')
>>> frenchstemmer.stem('manger')
'mang'

nltk.stem.api contains the StemmerI class, in which the stem() method is defined. Consider the following code present in NLTK, which enables stemming to be performed:

class StemmerI(object):
    """
    An interface that helps to eliminate morphological affixes from
    tokens; this process is known as stemming.
    """
    def stem(self, token):
        """
        Eliminate affixes from the token and return the stem.
        """
        raise NotImplementedError()
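Because every stemmer implements this interface, we can also sketch one of our own. The following toy suffix stripper is an illustration only; it is not part of NLTK or the book, and its suffix list is invented:

from nltk.stem.api import StemmerI

class SuffixStemmer(StemmerI):
    """A toy stemmer that strips a small, fixed set of suffixes (illustration only)."""
    suffixes = ('ing', 'ness', 'es', 's')

    def stem(self, token):
        for suffix in self.suffixes:
            # Only strip when enough of the word remains to look like a stem
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                return token[:-len(suffix)]
        return token

print(SuffixStemmer().stem('working'))    # 'work'
print(SuffixStemmer().stem('happiness'))  # 'happi'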
""" raise NotImplementedError() Here's the code used to perform stemming using multiple stemmers: >>> import nltk >>> from nltk.stem.porter import PorterStemmer >>> from nltk.stem.lancaster import LancasterStemmer >>> from nltk.stem import SnowballStemmer >>> def obtain_tokens(): With open('/home/p/NLTK/sample1.txt') as stem: tok = nltk.word_tokenize(stem.read()) return tokens >>> def stemming(filtered): stem=[] for x in filtered: stem.append(PorterStemmer().stem(x)) return stem >>> if_name_=="_main_": tok= obtain_tokens() >>> print("tokens is %s")%(tok) >>> stem_tokens= stemming(tok) >>> print("After stemming is %s")%stem_tokens >>> res=dict(zip(tok,stem_tokens)) >>> print("{tok:stemmed}=%s")%(result) Understanding lemmatization Lemmatization is the process in which we transform a word into a form that has a different word category. The word formed after lemmatization is entirely different from what it was initially. Consider an example of lemmatization in NLTK: >>> import nltk >>> from nltk.stem import WordNetLemmatizer >>> lemmatizer_output=WordNetLemmatizer() >>> lemmatizer_output.lemmatize('working') 'working' >>> lemmatizer_output.lemmatize('working',pos='v') 'work' >>> lemmatizer_output.lemmatize('works') 'work' WordNetLemmatizer may be defined as a wrapper around the so-called WordNet corpus, and it makes use of the morphy() function present in WordNetCorpusReader to extract a lemma. If no lemma is extracted, then the word is only returned in its original form. For example, for 'works', the lemma that is returned is in the singular form 'work'. This code snippet illustrates the difference between stemming and lemmatization: >>> import nltk >>> from nltk.stem import PorterStemmer >>> stemmer_output=PorterStemmer() >>> stemmer_output.stem('happiness') 'happi' >>> from nltk.stem import WordNetLemmatizer >>> lemmatizer_output.lemmatize('happiness') 'happiness' In the preceding code, 'happiness' is converted to 'happi' by stemming it. Lemmatization can't find the root word for 'happiness', so it returns the word "happiness". Developing a stemmer for non-English languages Polyglot is a software that is used to provide models called morfessor models, which are used to obtain morphemes from tokens. The Morpho project's goal is to create unsupervised data-driven processes. Its focuses on the creation of morphemes, which are the smallest units of syntax. Morphemes play an important role in natural language processing. They are useful in automatic recognition and the creation of language. With the help of the vocabulary dictionaries of polyglot, morfessor models on 50,000 tokens of different languages was used. Here's the code to obtain a language table using a polyglot: from polyglot.downloader import downloader print(downloader.supported_languages_table("morph2")) The output obtained from the preceding code is in the form of these languages listed as follows: 1. Piedmontese language 2. Lombard language 3. Gan Chinese 4. Sicilian 5. Scots 6. Kirghiz, Kyrgyz 7. Pashto, Pushto 8. Kurdish 9. Portuguese 10. Kannada 11. Korean 12. Khmer 13. Kazakh 14. Ilokano 15. Polish 16. Panjabi, Punjabi 17. Georgian 18. Chuvash 19. Alemannic 20. Czech 21. Welsh 22. Chechen 23. Catalan; Valencian 24. Northern Sami 25. Sanskrit (Saṁskṛta) 26. Slovene 27. Javanese 28. Slovak 29. Bosnian-Croatian-Serbian 30. Bavarian 31. Swedish 32. Swahili 33. Sundanese 34. Serbian 35. Albanian 36. Japanese 37. Western Frisian 38. French 39. Finnish 40. Upper Sorbian 41. Faroese 42. Persian 43. Sinhala, Sinhalese 44. Italian 45. 
Amharic 46. Aragonese 47. Volapük 48. Icelandic 49. Sakha 50. Afrikaans 51. Indonesian 52. Interlingua 53. Azerbaijani 54. Ido 55. Arabic 56. Assamese 57. Yoruba 58. Yiddish 59. Waray-Waray 60. Croatian 61. Hungarian 62. Haitian; Haitian Creole 63. Quechua 64. Armenian 65. Hebrew (modern) 66. Silesian 67. Hindi 68. Divehi; Dhivehi; Mald... 69. German 70. Danish 71. Occitan 72. Tagalog 73. Turkmen 74. Thai 75. Tajik 76. Greek, Modern 77. Telugu 78. Tamil 79. Oriya 80. Ossetian, Ossetic 81. Tatar 82. Turkish 83. Kapampangan 84. Venetian 85. Manx 86. Gujarati 87. Galician 88. Irish 89. Scottish Gaelic; Gaelic 90. Nepali 91. Cebuano 92. Zazaki 93. Walloon 94. Dutch 95. Norwegian 96. Norwegian Nynorsk 97. West Flemish 98. Chinese 99. Bosnian 100. Breton 101. Belarusian 102. Bulgarian 103. Bashkir 104. Egyptian Arabic 105. Tibetan Standard, Tib... 106. Bengali 107. Burmese 108. Romansh 109. Marathi (Marāthī) 110. Malay 111. Maltese 112. Russian 113. Macedonian 114. Malayalam 115. Mongolian 116. Malagasy 117. Vietnamese 118. Spanish; Castilian 119. Estonian 120. Basque 121. Bishnupriya Manipuri 122. Asturian 123. English 124. Esperanto 125. Luxembourgish, Letzeb... 126. Latin 127. Uighur, Uyghur 128. Ukrainian 129. Limburgish, Limburgan... 130. Latvian 131. Urdu 132. Lithuanian 133. Fiji Hindi 134. Uzbek 135. Romanian, Moldavian, ...

The necessary models can be downloaded using the following code:

%%bash
polyglot download morph2.en morph2.ar

[polyglot_data] Downloading package morph2.en to
[polyglot_data]     /home/rmyeid/polyglot_data...
[polyglot_data]   Package morph2.en is already up-to-date!
[polyglot_data] Downloading package morph2.ar to
[polyglot_data]     /home/rmyeid/polyglot_data...
[polyglot_data]   Package morph2.ar is already up-to-date!

Consider this example, which obtains morpheme output from Polyglot:

from polyglot.text import Text, Word

tokens = ["unconditional", "precooked", "impossible", "painful", "entered"]
for s in tokens:
    s = Word(s, language="en")
    print("{:<20}{}".format(s, s.morphemes))

unconditional       ['un', 'conditional']
precooked           ['pre', 'cook', 'ed']
impossible          ['im', 'possible']
painful             ['pain', 'ful']
entered             ['enter', 'ed']

If tokenization has not been performed properly, we can perform morphological analysis to split the text into its original constituents:

sent = "Ihopeyoufindthebookinteresting"
para = Text(sent)
para.language = "en"
para.morphemes
WordList(['I', 'hope', 'you', 'find', 'the', 'book', 'interesting'])

Morphological analyzers

Morphological analysis may be defined as the process of obtaining grammatical information about a token, given its suffix information. Morphological analysis can be performed in three ways: morpheme-based morphology (or the item and arrangement approach), lexeme-based morphology (or the item and process approach), and word-based morphology (or the word and paradigm approach). A morphological analyzer may be defined as a program that is responsible for the analysis of the morphology of a given input token: it analyzes the token and generates morphological information, such as gender, number, class, and so on, as output. In order to perform morphological analysis on a given non-whitespace token, the PyEnchant dictionary is used.
Consider the following code, which performs morphological analysis:

>>> import enchant
>>> s = enchant.Dict("en_US")
>>> tok = []
>>> def tokenize(st1):
        if not st1:
            return
        for j in range(len(st1), -1, -1):
            if s.check(st1[0:j]):
                tok.append(st1[0:j])
                st1 = st1[j:]
                tokenize(st1)
                break
>>> tokenize("itismyfavouritebook")
>>> tok
['it', 'is', 'my', 'favourite', 'book']
>>> tok = []
>>> tokenize("ihopeyoufindthebookinteresting")
>>> tok
['i', 'hope', 'you', 'find', 'the', 'book', 'interesting']

We can determine the category of a word using the following kinds of information:

- Morphological hints: suffix information helps us detect the category of a word. For example, the -ness and -ment suffixes occur with nouns.
- Syntactic hints: contextual information is conducive to determining the category of a word. For example, if we have found a word with the noun category, then syntactic hints will be useful in determining whether an adjective appears before or after that noun.
- Semantic hints: a semantic hint is also useful in determining the category of a word. For example, if we already know that a word represents the name of a location, then it will fall under the noun category.
- Open class: this refers to the class of words that is not fixed; its membership keeps increasing as new words are added to the language. Words in an open class are usually nouns. Prepositions, by contrast, are mostly a closed class.
- Morphology captured by the part-of-speech tagset: a part-of-speech tagset captures information that helps us perform morphology. For example, the word 'plays' would appear with a tag indicating the third person singular.

Omorfi (open morphology of Finnish) is a package licensed under version 3 of the GNU GPL. It is used for performing numerous tasks such as language modeling, morphological analysis, rule-based machine translation, information retrieval, statistical machine translation, morphological segmentation, ontologies, and spell checking and correction.

A morphological generator

A morphological generator is a program that performs the task of morphological generation. Morphological generation may be considered the opposite of morphological analysis: here, given the description of a word in terms of its number, category, stem, and so on, the original word is retrieved. For example, if root = go, part of speech = verb, tense = present, and the word occurs along with a third person singular subject, then a morphological generator would generate its surface form, that is, goes.
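A minimal sketch of this idea in Python (a toy illustration with an invented rule set; it is not one of the packages listed next) might look like this:

def generate_surface_form(root, pos, tense=None, person=None, number=None):
    """Toy morphological generator: combine a root with feature values to produce a surface form."""
    if pos == "verb" and tense == "present" and person == 3 and number == "singular":
        if root.endswith(("s", "sh", "ch", "x", "o")):
            return root + "es"   # go -> goes, watch -> watches
        return root + "s"        # walk -> walks
    return root                  # fall back to the bare root

print(generate_surface_form("go", "verb", tense="present", person=3, number="singular"))  # 'goes'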
There are many Python-based packages that perform morphological analysis and generation. Some of them are as follows:

- ParaMorfo: used to perform morphological generation and analysis of Spanish and Guarani nouns, adjectives, and verbs
- HornMorpho: used for morphological generation and analysis of Oromo and Amharic nouns and verbs, as well as Tigrinya verbs
- AntiMorfo: used for morphological generation and analysis of Quechua adjectives, verbs, and nouns, as well as Spanish verbs
- MorfoMelayu: used for morphological analysis of Malay words

Other examples of software used to perform morphological analysis and generation are as follows:

- Morph is a morphological generator and analyzer for the English language and the RASP system
- Morphy is a morphological generator, analyzer, and POS tagger for German
- Morphisto is a morphological generator and analyzer for German
- Morfette performs supervised learning (inflectional morphology) for Spanish and French

Search engines

PyStemmer 1.0.1 consists of Snowball stemming algorithms that are useful for performing information retrieval tasks and constructing a search engine. It consists of the Porter stemming algorithm and many other stemming algorithms that are useful for performing stemming and information retrieval in many languages, including many European languages.

We can construct a vector space search engine by converting texts into vectors. Here are the steps needed to construct a vector space search engine:

1. Stemming and elimination of stop words. A stemmer is a program that accepts words and converts them into stems. Tokens that have the same stem have almost the same meaning. Stop words are also eliminated from the text. Consider the following code for the removal of stop words and tokenization:

def eliminatestopwords(self, list):
    """
    Eliminate words which occur often and have little significance from a context point of view.
    """
    return [word for word in list if word not in self.stopwords]

def tokenize(self, string):
    """
    Split the text into tokens and stem them.
    """
    string = self.clean(string)
    words = string.split(" ")
    return [self.stemmer.stem(word, 0, len(word) - 1) for word in words]

2. Mapping keywords into vector dimensions. Here's the code required to perform the mapping of keywords into vector dimensions:

def obtainvectorkeywordindex(self, documentList):
    """
    In the document vectors, generate the keyword index for the given position of an element.
    """
    # Map the documents into a single string
    vocabstring = " ".join(documentList)
    vocablist = self.parser.tokenize(vocabstring)
    # Eliminate common words that have no search significance
    vocablist = self.parser.eliminatestopwords(vocablist)
    uniqueVocablist = util.removeDuplicates(vocablist)
    vectorIndex = {}
    offset = 0
    # Attach a position to each keyword; it maps to the dimension used to depict this token
    for word in uniqueVocablist:
        vectorIndex[word] = offset
        offset += 1
    return vectorIndex  # (keyword:position)

3. Mapping of text strings to vectors. Here, a simple term count model is used.
The code to convert text strings into vectors is as follows:

def constructVector(self, wordString):
    # Initialise the vector with 0's
    vector = [0] * len(self.vectorKeywordIndex)
    tokList = self.parser.tokenize(wordString)
    tokList = self.parser.eliminatestopwords(tokList)
    for word in tokList:
        vector[self.vectorKeywordIndex[word]] += 1  # simple term count model is used
    return vector

Searching similar documents

By finding the cosine of the angle between the vectors of two documents, we can determine whether the documents are similar or not. If the cosine value is 1, then the angle is 0 degrees and the vectors are parallel (this means that the documents are related). If the cosine value is 0 and the angle is 90 degrees, then the vectors are perpendicular (this means that the documents are not related). This is the code to compute the cosine between text vectors, using the dot and norm functions:

from numpy import dot
from numpy.linalg import norm

def cosine(vec1, vec2):
    """
    cosine = ( X * Y ) / ||X|| x ||Y||
    """
    return float(dot(vec1, vec2) / (norm(vec1) * norm(vec2)))

Search keywords

We perform the mapping of keywords to a vector space. We construct a temporary text that represents the items to be searched and then compare it with the document vectors with the help of a cosine measurement. Here is the code needed to search the vector space:

def searching(self, searchinglist):
    """
    Search for documents that match a list of search items
    """
    askVector = self.buildQueryVector(searchinglist)
    ratings = [util.cosine(askVector, textVector) for textVector in self.documentVectors]
    ratings.sort(reverse=True)
    return ratings

The following code can be used to detect the language of a source text:

>>> import nltk
>>> import sys
>>> try:
        from nltk import wordpunct_tokenize
        from nltk.corpus import stopwords
    except ImportError:
        print('Error has occurred')

>>> def _calculate_languages_ratios(text):
        """
        Compute a score for each language the given document could be written in
        and return a dictionary that looks like {'german': 2, 'french': 4, 'english': 1}
        """
        languages_ratios = {}
        # nltk.wordpunct_tokenize() splits all punctuation into separate tokens:
        # wordpunct_tokenize("I hope you like the book interesting .")
        # ['I', 'hope', 'you', 'like', 'the', 'book', 'interesting', '.']
        tok = wordpunct_tokenize(text)
        words = [word.lower() for word in tok]
        # Compute the occurrence of unique stopwords in the text
        for language in stopwords.fileids():
            stopwords_set = set(stopwords.words(language))
            words_set = set(words)
            common_elements = words_set.intersection(stopwords_set)
            languages_ratios[language] = len(common_elements)  # language "score"
        return languages_ratios

>>> def detect_language(text):
        """
        Compute the score of the given text for different languages and return
        the highest-scoring one. It uses a stopword-based approach, finding the
        unique stopwords present in the analyzed text.
        """
        ratios = _calculate_languages_ratios(text)
        most_rated_language = max(ratios, key=ratios.get)
        return most_rated_language

if __name__ == '__main__':
    text = '''
All over this cosmos, most of the people believe that there is an invisible supreme power that is the creator and the runner of this world. Human being is supposed to be the most intelligent and loved creation by that power and that is being searched by human beings in different ways into different things.
As a result people reveal His assumed form as per their own perceptions and beliefs. It has given birth to different religions and people are divided on the name of religion viz. Hindu, Muslim, Sikhs, Christian etc. People do not stop at this. They debate the superiority of one over the other and fight to establish their views. Shrewd people like politicians oppose and support them at their own convenience to divide them and control them. It has intensified to the extent that even parents of a new born baby teach it about religious differences and recommend their own religion superior to that of others and let the child learn to hate other people just because of religion. Jonathan Swift, an eighteenth century novelist, observes that we have just enough religion to make us hate, but not enough to make us love one another. The word 'religion' does not have a derogatory meaning - A literal meaning of religion is 'A personal or institutionalized system grounded in belief in a God or Gods and the activities connected with this'. At its basic level, 'religion is just a set of teachings that tells people how to lead a good life'. It has never been the purpose of religion to divide people into groups of isolated followers that cannot live in harmony together. No religion claims to teach intolerance or even instructs its believers to segregate a certain religious group or even take the fundamental rights of an individual solely based on their religious choices. It is also said that 'Majhab nhi sikhata aaps mai bair krna'. But this very majhab or religion takes a very heinous form when it is misused by the shrewd politicians and the fanatics e.g. in Ayodhya on 6th December, 1992 some right wing political parties and communal organizations incited the Hindus to demolish the 16th century Babri Masjid in the name of religion to polarize Hindus votes. Muslim fanatics in Bangladesh retaliated and destroyed a number of temples, assassinated innocent Hindus and raped Hindu girls who had nothing to do with the demolition of Babri Masjid. This very inhuman act has been presented by Taslima Nasrin, a Banglsdeshi Doctor-cum-Writer in her controversial novel 'Lajja' (1993) in which, she seems to utilizes fiction's mass emotional appeal, rather than its potential for nuance and universality.
    '''
    language = detect_language(text)
    print(language)

The preceding code will search for stop words and detect the language of the text, which is English.

Summary

In this article, we discussed stemming, lemmatization, and morphological analysis and generation.

How To Get Started with Redux in React Native

Emilio Rodriguez
04 Apr 2016
5 min read
In mobile development there is a need for architectural frameworks, but complex frameworks designed for web environments may end up damaging the development process or even the performance of our app. Because of this, some time ago I decided to introduce into all of my React Native projects the leanest framework I have ever worked with: Redux.

Redux is basically a state container for JavaScript apps. It is 100 percent library-agnostic, so you can use it with React, Backbone, or any other view library. Moreover, it is really small and has no dependencies, which makes it an awesome tool for React Native projects.

Step 1: Install Redux in your React Native project.

Redux can be added as an npm dependency to your project. Just navigate to your project's main folder and type:

npm install --save react-redux

At the time this article was written, React Native still depended on React Redux 3.1.0, since later versions depend on React 0.14, which is not 100 percent compatible with React Native. Because of this, you will need to force version 3.1.0 as the one your project depends on.

Step 2: Set up a Redux-friendly folder structure.

Of course, setting up the folder structure for your project is totally up to every developer, but you need to take into account that you will need to maintain a number of actions, reducers, and components. Besides, it's also useful to keep a separate folder for your API and utility functions so these won't be mixed with your app's core functionality. With this in mind, this is my preferred folder structure under the src folder in any React Native project:

Step 3: Create your first action.

In this article we will be implementing a simple login functionality to illustrate how to integrate Redux inside React Native. A good point to start this implementation is the action: a basic function called from the component whenever we want the whole state of the app to be changed (i.e. changing from the logged-out state into the logged-in state). To keep this example as concise as possible, we won't be doing any API calls to a backend; only the pure Redux integration will be explained.

Our action creator is a simple function returning an object (the action itself) with a type attribute expressing what happened with the app. No business logic should be placed here; our action creators should be really plain and descriptive.

Step 4: Create your first reducer.

Reducers are the ones in charge of updating the state of the app. Unlike in Flux, Redux has only one store for the whole app, but it will be conveniently namespaced automatically by Redux once the reducers have been applied. In our example, the user reducer needs to be aware of when the user is logged in. Because of that, it needs to import the LOGIN_SUCCESS constant we defined in our actions before and export a default function, which will be called by Redux every time an action occurs in the app. Redux will automatically pass in the current state of the app and the action that occurred. It's up to the reducer to decide whether it needs to modify the state or not, based on the action.type. That's why, almost every time, our reducer will be a function containing a switch statement, which modifies and returns the state based on what action occurred. It's important to state that Redux uses object references to identify when the state has changed. Because of this, the state should be cloned before any modification.
It's also interesting to know that the action passed to the reducers can contain attributes other than type. For example, when doing a more complex login, the user's first name and last name can be added to the action by the action creator and used by the reducer to update the state of the app.

Step 5: Create your component.

This step is almost pure React Native coding. We need a component to trigger the action and to respond to the change of state in the app. In our case it will be a simple View containing a button that disappears when logged in. This is a normal React Native component except for some pieces of Redux boilerplate:

- The three import lines at the top will require everything we need from Redux.
- 'mapStateToProps' and 'mapDispatchToProps' are two functions bound with 'connect' to the component: this lets Redux know that this component needs to be passed a piece of the state (everything under 'userReducers') and all the actions available in the app.
- Just by doing this, we will have access to the login action (as it is used in onLoginButtonPress) and to the state of the app (as it is used in the !this.props.user.loggedIn statement).

Step 6: Glue it all together from your index.ios.js.

For Redux to apply its magic, some initialization should be done in the main file of your React Native project (index.ios.js). This is pure boilerplate and only done once: Redux needs to inject a store holding the app state into the app. To do so, it requires a 'Provider' wrapping the whole app. This store is basically a combination of reducers. For this article we only need one reducer, but a full app will include many others, and each of them should be passed into the combineReducers function to be taken into account by Redux whenever an action is triggered.

About the Author

Emilio Rodriguez started working as a software engineer for Sun Microsystems in 2006. Since then, he has focused his efforts on building a number of mobile apps with React Native while contributing to the React Native project. These contributions helped him understand how deep and powerful this framework is.

Machine Learning Tasks

Packt
01 Apr 2016
16 min read
In this article, written by David Julian, author of the book Designing Machine Learning Systems with Python, we first introduce the basic machine learning tasks. Classification is probably the most common task, due in part to the fact that it is relatively easy, well understood, and solves a lot of common problems. Multiclass classification (for instance, handwriting recognition) can sometimes be achieved by chaining binary classification tasks. However, we lose information this way, and we become unable to define a single decision boundary. For this reason, multiclass classification is often treated separately from binary classification.

There are cases where we are not interested in discrete classes but rather in a real number, for instance, a probability. These types of problems are regression problems. Both classification and regression require a training set of correctly labelled data. They are supervised learning problems.

Originating from these basic machine learning tasks are a number of derived tasks. In many applications, this may simply be applying the learning model to make a prediction. We must remember that explaining and predicting are not the same. A model can make a prediction, but unless we know explicitly how it made the prediction, we cannot begin to form a comprehensible explanation. An explanation requires human knowledge of the domain.

We can also use a prediction model to find exceptions to a general pattern. Here, we are interested in the individual cases that deviate from the predictions. This is often called anomaly detection and has wide applications in areas such as detecting bank fraud, noise filtering, and even the search for extraterrestrial life.

An important and potentially useful task is subgroup discovery. Our goal here is not, as in clustering, to partition the entire domain, but rather to find a subgroup that has a substantially different distribution. In essence, subgroup discovery is trying to find relationships between a dependent target variable and many independent explaining variables. We are not trying to find a complete relationship, but rather a group of instances that differ in ways that are important in the domain. For instance, establishing the subgroup smoker = true and family history = true for the target variable heart disease = true.

Finally, we consider control-type tasks. These act to optimize control settings to maximize a payoff under different conditions. This can be achieved in several ways. We can clone expert behavior: the machine learns directly from a human and makes predictions of actions given different conditions. The task is to learn a prediction model for the expert's actions. This is similar to reinforcement learning, where the task is to learn about the relationship between conditions and optimal actions.

Clustering, on the other hand, is the task of grouping items without any information on those groups; this is an unsupervised learning task. Clustering is basically making a measurement of similarity. Related to clustering is association, which is an unsupervised task for finding a certain type of pattern in the data. This task is behind movie recommender systems and the "customers who bought this also bought..." features on the checkout pages of online stores.
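As a toy illustration of this kind of association (an invented example with hypothetical data, not code from the book), we can count how often items co-occur in past baskets and recommend the most frequent co-purchases:

from collections import Counter
from itertools import combinations

# Hypothetical purchase histories (one basket per customer)
baskets = [
    {"book", "laptop stand", "usb cable"},
    {"book", "usb cable"},
    {"laptop stand", "monitor"},
    {"book", "monitor", "usb cable"},
]

# Count how often every pair of items appears together in a basket
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

def also_bought(item, top_n=2):
    """Return the items most frequently bought together with `item`."""
    related = Counter()
    for (a, b), count in pair_counts.items():
        if item == a:
            related[b] += count
        elif item == b:
            related[a] += count
    return [other for other, _ in related.most_common(top_n)]

print(also_bought("book"))  # ['usb cable', 'laptop stand'] with this toy data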
Data for machine learning

When considering raw data for machine learning applications, there are three separate aspects:

- The volume of the data
- The velocity of the data
- The variety of the data

Data volume

The volume problem can be approached from three different directions: efficiency, scalability, and parallelism. Efficiency is about minimizing the time it takes for an algorithm to process a unit of information. A component of this is the underlying processing power of the hardware. The other component, and the one we have more control over, is ensuring that our algorithms are not wasting precious processing cycles on unnecessary tasks.

Scalability is really about brute force: throwing as much hardware at a problem as you can. With Moore's law (the trend of computing power doubling roughly every two years) reaching its limit, it is clear that scalability is not, by itself, going to be able to keep pace with the ever-increasing amounts of data. Simply adding more memory and faster processors is, in many cases, not going to be a cost-effective solution.

Parallelism is a growing area of machine learning, and it encompasses a number of different approaches, from harnessing the capabilities of multi-core processors to large-scale distributed computing on many different platforms. Probably the most common method is to simply run the same algorithm on many machines, each with a different set of parameters. Another method is to decompose a learning algorithm into an adaptive sequence of queries and have these queries processed in parallel. A common implementation of this technique is known as MapReduce, or its open source version, Hadoop.

Data velocity

The velocity problem is often approached in terms of data producers and data consumers. The data transfer rate between the two is its velocity, and it can be measured in interactive response times. This is the time it takes from a query being made to its response being delivered. Response times are constrained by latencies such as hard disk read and write times and the time it takes to transmit data across a network.

Data is being produced at ever greater rates, and this is largely driven by the rapid expansion of mobile networks and devices. The increasing instrumentation of daily life is revolutionizing the way products and services are delivered. This increasing flow of data has led to the idea of stream processing. When input data arrives at a velocity that makes it impossible to store in its entirety, some level of analysis is necessary as the data streams: in essence, deciding what data is useful and should be stored, and what data can be thrown away. An extreme example is the Large Hadron Collider at CERN, where the vast majority of data is discarded. A sophisticated algorithm must scan the data as it is being generated, looking for the information needle in the data haystack. Another instance where processing data streams might be important is when an application requires an immediate response. This is becoming increasingly common in applications such as online gaming and stock market trading.

It is not just the velocity of incoming data that we are interested in. In many applications, particularly on the web, the velocity of a system's output is also important. Consider applications such as recommender systems, which need to process large amounts of data and present a response in the time it takes for a web page to load.
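Returning to the idea of deciding what to keep as data streams past: one classic technique for retaining a bounded, representative subset of a stream too large to store is reservoir sampling. The following minimal Python sketch illustrates the idea and is not code from the book:

import random

def reservoir_sample(stream, k, seed=0):
    # Keep a uniform random sample of k items from a stream of unknown length,
    # using only O(k) memory (the classic "Algorithm R" reservoir sampling).
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir

# A stand-in for an unbounded event stream that cannot be stored in its entirety
events = (x * x for x in range(1000000))
print(reservoir_sample(events, k=5))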
Data variety

Collecting data from different sources invariably means dealing with misaligned data structures and incompatible formats. It also often means dealing with different semantics and having to understand a data system that may have been built on a rather different set of logical principles. We have to remember that, very often, data is repurposed for an entirely different application from the one it was originally intended for. There is a huge variety of data formats and underlying platforms, and significant time can be spent converting data into one consistent format. Even when this is done, the data itself needs to be aligned so that each record consists of the same number of features and is measured in the same units.

Models

The goal in machine learning is not just to solve an instance of a problem, but to create a model that will solve unique problems from new data. This is the essence of learning. A learning model must have a mechanism for evaluating its output and, in turn, changing its behavior to a state that is closer to a solution.

A model is essentially a hypothesis: a proposed explanation for a phenomenon. The goal is to apply a generalization to the problem. In the case of supervised learning, problem knowledge gained from the training set is applied to the unlabeled test set. In the case of an unsupervised learning problem, such as clustering, the system does not learn from a training set. It must learn from the characteristics of the dataset itself, such as the degree of similarity. In both cases, the process is iterative: it repeats a well-defined set of tasks that moves the model closer to a correct hypothesis.

There are many models, and as many variations on these models as there are unique solutions. We can see that the problems that machine learning systems solve (regression, classification, association, and so on) come up in many different settings. They have been used successfully in almost all branches of science, engineering, mathematics, and commerce, and also in the social sciences; they are as diverse as the domains they operate in. This diversity of models gives machine learning systems great problem-solving power. However, it can also be a bit daunting for the designer to decide which is the best model, or models, for a particular problem. To complicate things further, there are often several models that may solve your task, or your task may need several models. The most accurate and efficient pathway through an original problem is something you simply cannot know when you embark upon such a project.

There are several modeling approaches. These are really different perspectives that we can use to help us understand the problem landscape. A distinction can be made regarding how a model divides up the instance space. The instance space can be considered all possible instances of your data, regardless of whether each instance actually appears in the data; the data is a subset of the instance space. There are two approaches to dividing up this space: grouping and grading. The key difference between the two is that grouping models divide the instance space into fixed discrete units called segments. Each segment has a finite resolution and cannot distinguish between classes beyond this resolution. Grading, on the other hand, forms a global model over the entire instance space, rather than dividing the space into segments. In theory, the resolution of a grading model is infinite, and it can distinguish between instances no matter how similar they are.
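As a concrete illustration of the two approaches (a minimal sketch using scikit-learn, not code from the book), a clustering model such as k-means assigns every instance to one of a fixed number of segments, whereas a logistic regression assigns a score to any point in the instance space:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # a simple two-class labelling

# A grouping model: k-means carves the instance space into k fixed segments,
# and every instance falls into exactly one of them.
segments = KMeans(n_clusters=4, random_state=0).fit_predict(X)

# A grading model: logistic regression defines a score over the whole instance
# space, so any two distinct points can receive different scores.
grader = LogisticRegression().fit(X, y)
scores = grader.predict_proba(X)[:, 1]

print("distinct segment labels:", np.unique(segments))      # a finite set of segments
print("example graded scores:", np.round(scores[:5], 3))    # continuous values over the space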
The distinction between grouping and grading is not absolute, and many models contain elements of both.

Geometric models

One of the most useful approaches to machine learning modeling is through geometry. Geometric models use the concept of instance space. The most obvious example is when all the features are numerical and can become coordinates in a Cartesian coordinate system. When we only have two or three features, they are easy to visualize. Since many machine learning problems have hundreds or thousands of features, and therefore dimensions, visualizing these spaces is impossible. Importantly, many of the geometric concepts, such as linear transformations, still apply in this hyperspace. This can help us better understand our models. For instance, we expect many learning algorithms to be translation invariant, which means that it does not matter where we place the origin in the coordinate system. Also, we can use the geometric concept of Euclidean distance to measure similarity between instances; this gives us a method to cluster similar instances and form a decision boundary between them.

Probabilistic models

Often, we will want our models to output probabilities rather than just binary true or false. When we take a probabilistic approach, we assume that there is an underlying random process that creates a well-defined, but unknown, probability distribution.

Probabilistic models are often expressed in the form of a tree. Tree models are ubiquitous in machine learning, and one of their main advantages is that they can inform us about the underlying structure of a problem. Decision trees are naturally easy to visualize and conceptualize. They allow inspection and do not just give an answer. For example, if we have to predict a category, we can also expose the logical steps that gave rise to a particular result. Also, tree models generally require less data preparation than other models and can handle numerical and categorical data. On the downside, tree models can create overly complex models that do not generalize very well to new data. Another potential problem with tree models is that they can become very sensitive to changes in the input data, and as we will see later, this problem can be mitigated by using them as ensemble learners.

Linear models

A key concept in machine learning is that of the linear model. Linear models form the foundation of many advanced nonlinear techniques such as support vector machines and neural networks. They can be applied to any predictive task such as classification, regression, or probability estimation. When responding to small changes in the input data, and provided that our data consists of entirely uncorrelated features, linear models tend to be more stable than tree models. Tree models can over-respond to small variations in training data. This is because splits at the root of a tree have consequences that are not recoverable further down a branch, potentially making the rest of the tree significantly different. Linear models, on the other hand, are relatively stable, being less sensitive to initial conditions. However, as you would expect, this has the opposite effect of making them less sensitive to nuanced data. This is described by the terms variance (for overfitting models) and bias (for underfitting models). A linear model is typically low variance and high bias. Linear models are generally best approached from a geometric perspective.
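To make the geometric view concrete, here is a minimal sketch (using NumPy, with made-up feature values) that treats instances as points in instance space and uses Euclidean distance as the similarity measure mentioned above:

import numpy as np

# Three instances, each described by four numerical features,
# viewed as points (coordinates) in a four-dimensional instance space.
instances = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [6.7, 3.1, 5.6, 2.4],
])

query = np.array([5.0, 3.4, 1.5, 0.2])

# Euclidean distance from the query point to every instance.
distances = np.linalg.norm(instances - query, axis=1)
print(distances)

# The nearest instance is the most similar one; this is the basic idea
# behind nearest-neighbour classification and distance-based clustering.
print('most similar instance:', np.argmin(distances))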
We know we can easily plot two dimensions of space in a Cartesian coordinate system, and we can use the illusion of perspective to illustrate a third. We have also been taught to think of time as being a fourth dimension, but when we start speaking of n dimensions, the physical analogy breaks down. Intriguingly, we can still use many of the mathematical tools that we intuitively apply to three dimensions of space. While it becomes difficult to visualize these extra dimensions, we can still use the same geometric concepts (such as lines, planes, angles, and distance) to describe them. With geometric models, we describe each instance as having a set of real-valued features, each of which is a dimension in a space.

Model ensembles

Ensemble techniques can be divided broadly into two types:

- Averaging methods: With these, several estimators are run independently and their predictions are averaged. This includes the random forest and bagging methods.
- Boosting methods: With these, weak learners are built sequentially using weighted distributions of the data, based on the error rates.

Ensemble methods use multiple models to obtain better performance than any single constituent model. The aim is to not only build diverse and robust models, but also to work within limitations such as processing speed and return times. When working with large datasets and quick response times, this can be a significant developmental bottleneck. Troubleshooting and diagnostics are important aspects of working with all machine learning models, but they are especially important when dealing with models that might take days to run. The types of machine learning ensembles that can be created are as diverse as the models themselves, and the main considerations revolve around three things: how we divide our data, how we select the models, and the methods we use to combine their results. This simplistic statement actually encompasses a very large and diverse space.

Neural nets

When we approach the problem of trying to mimic the brain, we are faced with a number of difficulties. Considering all the different things the brain does, we might first think that it consists of a number of different algorithms, each specialized to do a particular task, and each hard-wired into a different part of the brain. This approach translates to considering the brain as a number of subsystems, each with its own program and task. For example, the auditory cortex for perceiving sound has its own algorithm that, say, does a Fourier transform on an incoming sound wave to detect the pitch. The visual cortex, on the other hand, has its own distinct algorithm for decoding the signals from the optic nerve and translating them into the sensation of sight. There is, however, growing evidence that the brain does not function like this at all. It appears, from biological studies, that brain tissue in different parts of the brain can relearn how to interpret inputs. So, rather than consisting of specialized subsystems that are programmed to perform specific tasks, the brain uses the same algorithm to learn different tasks. This single-algorithm approach has many advantages, not least of which is that it is relatively easy to implement. It also means that we can create generalized models and then train them to perform specialized tasks.
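As a rough illustration of this "one general model, many tasks" idea (a sketch assuming scikit-learn is available; the datasets are just convenient built-in examples), the very same multi-layer perceptron class can be trained on two unrelated problems simply by feeding it different data:

from sklearn.datasets import load_digits, load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

def train_and_score(X, y):
    """Fit the same generic network architecture on whatever task we are given."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    net = MLPClassifier(hidden_layer_sizes=(50,), max_iter=1000, random_state=0)
    net.fit(X_tr, y_tr)
    return net.score(X_te, y_te)

# Task 1: recognizing handwritten digits.
digits = load_digits()
print('digits accuracy:', train_and_score(digits.data, digits.target))

# Task 2: a completely different domain, tumor classification.
cancer = load_breast_cancer()
print('cancer accuracy:', train_and_score(cancer.data, cancer.target))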
As in real brains, using a single algorithm to describe how each neuron communicates with the neurons around it allows artificial neural networks to be adaptable and to carry out multiple higher-level tasks. Much of the most important work being done with neural net models, and indeed machine learning in general, is through the use of very complex neural nets with many layers and features. This approach is often called deep architecture or deep learning.

Human and animal learning occurs at a rate and depth that no machine can match. Many of the elements of biological learning still remain a mystery. One of the key areas of research, and one of the most useful in application, is that of object recognition. This is something quite fundamental to living systems, and higher animals have evolved an extraordinary ability to learn complex relationships between objects. Biological brains have many layers; each synaptic event exists in a long chain of synaptic processes. In order to recognize complex objects, such as people's faces or handwritten digits, a fundamental task is to create a hierarchy of representations from the raw input to higher and higher levels of abstraction. The goal is to transform raw data, such as a set of pixel values, into something that we can describe as, say, a person riding a bicycle.

Resources for Article:

Further resources on this subject:
- Python Data Structures [article]
- Exception Handling in MySQL for Python [article]
- Python Data Analysis Utilities [article]

Integrating with Objective-C

Packt
01 Apr 2016
11 min read
In this article written by Kyle Begeman author of the book Swift 2 Cookbook, we will cover the following recipes: Porting your code from one language to another Replacing the user interface classes Upgrading the app delegate Introduction Swift 2 is out, and we can see that it is going to replace Objective-C on iOS development sooner or later, however how should you migrate your Objective-C app? Is it necessary to rewrite everything again? Of course you don't have to rewrite a whole application in Swift from scratch, you can gradually migrate it. Imagine a four years app developed by 10 developers, it would take a long time to be rewritten. Actually, you've already seen that some of the codes we've used in this book have some kind of "old Objective-C fashion". The reason is that not even Apple computers could migrate the whole Objective-C code into Swift. (For more resources related to this topic, see here.) Porting your code from one language to another In the previous recipe we learned how to add a new code into an existing Objective-C project, however you shouldn't only add new code but also, as far as possible, you should migrate your old code to the new Swift language. If you would like to keep your application core on Objective-C that's ok, but remember that new features are going to be added on Swift and it will be difficult keeping two languages on the same project. In this recipe we are going to port part of the code, which is written in Objective-C to Swift. Getting ready Make a copy of the previous recipe, if you are using any version control it's a good time for committing your changes. How to do it… Open the project and add a new file called Setup.swift, here we are going to add a new class with the same name (Setup): class Setup { class func generate() -> [Car]{ var result = [Car]() for distance in [1.2, 0.5, 5.0] { var car = Car() car.distance = Float(distance) result.append(car) } var car = Car() car.distance = 4 var van = Van() van.distance = 3.8 result += [car, van] return result } } Now that we have this car array generator we can call it on the viewDidLoad method replacing the previous code: - (void)viewDidLoad { [super viewDidLoad]; vehicles = [Setup generate]; [self->tableView reloadData]; } Again press play and check that the application is still working. How it works… The reason we had to create a class instead of creating a function is that you can only export to Objective-C classes, protocols, properties, and subscripts. Bear that in mind in case of developing with the two languages. If you would like to export a class to Objective-C you have two choices, the first one is inheriting from NSObject and the other one is adding the @objc attribute before your class, protocol, property, or subscript. If you paid attention, our method returns a Swift array but it was converted to an NSArray, but as you might know, they are different kinds of array. Firstly, because Swift arrays are mutable and NSArray are not, and the other reason is that their methods are different. Can we use NSArray in Swift? The answer is yes, but I would recommend avoiding it, imagine once finished migrating to Swift your code still follows the old way, it would be another migration. 
There's more…

Migrating from Objective-C is something that you should do with care. Don't try to change the whole application at once, and remember that some Swift objects behave differently from their Objective-C counterparts; for example, dictionaries in Swift have the key and the value types specified, but in Objective-C they can be of any type.

Replacing the user interface classes

At this moment you know how to migrate the model part of an application; however, in real life we also have to replace the graphical classes. Doing it is not complicated, but there are quite a few details to take care of.

Getting ready

Continuing with the previous recipe, make a copy of it, or if you are using version control just commit the changes you have, and let's continue with our migration.

How to do it…

First create a new file called MainViewController.swift and start by importing UIKit: import UIKit The next step is creating a class called MainViewController; this class must inherit from UIViewController and implement the protocols UITableViewDataSource and UITableViewDelegate: class MainViewController: UIViewController, UITableViewDataSource, UITableViewDelegate {  Then, add the attributes we had in the previous view controller, keeping the same names you used before: private var vehicles = [Car]() @IBOutlet var tableView:UITableView! Next, we need to implement the methods; let's start with the table view data source methods: func tableView(tableView: UITableView, numberOfRowsInSection section: Int) -> Int { return vehicles.count } func tableView(tableView: UITableView, cellForRowAtIndexPath indexPath: NSIndexPath) -> UITableViewCell { var cell:UITableViewCell? = self.tableView.dequeueReusableCellWithIdentifier("vehiclecell") if cell == nil { cell = UITableViewCell(style: .Subtitle, reuseIdentifier: "vehiclecell") } var currentCar = self.vehicles[indexPath.row] cell!.textLabel?.numberOfLines = 1 cell!.textLabel?.text = "Distance \(currentCar.distance * 1000) meters" var detailText = "Pax: \(currentCar.pax) Fare: \(currentCar.fare)" if currentCar is Van { detailText += ", Volume: \((currentCar as Van).capacity)" } cell!.detailTextLabel?.text = detailText cell!.imageView?.image = currentCar.image return cell! } Note that this conversion is not 100% equivalent; the fare, for example, isn't going to be shown with two digits of precision. There is an explanation later of why we are not going to fix this now.  The next step is adding the event; in this case, we have to handle the action performed when the user selects a car: func tableView(tableView: UITableView, willSelectRowAtIndexPath indexPath: NSIndexPath) -> NSIndexPath? { var currentCar = self.vehicles[indexPath.row] var time = currentCar.distance / 50.0 * 60.0 UIAlertView(title: "Car booked", message: "The car will arrive in \(time) minutes", delegate: nil, cancelButtonTitle: "OK").show() return indexPath } As you can see, we only need one more step to complete our code; in this case it's viewDidLoad. Note that another difference between Objective-C and Swift is that in Swift you have to specify that you are overriding an existing method: override func viewDidLoad() { super.viewDidLoad() vehicles = Setup.generate() self.tableView.reloadData() } } // end of class Our code is complete, but of course our application is still using the old code.
To complete this operation, click on the storyboard, if the document outline isn't being displayed, click on the Editor menu and then on Show Document Outline: Now that you can see the document outline, click on View Controller that appears with a yellow circle with a square inside: Then on the right-hand side, click on the identity inspector, next go to the custom class and change the value of the class from ViewController to MainViewController. After that, press play and check that your application is running, select a car and check that it is working. Be sure that it is working with your new Swift class by paying attention on the fare value, which in this case isn't shown with two digits of precision. Is everything done? I would say no, it's a good time to commit your changes. Lastly, delete the original Objective-C files, because you won't need them anymore. How it works… As you can see, it's not so hard replacing an old view controller with a Swift one, the first thing you need to do is create a new view controller class with its protocols. Keep the same names you had on your old code for attributes and methods that are linked as IBActions, it will make the switch very straightforward otherwise you will have to link again. Bear in mind that you need to be sure that your changes are applied and that they are working, but sometimes it is a good idea to have something different, otherwise your application can be using the old Objective-C and you didn't realize it. Try to modernize our code using the Swift way instead of the old Objective-C style, for example, nowadays it's preferable using interpolation rather than using stringWithFormat. We also learned that you don't need to relink any action or outlet if you keep the same name. If you want to change the name of anything you might first keep its original name, test your app, and after that you can refactor it following the traditional factoring steps. Don't delete the original Objective-C files until you are sure that the equivalent Swift file is working on every functionality. There's more… This application had only one view controller, however applications usually have more than one view controller. In this case, the best way you can update them is one by one instead of all of them at the same time. Upgrading the app delegate As you know there is an object that controls the events of an application, which is called application delegate. Usually you shouldn't have much code here, but a few of them you might have. For example, you may deactivate the camera or the GPS requests when your application goes to the background and reactivate them when the app returns active. Certainly it is a good idea to update this file even if you don't have any new code on it, so it won't be a problem in the future. Getting ready If you are using the version control system, commit your changes from the last recipe or if you prefer just copy your application. How to do it… Open the previous application recipe and create a new Swift file called ApplicationDelegate.swift, then you can create a class with the same name. As in our previous class we didn't have any code on the application delegate, so we can differentiate it by printing on the log console. So add this traditional application delegate on your Swift file: class ApplicationDelegate: UIResponder, UIApplicationDelegate { var window: UIWindow? func application(application: UIApplication, didFinishLaunchingWithOptions launchOptions: [NSObject: AnyObject]?) 
-> Bool { print("didFinishLaunchingWithOptions") return true } func applicationWillResignActive(application: UIApplication) { print("applicationWillResignActive") } func applicationDidEnterBackground(application: UIApplication) { print("applicationDidEnterBackground") } func applicationWillEnterForeground(application: UIApplication) { print("applicationWillEnterForeground") } func applicationDidBecomeActive(application: UIApplication) { print("applicationDidBecomeActive") } func applicationWillTerminate(application: UIApplication) { print("applicationWillTerminate") } } Now go to your project navigator and expand the Supporting Files group, after that click on the main.m file. In this file we are going to import the magic file, the Swift header file: #import "Chapter_8_Vehicles-Swift.h" After that we have to specify that the application delegate is the new class we have, so replace the AppDelegate class on the UIApplicationMain call with ApplicationDelegate. Your main function should be like this: int main(int argc, char * argv[]) { @autoreleasepool { return UIApplicationMain(argc, argv, nil, NSStringFromClass([ApplicationDelegate class])); } } It's time to press play and check whether the application is working or not. Press the home button or the combination shift + command + H if you are using the simulator and again open your application. Have a look that you have some messages on your log console. Now that you are sure that your Swift code is working, remove the original app delegate and its importation on the main.m. Test your app just in case. You could consider that we had finished this part, but actually we still have another step to do: removing the main.m file. Now it is very easy, just click on the ApplicationDelegate.swift file and before the class declaration add the attribute @UIApplicationMain, then right click on the main.h and choose to delete it. Test it and your application is done. How it works… The application delegate class has always been specified at the starting of an application. In Objective-C, it follows the C start point, which is a function called main. In iOS, you can specify the class that you want to use as an application delegate. If you program for OS X the procedure is different, you have to go to your nib file and change its class name to the new one. Why did we have to change the main function and then eliminate it? The reason is that you should avoid massive changes, if something goes wrong you won't know the step where you failed, so probably you will have to rollback everything again. If you do your migration step by step ensuring that it is still working, it means that in case of finding an error, it will be easier to solve it. Avoid doing massive changes on your project, changing step by step will be easier to solve issues. There's more… In this recipe, we learned the last steps of how to migrate an app from Objective-C to Swift code, however we have to remember that programming is not only about applications, you can also have a framework. In the next recipe, we are going to learn how to create your own framework compatible with Swift and Objective-C. Summary This article shows you how Swift and Objective-C can live together and give you a step-by-step guide on how to migrate your Objective-C app to Swift. Resources for Article: Further resources on this subject: Concurrency and Parallelism with Swift 2 [article] Swift for Open Source Developers [article] Your First Swift 2 Project [article]

Testing and Debugging Distributed Applications

Packt
01 Apr 2016
21 min read
In this article by Francesco Pierfederici, author of the book Distributed Computing with Python, the author states that "distributed systems, both large and small, can be extremely challenging to test and debug, as they are spread over a network, run on computers that can be quite different from each other, and might even be physically located in different continents altogether". Moreover, the computers we use could have different user accounts, different disks with different software packages, different hardware resources, and very uneven performance. Some can even be in a different time zone. Developers of distributed systems need to consider all these pieces of information when trying to foresee failure conditions. Operators have to work around all of these challenges when debugging errors. (For more resources related to this topic, see here.)

The big picture

Testing and debugging monolithic applications is not simple, as every developer knows. However, there are a number of tools that dramatically make the task easier, including the pdb debugger, various profilers (notable mentions include cProfile and line_profile), linters, static code analysis tools, and a host of test frameworks, a number of which have been included in the standard library of Python 3.3 and higher. The challenge with distributed applications is that most of the tools and packages that we can use for single-process applications lose much of their power when dealing with multiple processes, especially when these processes run on different computers. Debugging and profiling distributed applications written in C, C++, and Fortran can be done with tools such as Intel VTune, Allinea MAP, and DDT. Unfortunately, Python developers are left with very few or no options for the time being. Writing small- or medium-sized distributed systems is not terribly hard, as we saw in the pages so far. The main difference between writing monolithic programs and distributed applications is the large number of interdependent components running on remote hardware. This is what makes monitoring and debugging distributed code harder and less convenient. Fortunately, we can still use all familiar debuggers and code analysis tools on our Python distributed applications. Unfortunately, these tools will only go so far, to the point that we will have to rely on old-fashioned logging and print statements to get the full picture of what went wrong.

Common problems – clocks and time

Time is a handy variable to use. For instance, using timestamps is very natural when we want to join different streams of data, sort database records, and, in general, reconstruct the timeline for a series of events, which we oftentimes observe out of order. In addition, some tools (for example, GNU make) rely solely on file modification time and are easily confused by a clock skew between machines. For these reasons, clock synchronization among all the computers and systems we use is very important. If our computers are in different time zones, we might want to not only synchronize their clocks but also set them to Coordinated Universal Time (UTC) for simplicity. In all cases where changing clocks to UTC is not possible, good advice is to always process time in UTC within our code and to only convert to local time for display purposes. In general, clock synchronization in distributed systems is a fascinating and complex topic, and it is out of the scope of this article.
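As a small illustration of the "work in UTC, convert only for display" advice above, a sketch along these lines keeps all internal timestamps unambiguous:

from datetime import datetime, timezone

# Store and compare timestamps in UTC throughout the application...
event_time = datetime.now(timezone.utc)
print('stored:', event_time.isoformat())

# ...and convert to the local time zone only when showing them to a user.
print('displayed:', event_time.astimezone().strftime('%Y-%m-%d %H:%M:%S %Z'))

# Naive local timestamps are ambiguous; comparing them across hosts in
# different time zones (or across DST changes) silently gives wrong answers.
naive = datetime.now()
print('ambiguous:', naive.isoformat())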
Most readers, luckily, are likely to be well served by the Network Time Protocol (NTP), which is a perfectly fine synchronization solution for most situations. Most modern operating systems, including Windows, Mac OS X, and Linux, have great support for NTP. Another thing to consider when talking about time is the timing of periodic actions, such as polling loops or cronjobs. Many applications need to spawn processes or perform actions (for example, sending a confirmation e-mail or checking whether new data is available) at regular intervals. A common pattern is to set up timers (either in our code or via the tools provided by the OS) and have all these timers go off at some time, usually at a specific hour and at regular intervals after that. The risk of this approach is that we can overload the system the very moment all these processes wake up and start their work. A surprisingly common example would be starting a significant number of processes that all need to read some configuration or data file from a shared disk. In these cases, everything goes fine until the number of processes becomes so large that the shared disk cannot handle the data transfer, thus slowing our application to a crawl. A common solution is the staggering of these timers in order to spread them out over a longer time interval. In general, since we do not always control all code that we use, it is good practice to start our timers at some random number of minutes past the hour, just to be safe. Another example of this situation would be an image-processing service that periodically polls a set of directories looking for new data. When new images are found, they are copied to a staging area, renamed, scaled, and potentially converted to a common format before being archived for later use. If we're not careful, it would be easy to overload the system if many images were to be uploaded at once. A better approach would be to throttle our application (maybe using a queue-based architecture) so that it would only start an appropriately small number of image processors so as to not flood the system. Common problems – software environments Another common challenge is making sure that the software installed on all the various machines we are ever going to use is consistent and consistently upgraded. Unfortunately, it is frustratingly common to spend hours debugging a distributed application only to discover that for some unknown and seemingly impossible reason, some computers had an old version of the code and/or its dependencies. Sometimes, we might even find the code to have disappeared completely. The reasons for these discrepancies can be many: from a mount point that failed to a bug in our deployment procedures to a simple human mistake. A common approach, especially in the HPC world, is to always create a self-contained environment for our code before launching the application itself. Some projects go as far as preferring static linking of all dependencies to avoid having the runtime pick up the wrong version of a dynamic library. This approach works well if the application runtime is long compared to the time it takes to build its full environment, all of its software dependencies, and the application itself. It is not that practical otherwise. Python, fortunately, has the ability to create self-contained virtual environments. There are two related tools that we can use: pyvenv (available as part of the Python 3.5 standard library) and virtualenv (available in PyPI). 
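For instance, a deployment script can build such a self-contained environment programmatically with the standard library's venv module. The following is a rough sketch; the environment directory and the requirements file name are placeholders:

import os
import subprocess
import venv

ENV_DIR = '/opt/myapp/env'          # hypothetical location for the environment

# Create an isolated environment with its own interpreter and pip.
venv.create(ENV_DIR, with_pip=True)

# Install the exact, pinned versions listed in requirements.txt into it.
env_python = os.path.join(ENV_DIR, 'bin', 'python')
subprocess.check_call([env_python, '-m', 'pip', 'install',
                       '-r', 'requirements.txt'])

# Record what actually got installed, for later debugging.
subprocess.check_call([env_python, '-m', 'pip', 'freeze'])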
Additionally, pip, the Python package management system, allows us to specify the exact version of each package we want to install in a requirements file. These tools, when used together, permit reasonable control over the execution environment. However, the devil, as it is often said, is in the details, and different computer nodes might use the exact same Python virtual environment but incompatible versions of some external library. In this respect, container technologies such as Docker (https://www.docker.com) and, in general, version-controlled virtual machines are promising ways out of the software runtime environment maelstrom in those environments where they can be used. In all other cases (HPC clusters come to mind), the best approach will probably be to not rely on the system software and to manage our own environments and the full software stack.

Common problems – permissions and environments

Different computers might run our code under different user accounts, and our application might expect to be able to read a file or write data into a specific directory and hit an unexpected permission error. Even in cases where the user accounts used by our code are all the same (down to the same user ID and group ID), their environment may be different on different hosts. Therefore, an environment variable we assumed to be defined might not be or, even worse, might be set to an incompatible value. These problems are common when our code runs as a special, unprivileged user such as nobody. Defensive coding, especially when accessing the environment, and making sure to always fall back to sensible defaults when variables are undefined (that is, value = os.environ.get('SOME_VAR', fallback_value) instead of simply value = os.environ['SOME_VAR']) is often necessary. A common approach, when this is possible, is to only run our applications under a specific user account that we control and to specify the full set of environment variables our code needs in the deployment and application startup scripts (which will have to be version controlled as well). Some systems, however, not only execute jobs under extremely limited user accounts, but they also restrict code execution to temporary sandboxes. In many cases, access to the outside network is also blocked. In these situations, one might have no other choice but to set up the full environment locally and copy it to a shared disk partition. Other data can be served from custom-built servers running as ancillary jobs just for this purpose. In general, permission problems and user environment mismatches are very similar to problems with the software environment and should be tackled in concert. Often times, developers find themselves wanting to isolate their code from the system as much as possible and create a small, but self-contained, environment with all the code and all the environment variables they need.

Common problems – the availability of hardware resources

The hardware resources that our application needs might or might not be available at any given point in time. Moreover, even if some resources were to be available at some point in time, nothing guarantees that they will stay available for much longer. One problem we can face related to this is network glitches, which are quite common in many environments (especially for mobile apps) and which, for most practical purposes, are indistinguishable from machine or application crashes.
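Since transient network failures look just like crashes, one common defensive pattern (shown here as a rough sketch using only the standard library; the URL and retry counts are made up) is to wrap remote calls with a timeout and retry them with exponential backoff:

import time
import urllib.request
import urllib.error

def fetch_with_retries(url, retries=3, timeout=5, backoff=2.0):
    """Fetch a URL, retrying on transient network errors with exponential backoff."""
    delay = 1.0
    for attempt in range(1, retries + 1):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return response.read()
        except (urllib.error.URLError, OSError) as exc:
            # Log enough context to tell a glitch from a real outage later.
            print('attempt {} failed: {}'.format(attempt, exc))
            if attempt == retries:
                raise
            time.sleep(delay)
            delay *= backoff

data = fetch_with_retries('http://example.com/status')   # hypothetical endpoint
print(len(data))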
Applications using a distributed computing framework or job scheduler can often rely on the framework itself to handle at least some common failure scenarios. Some job schedulers will even resubmit our jobs in case of errors or sudden machine unavailability. Complex applications, however, might need to implement their own strategies to deal with hardware failures. In some cases, the best strategy is to simply restart the application when the necessary resources are available again. Other times, restarting from scratch would be cost prohibitive. In these cases, a common approach is to implement application checkpointing. What this means is that the application both writes its state to a disk periodically and is able to bootstrap itself from a previously saved state.

In implementing a checkpointing strategy, you need to balance the convenience of being able to restart an application midway with the performance hit of writing its state to a disk. Another consideration is the increase in code complexity, especially when many processes or threads are involved in reading and writing state information. A good rule of thumb is that data or results that can be recreated easily and quickly do not warrant the implementation of application checkpointing. If, on the other hand, some processing requires a significant amount of time and one cannot afford to waste it, then application checkpointing might be in order. For example, climate simulations can easily run for several weeks or months at a time. In these cases, it is important to checkpoint them every hour or so, as restarting from the beginning after a crash would be expensive. On the other hand, a process that takes an uploaded image and creates a thumbnail for, say, a web gallery runs quickly and is not normally worth checkpointing. To be safe, a state should always be written and updated atomically (for example, by writing to a temporary file and replacing the original only after the write completes successfully). The last thing we want is to restart from a corrupted state!

Very familiar to HPC users, as well as to users of AWS spot instances, is the situation where a fraction, or the entirety, of the processes of our application are evicted from the machines that they are running on. When this happens, a warning is typically sent to our processes (usually, a SIGQUIT signal) and, after a few seconds, they are unceremoniously killed (via a SIGKILL signal). For AWS spot instances, the time of termination is available through a web service in the instance metadata. In either case, our applications are given some time to save their state and quit in an orderly fashion. Python has powerful facilities to catch and handle signals (refer to the signal module). For example, the following simple script shows how we can implement a bare-bones checkpointing strategy in our application:

#!/usr/bin/env python3.5
"""
Simple example showing how to catch signals in Python
"""
import json
import os
import signal
import sys


# Path to the file we use to store state. Note that we assume
# $HOME to be defined, which is far from being an obvious
# assumption!
STATE_FILE = os.path.join(os.environ['HOME'],
                          '.checkpoint.json')


class Checkpointer:
    def __init__(self, state_path=STATE_FILE):
        """
        Read the state file, if present, and initialize from that.
        """
        self.state = {}
        self.state_path = state_path
        if os.path.exists(self.state_path):
            with open(self.state_path) as f:
                self.state.update(json.load(f))
        return

    def save(self):
        print('Saving state: {}'.format(self.state))
        with open(self.state_path, 'w') as f:
            json.dump(self.state, f)
        return

    def eviction_handler(self, signum, frame):
        """
        This is the function that gets called when a signal is trapped.
        """
        self.save()

        # Of course, using sys.exit is a bit brutal. We can do better.
        print('Quitting')
        sys.exit(0)
        return


if __name__ == '__main__':
    import time

    print('This is process {}'.format(os.getpid()))

    ckp = Checkpointer()
    print('Initial state: {}'.format(ckp.state))

    # Catch SIGQUIT
    signal.signal(signal.SIGQUIT, ckp.eviction_handler)

    # Get a value from the state.
    i = ckp.state.get('i', 0)
    try:
        while True:
            i += 1
            ckp.state['i'] = i
            print('Updated in-memory state: {}'.format(ckp.state))
            time.sleep(1)
    except KeyboardInterrupt:
        ckp.save()

If we run the preceding script in a terminal window and then, from another terminal window, send it a SIGQUIT signal (for example, via kill -s SIGQUIT <process id>), we see the checkpointing in action, as the following screenshot illustrates:

A common situation in distributed applications is that of being forced to run code in potentially heterogeneous environments: machines (real or virtual) of different performance, with different hardware resources (for example, with or without GPUs), and potentially different software environments (as we mentioned already). Even in the presence of a job scheduler to help us choose the right software and hardware environment, we should always log the full environment as well as the performance of each execution machine. In advanced architectures, these performance metrics can be used to improve the efficiency of job scheduling. PBS Pro, for instance, takes into consideration the historical performance figures of each job being submitted to decide where to execute it next. HTCondor continuously benchmarks each machine and makes those figures available for node selection and ranking.

Perhaps the most frustrating cases are those where, either due to the network itself or due to servers being overloaded, network requests take so long that our code hits its internal timeouts. This might lead us to believe that the counterpart service is not available. These bugs, especially when transient, can be quite hard to debug.

Challenges – the development environment

Another common challenge in distributed systems is the setup of a representative development and testing environment, especially for individuals or small teams. Ideally, in fact, the development environment should be identical to the worst-case scenario deployment environment. It should allow developers to test common failure scenarios, such as a disk filling up, varying network latencies, intermittent network connections, hardware and software failures, and so on—all things that are bound to happen in a real deployment, sooner or later. Large teams have the resources to set up development and test clusters, and they almost always have dedicated software quality teams stress testing our code.
Small teams, unfortunately, often find themselves forced to write code on their laptops and use a very simplified (and best-case scenario!) environment made up by two or three virtual machines running on the laptops themselves to emulate the real system. This pragmatic solution works and is definitely better than nothing. However, we should remember that virtual machines running on the same host exhibit unrealistically high-availability and low-network latencies. In addition, nobody will accidentally upgrade them without us knowing or image them with the wrong operating system. The environment is simply too controlled and stable to be realistic. A step closer to a realistic setup would be to create a small development cluster on, say, AWS using the same VM images, with the same software stack and user accounts that we are going to use in production. All things said, there is simply no replacement for the real thing. For cloud-based applications, it is worth our while to at least test our code on a smaller version of the deployment setup. For HPC applications, we should be using either a test cluster, a partition of the operational cluster, or a test queue for development and testing. Ideally, we would develop on an exact clone of the operational system. Cost consideration and ease of development will constantly push us to the multiple-VMs-on-a-laptop solution; it is simple, essentially free, and it works without an Internet connection, which is an important point. We should, however, keep in mind that distributed applications are not impossibly hard to write; they just have more failure modes than their monolithic counterparts do. Some of these failure modes (especially those related to data access patterns) typically require a careful choice of architecture. Correcting architectural choices dictated by false assumptions later on in the development stage can be costly. Convincing managers to give us the hardware resources that we need early on is usually difficult. In the end, this is a delicate balancing act. A useful strategy – logging everything Often times, logging is like taking backups or eating vegetables—we all know we should do it, but most of us forget. In distributed applications, we simply have no other choice—logging is essential. Not only that, logging everything is essential. With many different processes running on potentially ephemeral remote resources at difficult-to-predict times, the only way to understand what happens is to have logging information and have it readily available and in an easily searchable format/system. At the bare minimum, we should log process startup and exit time, exit code and exceptions (if any), all input arguments, all outputs, the full execution environment, the name and IP of the execution host, the current working directory, the user account as well as the full application configuration, and all software versions. The idea is that if something goes wrong, we should be able to use this information to log onto the same machine (if still available), go to the same directory, and reproduce exactly what our code was doing. Of course, being able to exactly reproduce the execution environment might simply not be possible (often times, because it requires administrator privileges). However, we should always aim to be able to recreate a good approximation of that environment. This is where job schedulers really shine; they allow us to choose a specific machine and specify the full job environment, which makes replicating failures easier. 
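A minimal sketch of this "log everything at startup" advice, using only the standard library (the logger name and the exact fields are just an example, not a prescribed format), might look like this:

import getpass
import logging
import os
import socket
import sys

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s %(levelname)s %(name)s %(message)s')
log = logging.getLogger('myapp')     # hypothetical application name

def log_startup_environment():
    """Record where, how, and by whom this process was started."""
    log.info('host=%s pid=%d user=%s', socket.gethostname(), os.getpid(),
             getpass.getuser())
    log.info('cwd=%s argv=%r', os.getcwd(), sys.argv)
    log.info('python=%s %s', sys.executable, sys.version.split()[0])
    log.info('environment=%r', dict(os.environ))

log_startup_environment()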
Logging software versions (not only the version of the Python interpreter, but also the version of all the packages used) helps diagnose outdated software stacks on remote machines. The Python package manager, pip, makes getting the list of installed packages easy: import pip; pip.main(['list']). Whereas, import sys; print(sys.executable, sys.version_info) displays the location and version of the interpreter. It is also useful to create a system whereby all our classes and function calls emit logging messages with the same level detail and at the same points in the object life cycle. Common approaches involve the use of decorators and, maybe a bit too esoteric for some, metaclasses. This is exactly what the autologging Python module (available on PyPI) does for us. Once logging is in place, we face the questions where to store all these logging messages and whose traffic could be substantial for high verbosity levels in large applications. Simple installations will probably want to write log messages to text files on a disk. More complex applications might want to store these messages in a database (which can be done by creating a custom handler for the Python logging module) or in specialized log aggregators such as Sentry (https://getsentry.com). Closely related to logging is the issue of monitoring. Distributed applications can have many moving parts, and it is often essential to know which machines are up, which are busy, as well as which processes or jobs are currently running, waiting, or in an error state. Knowing which processes are taking longer than usual to complete their work is often an important warning sign that something might be wrong. Several monitoring solutions for Python (often times, integrated with our logging system) exist. The Celery project, for instance, recommends flower (http://flower.readthedocs.org) as a monitoring and control web application. HPC job schedulers, on the other hand, tend to lack common, general-purpose, monitoring solutions that go beyond simple command-line clients. Monitoring comes in handy in discovering potential problems before they become serious. It is in fact useful to monitor resources such as available disk space and trigger actions or even simple warning e-mails when they fall under a given threshold. Many centers monitor hardware performance and hard drive SMART data to detect early signs of potential problems. These issues are more likely to be of interest to operations personnel rather than developers, but they are useful to keep in mind. They can also be integrated in our applications to implement strategies in order to handle performance degradations gracefully. A useful strategy – simulating components A good, although possibly expensive in terms of time and effort, test strategy is to simulate some or all of the components of our system. The reasons are multiple; on one hand, simulating or mocking software components allows us to test our interfaces to them more directly. In this respect, mock testing libraries, such as unittest.mock (part of the Python 3.5 standard library), are truly useful. Another reason to simulate software components is to make them fail or misbehave on demand and see how our application responds. For instance, we could increase the response time of services such as REST APIs or databases to worst-case scenario levels and see what happens. Sometimes, we might exceed timeout values in some network calls leading our application to incorrectly assume that the sever has crashed. 
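The following rough sketch (the service interface is invented purely for illustration) uses unittest.mock to simulate exactly this kind of misbehaving dependency, so that we can test how our code reacts to a slow or failing remote service:

import time
from unittest import mock

def check_server(client):
    """Return True if the remote service answers quickly enough."""
    try:
        client.ping(timeout=2)           # hypothetical client API
        return True
    except TimeoutError:
        return False

# Simulate a healthy service...
fast_client = mock.Mock()
fast_client.ping.return_value = 'pong'
assert check_server(fast_client) is True

# ...and one that is so overloaded it times out.
slow_client = mock.Mock()
slow_client.ping.side_effect = TimeoutError('server too slow')
assert check_server(slow_client) is False

# We can also emulate high latency rather than outright failure.
laggy_client = mock.Mock()
laggy_client.ping.side_effect = lambda timeout: time.sleep(timeout + 1)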
Especially early on in the design and development of a complex distributed application, one can make overly optimistic assumptions about things such as network availability and performance or response time of services such as databases or web servers. For this reason, having the ability to either completely bring a service offline or, more subtly, modify its behavior can tell us a lot about which of the assumptions in our code might be overly optimistic. The Netflix Chaos Monkey (https://github.com/Netflix/SimianArmy) approach of disabling random components of our system to see how our application copes with failures can be quite useful. Summary Writing and running small- or medium-sized distributed applications in Python is not hard. There are many high-quality frameworks that we can leverage among others, for example, Celery, Pyro, various job schedulers, Twisted, MPI bindings, or the multiprocessing module in the standard library. The real difficulty, however, lies in monitoring and debugging our applications, especially because a large fraction of our code runs concurrently on many different, often remote, computers. The most insidious bugs are those that end up producing incorrect results (for example, because of data becoming corrupted along the way) rather than raising an exception, which most frameworks are able to catch and bubble up. The monitoring and debugging tools that we can use with Python code are, sadly, not as sophisticated as the frameworks and libraries we use to develop that same code. The consequence is that large teams end up developing their own, often times, very specialized distributed debugging systems from scratch and small teams mostly rely on log messages and print statements. More work is needed in the area of debuggers for distributed applications in general and for dynamic languages such as Python in particular. Resources for Article: Further resources on this subject: Python Data Structures [article] Python LDAP applications - extra LDAP operations and the LDAP URL library [article] Machine Learning Tasks [article]

Launching a Spark Cluster

Packt
31 Mar 2016
7 min read
 In this article by Omar Khedher, author of OpenStack Sahara Essentials we will use Sahara to create and launch a Spark cluster. Sahara provides several plugins to provision Hadoop clusters on top of OpenStack. We will be using Spark plugins to provision Apache Spark clusters using Horizon. (For more resources related to this topic, see here.) General settings The following diagram illustrates our Spark cluster topology, which includes: One Spark master node: This runs the Spark Master and the HDFS NameNode Three Spark slave nodes: These run a Spark Slave and an HDFS DataNode each Preparing the Spark image The following link provides several Sahara images available for download for different plugins: http://sahara-files.mirantis.com/images/upstream/liberty. Note that the upstream Sahara image files are destined for the OpenStack Liberty release. From Horizon, click on Compute and select Images, click on Create Image and add the new image, as shown here: We will need to upload the downloaded image to Glance so that it can be registered in the Sahara image registry catalog. Make sure that the new image is active. Click on the Data Processing tab and select Image Registry. Click on Register Image to register the new uploaded Glance image to Sahara, as shown here: Click on Done and the new Spark image is ready to start launching the Spark cluster. Creating the Spark master group template Node group templates in Sahara facilitate the configuration of a set of instances that have same properties, such as RAM and CPU. We will start by creating the first node group template for the Spark master. From the Data Processing tab, select Node Group Templates and click on Create Template. Our first node group template will be based on Apache Spark with Version 1.3.1, as shown here: The next wizard will guide to specifying the name of the template, the instance flavor, the storage location, and which floating IP pool will be assigned to the cluster instance: The next tab in same wizard will guide you to selecting which kind of process the nodes in the cluster will run. In our case, the Spark master node group template will include Spark master and HDFS namenode processes, as shown here: The next tab in the wizard exposes more choices regarding the security groups that will be applied for the template cluster nodes: Auto security group: This will automatically create a set of security groups that will be directly applied to the instances of the node group template Default security group: Any existing security groups in the OpenStack environment configured as default will be applied the instances of the node group template The last tab in the wizard exposes more specific HDFS configuration that depend on the available resources of the cluster, such as disk space, CPU and memory: dfs.datanode.handler.count: How many server threads there are for the datanode dfs.datanode.du.reserved: How much of the available disk space will not be taken into account for HDFS use dfs.namenode.handler.count: How many server threads there are for the namenode dfs.datanode.failed.volumes.tolerated: How many volumes are allowed to fail before a datanode instance stops dfs.datanode.max.xcievers: What is the maximum number of threads to be used in order to transfer data to/from the DataNode instance. 
Name Node Heap Size: How much memory will be assigned to the heap size to handle workload per NameNode instance Data Node Heap Size: How much memory will be assigned to the heap size to handle workload per DataNode instance Creating the Spark slave group template Creating the Spark slave group template will be performed in the same way as the Spark master group template except the assignment of the node processes. The Spark slave nodes will be running Spark slave and HDFS datanode processes, as shown here: Security groups and HDFS parameters can be configured the same as the Spark master node group template. Creating the Spark cluster template Now that we have defined the basic templates for the Spark cluster, we will need to compile both entities into one cluster template. In the Sahara dashboard, select Cluster Templates and click on Create Template. Select Apache Spark as the Plugin name, with version 1.3.1, as follows: Give the cluster template a name and small description. It is also possible to mention which process in the Spark cluster will run in a different compute node for high-availability purposes. This is only valid when you have more than one compute node in the OpenStack environment. The next tab in the same wizard allows you to add the necessary number of Spark instances based on the node group templates created previously. In our case, we will use one master Spark instance and three slave Spark instances, as shown here: The next tab, General Parameters, provides more advanced cluster configuration, including the following: Timeout for disk preparing: The cluster will fail when the duration of formatting and mounting the disk per node exceeds the timeout value. Enable NTP service: This option will enable all the instances of the cluster to synchronized time. An NTP file can be found under /tmp when cluster nodes are active. URL of NTP server: If mentioned, the Spark cluster will use the URL of the NTP server for time synchronization. Heat Wait Condition timeout: Heat will throw an error message to Sahara and the cluster will fail when a node is not able to boot up after a certain amount of time. This will prevent Sahara spawning instances indefinitely. Enable XFS: Allows XFS disk formatting. Decommissioning Timeout: This will throw an error when scaling data nodes in the Spark cluster takes more than the time mentioned. Enable Swift: Allows using Swift object storage to pull and push data during job execution. The Spark Parameters tab allows you to specify the following: Master webui port: Which port will access the Spark master web user interface. Work webui port: Which port will access the Spark slave web user interface. Worker memory: How much memory will be reserved for Spark applications. By default, if all is selected, Spark will use all the available RAM is the instance minus 1 GB. Spark will not run properly when using a flavor having RAM less than 1 GB. Launching the Spark cluster Based on the cluster template, the last step will require you to only push the button Launch Cluster from the Clusters tab in the Sahara dashboard. You will need only to select the plugin name, Apache Spark, with version 1.3.1. Next, you will need to name the new cluster, select the right cluster template created previously, and the base image registered in Sahara. Additionally, if you intend to access the cluster instances via SSH, select an existing SSH keypair. 
It is also possible to select from which network segment you will be able to manage the cluster instances; in our case, an existing private network, Private_Net10, will be used for this purpose. Launch the cluster; it will take a while to finish spawning the four instances forming the Spark cluster. The Spark cluster instances can be listed in the Compute Instances tab, as shown here:

Summary

In this article, we created a Spark cluster using Sahara in OpenStack by means of the Apache Spark plugin. The provisioned cluster includes one Spark master node and three Spark slave nodes. When the cluster status changes to the active state, it is possible to start executing jobs.

Resources for Article:

Further resources on this subject:
- Introducing OpenStack Trove [article]
- OpenStack Performance, Availability [article]
- Monitoring Physical Network Bandwidth Using OpenStack Ceilometer [article]

Why Mesos?

Packt
31 Mar 2016
8 min read
In this article by Dipa Dubhasi and Akhil Das authors of the book Mastering Mesos, delves into understanding the importance of Mesos. Apache Mesos is an open source, distributed cluster management software that came out of AMPLab, UC Berkeley in 2011. It abstracts CPU, memory, storage, and other computer resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to easily be built and run effectively. It is referred to as a metascheduler (scheduler of schedulers) and a "distributed systems kernel/distributed datacenter OS". It improves resource utilization, simplifies system administration, and supports a wide variety of distributed applications that can be deployed by leveraging its pluggable architecture. It is scalable and efficient and provides a host of features, such as resource isolation and high availability, which, along with a strong and vibrant open source community, makes this one of the most exciting projects. (For more resources related to this topic, see here.) Introduction to the datacenter OS and architecture of Mesos Over the past decade, datacenters have graduated from packing multiple applications into a single server box to having large datacenters that aggregate thousands of servers to serve as a massively distributed computing infrastructure. With the advent of virtualization, microservices, cluster computing, and hyper-scale infrastructure, the need of the hour is the creation of an application-centric enterprise that follows a software-defined datacenter strategy. Currently, server clusters are predominantly managed individually, which can be likened to having multiple operating systems on the PC, one each for processor, disk drive, and so on. With an abstraction model that treats these machines as individual entities being managed in isolation, the ability of the datacenter to effectively build and run distributed applications is greatly reduced. Another way of looking at the situation is comparing running applications in a datacenter to running them on a laptop. One major difference is that while launching a text editor or web browser, we are not required to check which memory modules are free and choose ones that suit our need. Herein lies the significance of a platform that acts like a host operating system and allows multiple users to run multiple applications simultaneously by utilizing a shared set of resources. Datacenters now run varied distributed application workloads, such as Spark, Hadoop, and so on, and need the capability to intelligently match resources and applications. The datacenter ecosystem today has to be equipped to manage and monitor resources and efficiently distribute workloads across a unified pool of resources with the agility and ease to cater to a diverse user base (noninfrastructure teams included). A datacenter OS brings to the table a comprehensive and sustainable approach to resource management and monitoring. This not only reduces the cost of ownership but also allows a flexible handling of resource requirements in a manner that isolated datacenter infrastructure cannot support. The idea behind a datacenter OS is that of an intelligent software that sits above all the hardware in a datacenter and ensures efficient and dynamic resource sharing. Added to this is the capability to constantly monitor resource usage and improve workload and infrastructure management in a seamless way that is not tied to specific application requirements. 
In its absence, we have a scenario with silos in a datacenter that force developers to build software catering to machine-specific characteristics and make the moving and resizing of applications a highly cumbersome procedure. The datacenter OS acts as a software layer that aggregates all servers in a datacenter into one giant supercomputer to deliver the benefits of multitenancy, isolation, and resource control across all microservice applications. Another major advantage is the elimination of human-induced error during the continual assigning and reassigning of virtual resources.

From a developer's perspective, this will allow them to easily and safely build distributed applications without restricting them to a bunch of specialized tools, each catering to a specific set of requirements. For instance, consider the case of data science teams who develop analytic applications that are highly resource intensive. An operating system that can simplify how resources are accessed, shared, and distributed successfully alleviates their concern about reallocating hardware every time the workloads change.

Of key importance is the relevance of the datacenter OS to DevOps, primarily a software development approach that emphasizes automation, integration, collaboration, and communication between traditional software developers and other IT professionals. With a datacenter OS that effectively transforms individual servers into a pool of resources, DevOps teams can focus on accelerating development and not continuously worry about infrastructure issues.

In a world where distributed computing becomes the norm, the datacenter OS is a boon. With freedom from manually configuring and maintaining individual machines and applications, system engineers need not configure specific machines for specific applications, as all applications would be capable of running on any available resources from any machine, even if other applications are already running on them. Using a datacenter OS results in centralized control and smart utilization of resources that eliminate hardware and software silos to ensure greater accessibility and usability even for noninfrastructure professionals. Examples of organizations administering their hyperscale datacenters via a datacenter OS include Google, with its Borg (and next-generation Omega) systems.

The merits of the datacenter OS are undeniable, with benefits ranging from the scalability of computing resources and flexibility to support data sharing across applications to saving team effort, time, and money while launching and managing interoperable cluster applications. It is this vision of transforming the datacenter into a single supercomputer that Apache Mesos seeks to achieve. Born out of a Berkeley AMPLab research paper in 2011, it has since come a long way, with a number of leading companies, such as Apple, Twitter, Netflix, and Airbnb among others, using it in production. Mesosphere is a start-up that is developing a distributed OS product with Mesos at its core.

The architecture of Mesos

Mesos is an open source platform for sharing clusters of commodity servers between different distributed applications (or frameworks), such as Hadoop, Spark, and Kafka among others. The idea is to act as a centralized cluster manager by pooling together all the physical resources of the cluster and making them available as a single reservoir of highly available resources for all the different frameworks to utilize.
For example, if an organization has one 10-node cluster (16 CPUs and 64 GB RAM per node) and another 5-node cluster (4 CPUs and 16 GB RAM per node), then Mesos can be leveraged to pool them into one virtual cluster of 720 GB RAM and 180 CPUs, where multiple distributed applications can be run. Sharing resources in this fashion greatly improves cluster utilization and eliminates the need for an expensive data replication process per framework.

Some of the important features of Mesos are:

- Scalability: It can elastically scale to over 50,000 nodes
- Resource isolation: This is achieved through Linux/Docker containers
- Efficiency: This is achieved through CPU and memory-aware resource scheduling across multiple frameworks
- High availability: This is achieved through Apache ZooKeeper
- Interface: A web UI for monitoring the cluster state

Mesos is based on the same principles as the Linux kernel and aims to provide a highly available, scalable, and fault-tolerant base for enabling various frameworks to share cluster resources effectively and in isolation. Distributed applications are varied and continuously evolving, a fact that leads Mesos' design philosophy towards a thin interface that allows an efficient resource allocation between different frameworks and delegates the task of scheduling and job execution to the frameworks themselves. The two advantages of doing so are:

- Different frameworks can independently devise methods to address their data locality, fault-tolerance, and other such needs
- It simplifies the Mesos codebase and allows it to be scalable, flexible, robust, and agile

Mesos' architecture hands over the responsibility of scheduling tasks to the respective frameworks by employing a resource offer abstraction that packages a set of resources and makes offers to each framework. The Mesos master node decides the quantity of resources to offer each framework, while each framework decides which resource offers to accept and which tasks to execute on these accepted resources. This method of resource allocation is shown to achieve a good degree of data locality for each framework sharing the same cluster.

An alternative architecture would implement a global scheduler that takes framework requirements, organizational priorities, and resource availability as inputs and provides a task schedule breakdown by framework and resource as output, essentially acting as a matchmaker for jobs and resources with priorities acting as constraints. The challenges with this architecture, such as developing a robust API that could capture all the varied requirements of different frameworks, anticipating new frameworks, and solving a complex scheduling problem for millions of jobs, made the former approach a much more attractive option for the creators.

Summary

Thus, in this article, we introduced Mesos and then dived deep into its architecture to understand its importance.

Resources for Article:

Further resources on this subject:
- Understanding Mesos Internals [article]
- Leveraging Python in the World of Big Data [article]
- Self-service Business Intelligence, Creating Value from Data [article]


ALM – Developers and QA

Packt
30 Mar 2016
15 min read
This article by Can Bilgin, the author of Mastering Cross-Platform Development with Xamarin, provides an introduction to Application Lifecycle Management (ALM) and continuous integration methodologies for Xamarin cross-platform applications. As the part of the ALM process that is most relevant for developers, unit test strategies will be discussed and demonstrated, together with automated UI testing. This article is divided into the following sections:

- Development pipeline
- Troubleshooting
- Unit testing
- UI testing

(For more resources related to this topic, see here.)

Development pipeline

The development pipeline can be described as the virtual production line that steers a project from a mere bundle of business requirements to the consumers. Stakeholders that are part of this pipeline include, but are not limited to, business proxies, developers, the QA team, the release and configuration team, and finally the consumers themselves. Each stakeholder in this production line assumes different responsibilities, and they should all function in harmony. Hence, having an efficient, healthy, and preferably automated pipeline that is going to provide the communication and transfer of deliverables between units is vital for the success of a project.

In the Agile project management framework, the development pipeline is cyclical rather than a linear delivery queue. In the application life cycle, requirements are inserted continuously into a backlog. The backlog leads to a planning and development phase, which is followed by testing and QA. Once the production-ready application is released, consumers can be made part of this cycle using live application telemetry instrumentation.

Figure 1: Application life cycle management

In Xamarin cross-platform application projects, development teams are blessed with various tools and frameworks that can ease the execution of ALM strategies. From sketching and mock-up tools available for early prototyping and design to the source control and project management tools that make up the backbone of ALM, Xamarin projects can utilize various tools to automate and systematically analyze the project timeline.

The following sections of this article concentrate mainly on the lines of defense that protect the health and stability of a Xamarin cross-platform project in the timeline between the assignment of tasks to developers and the point at which the task or bug is completed/resolved and checked into a source control repository.

Troubleshooting and diagnostics

SDKs associated with Xamarin target platforms and development IDEs are equipped with comprehensive analytic tools. Utilizing these tools, developers can identify issues causing app freezes, crashes, slow response times, and other resource-related problems (for example, excessive battery usage).

Xamarin.iOS applications are analyzed using the XCode Instruments toolset. In this toolset, there are a number of profiling templates, each used to analyze a certain perspective of application execution. Instrument templates can be executed on an application running on the iOS simulator or on an actual device.

Figure 2: XCode Instruments

Similarly, Android applications can be analyzed using the device monitor provided by the Android SDK. Using Android Monitor, the memory profile, CPU/GPU utilization, and network usage can be analyzed, and application-provided diagnostic information can be gathered. Android Debug Bridge (ADB) is a command-line tool that allows various manual or automated device-related operations.
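To give a concrete feel for how ADB fits into day-to-day troubleshooting, the following shell commands show a minimal, typical workflow; the package name com.example.fibonacci and the APK path are placeholders for your own application, not values from the article's sample project:

adb devices                                        # list connected devices and emulators
adb install -r bin/Release/com.example.fibonacci.apk   # (re)install a freshly built APK
adb logcat                                         # stream the device log, including Mono/Xamarin runtime messages
adb shell dumpsys meminfo com.example.fibonacci    # dump current memory usage for the app process

Each of these commands can also be scripted as part of an automated diagnostics or CI step.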
For Windows Phone applications, Visual Studio provides a number of analysis tools for profiling CPU usage, energy consumption, memory usage, and XAML UI responsiveness. XAML diagnostic sessions in particular can provide valuable information on problematic sections of view implementation and pinpoint possible visual and performance issues:

Figure 3: Visual Studio XAML analyses

Finally, Xamarin Profiler, as a maturing application (currently in preview release), can help analyze memory allocations and execution time. Xamarin Profiler can be used with iOS and Android applications.

Unit testing

The test-driven development (TDD) pattern dictates that the business requirements and the granular use-cases defined by these requirements should be initially reflected in unit test fixtures. This allows a mobile application to grow/evolve within the defined borders of these assertive unit test models. Whether following a TDD strategy or implementing tests to ensure the stability of the development pipeline, unit tests are fundamental components of a development project.

Figure 4: Unit test project templates

Xamarin Studio and Visual Studio both provide a number of test project templates targeting different areas of a cross-platform project. In Xamarin cross-platform projects, unit tests can be categorized into two groups: platform-agnostic and platform-specific testing.

Platform-agnostic unit tests

Platform-agnostic components, such as portable class libraries containing shared logic for Xamarin applications, can be tested using the common unit test projects targeting the .NET framework. Visual Studio Test Tools or the NUnit test framework can be used according to the development environment of choice. It is also important to note that shared projects used to create shared logic containers for Xamarin projects cannot be tested with .NET unit test fixtures. For shared projects and the referencing platform-specific projects, platform-specific unit test fixtures should be prepared.

When following an MVVM pattern, view models are the focus of unit test fixtures since, as previously explained, a view model can be perceived as a finite state machine where the bindable properties are used to create a certain state on which the commands are executed, simulating a specific use-case to be tested. This approach is the most convenient way to test the UI behavior of a Xamarin application without having to implement and configure automated UI tests.

While implementing unit tests for such projects, a mocking framework is generally used to replace the platform-dependent sections of the business logic. Loosely coupling these dependent components makes it easier for developers to inject mocked interface implementations and increases the testability of these modules. The most popular mocking frameworks for unit testing are Moq and RhinoMocks. Both Moq and RhinoMocks utilize reflection and, more specifically, the Reflection.Emit namespace, which is used to generate types, methods, events, and other artifacts at runtime. The aforementioned iOS restrictions on code generation make these libraries inapplicable for platform-specific testing, but they can still be included in unit test fixtures targeting the .NET framework. For platform-specific implementations, the True Fakes library provides compile-time code generation and mocking features.
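As a brief illustration of this approach, the following NUnit test sketches how a platform-dependent service could be mocked with Moq while exercising a view model. The FibonacciViewModel and IDialogService types here are hypothetical stand-ins used only for the example; they are not part of the article's sample project:

using Moq;
using NUnit.Framework;

[TestFixture]
public class FibonacciViewModelTests
{
    [Test]
    public void Calculate_WithNegativeOrdinal_ShowsErrorDialog()
    {
        // Mock the platform-dependent dialog service (hypothetical interface)
        var dialogService = new Mock<IDialogService>();

        // Arrange the view model state through its bindable properties
        var viewModel = new FibonacciViewModel(dialogService.Object)
        {
            Ordinal = -2
        };

        // Act by executing the bound command
        viewModel.CalculateCommand.Execute(null);

        // Assert that the expected platform interaction took place
        dialogService.Verify(
            d => d.ShowMessage("Ordinal cannot be a negative number."),
            Times.Once);
    }
}

Because the mocked IDialogService keeps the test free of any platform-specific UI code, the same fixture can run in a plain .NET unit test project.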
Depending on the implementation specifics (such as namespaces used, network communication, multithreading, and so on), in some scenarios it is imperative to test the common logic implementation on specific platforms as well. For instance, some multithreading and parallel task implementations give different results on Windows Runtime, Xamarin.Android, and Xamarin.iOS. These variations generally occur because of the underlying platform's mechanism or slight differences between the .NET and Mono implementation logic. In order to ensure the integrity of these components, common unit test fixtures can be added as linked/referenced files to platform-specific test projects and executed on the test harness.

Platform-specific unit tests

In a Xamarin project, platform-dependent features cannot be unit tested using the conventional unit test runners available in the Visual Studio Test Suite and NUnit frameworks. Platform-dependent tests are executed on empty platform-specific projects that serve as a harness for unit tests for that specific platform. Windows Runtime application projects can be tested using the Visual Studio Test Suite. However, for Android and iOS, the NUnit testing framework should be used, since Visual Studio Test Tools are not available for the Xamarin.Android and Xamarin.iOS platforms.

Figure 5: Test harnesses

The unit test runner for Windows Phone (Silverlight) and Windows Phone 8.1 applications uses a test harness integrated with the Visual Studio test explorer. The unit tests can be executed and debugged from within Visual Studio.

Xamarin.Android and Xamarin.iOS test project templates use the NUnitLite implementation for the respective platforms. In order to run these tests, the test application should be deployed on the simulator (or the testing device) and the application has to be manually executed. It is possible to automate the unit tests on the Android and iOS platforms through instrumentation. In each Xamarin target platform, the initial application lifetime event is used to add the necessary unit tests:

[Activity(Label = "Xamarin.Master.Fibonacci.Android.Tests", MainLauncher = true, Icon = "@drawable/icon")]
public class MainActivity : TestSuiteActivity
{
    protected override void OnCreate(Bundle bundle)
    {
        // tests can be inside the main assembly
        //AddTest(Assembly.GetExecutingAssembly());
        // or in any reference assemblies
        AddTest(typeof(Fibonacci.Android.Tests.TestsSample).Assembly);

        // Once you called base.OnCreate(), you cannot add more assemblies.
        base.OnCreate(bundle);
    }
}

In the Xamarin.Android implementation, the MainActivity class derives from TestSuiteActivity, which implements the necessary infrastructure to run the unit tests and the UI elements to visualize the test results. On the Xamarin.iOS platform, the test application uses the default UIApplicationDelegate, and generally, the FinishedLaunching event delegate is used to create the ViewController for the unit test run fixture:

public override bool FinishedLaunching(UIApplication application, NSDictionary launchOptions)
{
    // Override point for customization after application launch.
    // If not required for your application you can safely delete this method
    var window = new UIWindow(UIScreen.MainScreen.Bounds);
    var touchRunner = new TouchRunner(window);

    touchRunner.Add(System.Reflection.Assembly.GetExecutingAssembly());

    window.RootViewController = new UINavigationController(touchRunner.GetViewController());
    window.MakeKeyAndVisible();

    return true;
}

The main shortcoming of executing unit tests this way is the fact that it is not easy to generate a code coverage report and archive the test results. Neither of these testing methods provides the ability to test the UI layer. They are simply used to test platform-dependent implementations. In order to test the interactive layer, platform-specific or cross-platform (Xamarin.Forms) coded UI tests need to be implemented.

UI testing

In general terms, the code coverage of the unit tests directly correlates with the amount of shared code, which amounts to, at the very least, 70-80 percent of the code base in a mundane Xamarin project. One of the main driving factors of architectural patterns was to decrease the amount of logic and code in the view layer so that the testability of the project utilizing conventional unit tests reaches a satisfactory level. Coded UI (or automated UI acceptance) tests are used to test the uppermost layer of the cross-platform solution: the views.

Xamarin.UITests and Xamarin Test Cloud

The main UI testing framework used for Xamarin projects is the Xamarin.UITests testing framework. This testing component can be used on various platform-specific projects, varying from native mobile applications to Xamarin.Forms implementations, except for the Windows Phone platform and applications. Xamarin.UITests is an implementation based on the Calabash framework, which is an automated UI acceptance testing framework targeting mobile applications.

Xamarin.UITests is introduced to Xamarin.iOS or Xamarin.Android applications using the publicly available NuGet packages. The included framework components are used to provide an entry point to the native applications. The entry point is the Xamarin Test Cloud Agent, which is embedded into the native application during compilation. The cloud agent is similar to a local server that allows either the Xamarin Test Cloud or the test runner to communicate with the app infrastructure and simulate user interaction with the application.

Xamarin Test Cloud is a subscription-based service allowing Xamarin applications to be tested on real mobile devices using UI tests implemented via Xamarin.UITests. Xamarin Test Cloud not only provides a powerful testing infrastructure for Xamarin.iOS and Xamarin.Android applications with an abundant number of mobile devices but can also be integrated into continuous integration workflows.

After installing the appropriate NuGet package, the UI tests can be initialized for a specific application on a specific device. In order to initialize the interaction adapter for the application, the app package and the device should be configured.
On Android, the APK package path and the device serial can be used for the initialization:

IApp app = ConfigureApp.Android
    .ApkFile("<APK Path>/MyApplication.apk")
    .DeviceSerial("<DeviceID>")
    .StartApp();

For an iOS application, the procedure is similar:

IApp app = ConfigureApp.iOS
    .AppBundle("<App Bundle Path>/MyApplication.app")
    .DeviceIdentifier("<DeviceID of Simulator>")
    .StartApp();

Once the app handle has been created, each test written using NUnit should first create the pre-conditions for the tests, simulate the interaction, and finally test the outcome. The IApp interface provides a set of methods to select elements on the visual tree and simulate certain interactions, such as text entry and tapping. On top of the main testing functionality, screenshots can be taken to document test steps and possible bugs. Both Visual Studio and Xamarin Studio provide project templates for Xamarin.UITests.

Xamarin Test Recorder

Xamarin Test Recorder is an application that can ease the creation of automated UI tests. It is currently in its preview version and is only available for the Mac OS platform.

Figure 6: Xamarin Test Recorder

Using this application, developers can select the application in need of testing and the device/simulator that is going to run the application. Once the recording session starts, each interaction on the screen is recorded as execution steps on a separate screen, and these steps can be used to generate the preparation or testing steps for the Xamarin.UITests implementation.

Coded UI tests (Windows Phone)

Coded UI tests are used for automated UI testing on the Windows Phone platform. Coded UI tests for Windows Phone and Windows Store applications are not any different than their counterparts for other .NET platforms such as Windows Forms, WPF, or ASP.NET. It is also important to note that only XAML applications support Coded UI tests.

Coded UI tests are generated on a simulator and written on an Automation ID premise. The Automation ID property is an automatically generated or manually configured identifier for Windows Phone applications (only in XAML) and the UI controls used in the application. Coded UI tests depend on the UIMap created for each control on a specific screen using the Automation IDs. While creating the UIMap, a crosshair tool can be used to select the application and the controls on the simulator screen to define the interactive elements:

Figure 7: Generating coded UI accessors and tests

Once the UIMap has been created and the designer files have been generated, gestures and the generated XAML accessors can be used to create testing pre-conditions and assertions.

For Coded UI tests, multiple scenario-specific input values can be used and tested on a single assertion. Using the DataRow attribute, unit tests can be expanded to test multiple data-driven scenarios. The code snippet below uses multiple input values to test different incorrect input values:

[DataRow(0, "Zero Value")]
[DataRow(-2, "Negative Value")]
[TestMethod]
public void FibonnaciCalculateTest_IncorrectOrdinal(int ordinalInput, string description)
{
    // TODO: Check if bad values are handled correctly
}

Automated tests can run on available simulators and/or a real device. They can also be included in CI build workflows and made part of the automated development pipeline.

Calabash

Calabash is an automated UI acceptance testing framework used to execute Cucumber tests. Cucumber tests provide an assertion strategy similar to coded UI tests, only broader and behavior oriented.
The Cucumber test framework supports tests written in the Gherkin language (a human-readable programming grammar description for behavior definitions). Calabash makes up the necessary infrastructure to execute these tests on various platforms and application runtimes.

A simple declaration of the feature and the scenario that was previously tested with Coded UI using the data-driven model would look similar to the excerpt below. Only two of the possible test scenarios are declared in this feature for demonstration; the feature can be extended:

Feature: Calculate Single Fibonacci number.
  Ordinal entry should be greater than 0.

Scenario: Ordinal is lower than 0.
  Given I use the native keyboard to enter "-2" into text field Ordinal
  And I touch the "Calculate" button
  Then I see the text "Ordinal cannot be a negative number."

Scenario: Ordinal is 0.
  Given I use the native keyboard to enter "0" into text field Ordinal
  And I touch the "Calculate" button
  Then I see the text "Cannot calculate the number for the 0th ordinal."

Calabash test execution is possible on Xamarin target platforms since the Ruby API exposed by the Calabash framework has a bidirectional communication line with the Xamarin Test Cloud Agent embedded in Xamarin applications with NuGet packages. Calabash/Cucumber tests can be executed on Xamarin Test Cloud on real devices, since the communication between the application runtime and the Calabash framework is maintained by the Xamarin Test Cloud Agent, the same as for Xamarin.UITests.

Summary

Xamarin projects can benefit from a properly established development pipeline and the use of ALM principles. This type of approach makes it easier for teams to share responsibilities and work out business requirements in an iterative manner. In the ALM timeline, the development phase is the main domain in which most of the concrete implementation takes place. In order for the development team to provide quality code that can survive the ALM cycle, it is highly advised to analyze and test native applications using the available tooling in Xamarin development IDEs. While the common codebase for a target platform in a Xamarin project can be treated and tested as a .NET implementation using conventional unit tests, platform-specific implementations require more particular handling. Platform-specific parts of the application need to be tested on empty shell applications, called test harnesses, on the respective platform simulators or devices. To test views, available frameworks such as Coded UI tests (for Windows Phone) and Xamarin.UITests (for Xamarin.Android and Xamarin.iOS) can be utilized to increase the test code coverage and create a stable foundation for the delivery pipeline. Most tests and analysis tools discussed in this article can be integrated into automated continuous integration processes.

Resources for Article:

Further resources on this subject:
- A cross-platform solution with Xamarin.Forms and MVVM architecture [article]
- Working with Xamarin.Android [article]
- Application Development Workflow [article]


Golang Decorators: Logging & Time Profiling

Nicholas Maccharoli
30 Mar 2016
6 min read
Golang's imperative world

Golang is not, by any means, a functional language; its design remains true to its jingle, which says that it is "C for the 21st Century". One task I tried to do early on in learning the language was search for the map, filter, and reduce functions in the standard library, but to no avail. Next, I tried rolling my own versions, but I felt as though I hit a bit of a road block when I discovered that there is no support for generics in the language at the time of writing this. There is, however, support for Higher Order Functions or, more simply put, functions that take other functions as arguments and return functions.

If you have spent some time in Python, you may have come to love a design pattern called "Decorator". In fact, decorators make life in Python so great that support for applying them is built right into the language with a nifty @ operator! Python frameworks such as Flask extensively use decorators. If you have little or no experience in Python, fear not, for the concept is a design pattern independent of any language.

Decorators

An alternative name for the decorator pattern is "wrapper", which pretty much sums it all up in one word! A decorator's job is only to wrap a function so that additional code can be executed when the original function is called. This is accomplished by writing a function that takes a function as its argument and returns a function of the same type (Higher Order Functions in action!). While this still calls the original function and passes through its return value, it does something extra along the way.

Decorators for logging

We can easily log which specific method is passed with a little help from our decorator friends. Say, we wanted to log which user liked a blog post and what the ID of the post was, all without touching any code in the original likePost function. Here is our original function:

func likePost(userId int, postId int) bool {
    fmt.Printf("Update Complete!\n")
    return true
}

Our decorator might look something similar to this:

type LikeFunc func(int, int) bool

func decoratedLike(f LikeFunc) LikeFunc {
    return func(userId int, postId int) bool {
        fmt.Printf("likePost Log: User %v liked post# %v\n", userId, postId)
        return f(userId, postId)
    }
}

Note the use of the type definition here. I encourage you to use it for the sake of readability when defining functions with long signatures, such as those of decorators, as you need to type the function signature twice.

Now, we can apply the decorator and allow the logging to begin:

r := decoratedLike(likePost)
r(1414, 324)
r(5454, 324)
r(4322, 250)

This produces the following output:

likePost Log: User 1414 liked post# 324
Update Complete!
likePost Log: User 5454 liked post# 324
Update Complete!
likePost Log: User 4322 liked post# 250
Update Complete!

Our original likePost function still gets called and runs as expected, but now we get an additional log detailing the user and post IDs that were passed to the function each time it was called. Hopefully, this will help speed up debugging our likePost function if and when we encounter strange behavior!

Decorators for performance!

Say, we run a "Top 10" site and previously, our main sorting routine to find the top 10 cat photos of this week on the Internet was written with Golang's func Sort(data Interface) function from the sort package of the Golang standard library. Everything is fine until we are informed that Fluffy the cat is infuriated that she is coming in at number six on the list and not number five.
The cats at ranks five and six on the list both had 5000 likes each, but Fluffy reached 5000 likes a day earlier than Bozo the cat, who is currently higher ranked. We like to give credit where it's due, so we apologize to Fluffy and go on to use the stable version of the sort, func Stable(data Interface), which preserves the order of elements equal in value during the sort. We can improve our code and tests so that this does not happen again (We promised Fluffy!).

The tests pass, everything looks great, and we deploy gracefully... or so we think. Over the course of the day, other developers also deploy their changes, and then, after checking our performance reports, we notice a slowdown somewhere. Is it from our switch to the stable sort? Well, let's use decorators to measure the performance of both sort functions and check whether there is a noticeable dip in performance. Here's our testing function:

type SortFunc func(sort.Interface)

func timedSortFunc(f SortFunc) SortFunc {
    return func(data sort.Interface) {
        defer func(t time.Time) {
            fmt.Printf("--- Time Elapsed: %v ---\n", time.Since(t))
        }(time.Now())
        f(data)
    }
}

In case you are unfamiliar with defer, all it does is call the function it is passed right after its calling function returns. The arguments passed to defer are evaluated right away, so the value we get from time.Now() is really the start time of the function! Let's go ahead and give this test a go:

stable := timedSortFunc(sort.Stable)
unStable := timedSortFunc(sort.Sort)

// 10000 Elements with values ranging
// between 0 and 5000
randomCatList1 := randomCatScoreSlice(10000, 5000)
randomCatList2 := randomCatScoreSlice(10000, 5000)

fmt.Printf("Unstable Sorting Function:\n")
unStable(randomCatList1)

fmt.Printf("Stable Sorting Function:\n")
stable(randomCatList2)

The following output is yielded:

Unstable Sorting Function:
--- Time Elapsed: 282.889µs ---
Stable Sorting Function:
--- Time Elapsed: 93.947µs ---

Wow! Fluffy's complaint not only made our top 10 list more accurate, but now the list sorts about three times as fast with the stable version of sort as well! (However, we still need to be careful; sort.Stable most likely uses way more memory than the standard sort.Sort function.)

Final thoughts

Figuring out when and where to apply the decorator pattern is really up to you and your team. There are no hard rules, and you can completely live without it. However, when it comes to things like extra logging or profiling a pesky area of your code, this technique may prove to be a valuable tool.

Where is the rest of the code?

In order to get this example up and running, there is some setup code that was not shown here in order to keep the post from becoming too bloated. I encourage you to take a look at this code here if you are interested!

About the author

Nick Maccharoli is an iOS/backend developer and open source enthusiast working at a start-up in Tokyo and enjoying the current development scene. You can see what he is up to at @din0sr or github.com/nirma.

Boosting up the Performance of a Database

Packt
29 Mar 2016
10 min read
In this article by Altaf Hussain, author of the book Learning PHP 7 High Performance, we will see how databases play a key role in dynamic websites. All incoming and outgoing data is stored in databases. So if the database for a PHP application is not well-designed and optimized, it will affect the application performance tremendously. In this article, we will be looking into the ways to optimize our PHP application database.

(For more resources related to this topic, see here.)

MySQL

MySQL is the most used Relational Database Management System (RDBMS) for the web. It is open source and has a free community version. It provides all those features which can be provided by an enterprise-level database. The default settings provided with the MySQL installation may not be so good for performance, and there are always ways to fine-tune settings to get increased performance. Also, remember that your database design plays a role in performance as well. A poorly designed database will have an effect on overall performance.

In this article, we will discuss how to improve the MySQL database performance. We will be modifying the MySQL configuration file, my.cnf. This file is located in different places in different OSes. Also, if you are using XAMPP, WAMP, and so on, on Windows, this file will be located in those respective folders. Whenever my.cnf is mentioned, it is assumed that the file is open, no matter which OS is used.

Query Caching

Query Caching is an important performance feature of MySQL. It caches SELECT queries along with the resulting dataset. When an identical SELECT query occurs, MySQL fetches the data from memory; hence, the query is executed faster. This reduces the load on the database. To check whether query cache is enabled on a MySQL server or not, issue the following command in your MySQL command line:

SHOW VARIABLES LIKE 'have_query_cache';

This command will display an output, as follows:

This result set shows that query cache is enabled. If query cache is disabled, the value will be NO.

To enable query caching, open up the my.cnf file and add the following lines. If these lines are present, just uncomment them if they are commented:

query_cache_type = 1
query_cache_size = 128M
query_cache_limit = 1M

Save the my.cnf file and restart the MySQL server. Let's discuss what these three configurations mean.

query_cache_size

The query_cache_size parameter defines how much memory will be allocated. Some will think that the more memory used, the better, but this is a misunderstanding. It all depends on the size of the database, the types of queries, the ratio between reads and writes, hardware, database traffic, and so on. A good value for query_cache_size is between 100 MB and 200 MB. Then, monitor the performance and the other previously mentioned variables on which the query cache depends, and adjust the size. We have used 128 MB for a medium-traffic Magento website, and it is working perfectly. Set this value to 0 to disable the query cache.

query_cache_limit

This defines the maximum size of a query dataset to be cached. If the size of a query dataset is larger than this value, it won't be cached. The value of this configuration can be guessed by finding out the largest SELECT query and the size of its returned dataset.

query_cache_type

The query_cache_type parameter plays a weird role.
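You can check the value currently in effect, and keep an eye on how well the cache is being used, with a couple of standard MySQL commands; the status counters below (such as Qcache_hits and Qcache_lowmem_prunes) are maintained by MySQL whenever the query cache is compiled in:

SHOW VARIABLES LIKE 'query_cache_type';
SHOW STATUS LIKE 'Qcache%';

A high ratio of Qcache_hits to Com_select generally indicates that the cache is paying off, while a steadily growing Qcache_lowmem_prunes suggests query_cache_size is too small.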
If query_cache_type is set to 1, then the following may occur:

- If query_cache_size is 0, then no memory is allocated and query cache is disabled
- If query_cache_size is greater than 0, then query cache is enabled, memory is allocated, and all queries that do not exceed query_cache_limit and do not use the SQL_NO_CACHE option will be cached

If the query_cache_type value is 0, then the following occurs:

- If query_cache_size is 0, then no memory is allocated and the cache is disabled
- If query_cache_size is greater than 0, then the memory is allocated, but nothing is cached, that is, the cache is disabled

Storage Engines

Storage Engines (or Table Types) are a part of core MySQL and are responsible for handling operations on tables. MySQL provides several storage engines, and the two most widely used are MyISAM and InnoDB. Both storage engines have their own pros and cons, but InnoDB is usually prioritized. MySQL started to use InnoDB as its default storage engine from version 5.5 onwards.

MySQL provides some other storage engines, which have their own purposes. During the database design process, it can be decided which table should use which storage engine. A complete list of storage engines for MySQL 5.6 can be found at http://dev.mysql.com/doc/refman/5.6/en/storage-engines.html.

The storage engine can be set at the database level, which will then be used as the default storage engine for each newly created table. Note that the storage engine is table-based and different tables can have different storage engines in a single database. What if we have a table already created and we want to change its storage engine? This is easy. Let's say our table name is pkt_users, its storage engine is MyISAM, and we want to change it to InnoDB; then we will use the following MySQL command:

ALTER TABLE pkt_users ENGINE=INNODB;

This will change the storage engine of the table to InnoDB. Now, let's discuss the difference between the two most widely used storage engines, MyISAM and InnoDB.

MyISAM

A brief list of features that are or are not supported by MyISAM is as follows:

- MyISAM is designed for speed, which plays best with SELECT statements. If a table is more static, that is, the data in that table is less frequently updated or deleted and mostly only fetched, then MyISAM is best for this table.
- MyISAM supports table-level locking. If a specific operation needs to be performed on data in a table, then the complete table can be locked. During this lock, no operation can be performed on this table. This can cause performance degradation if the table is more dynamic, that is, the data in the table changes frequently.
- MyISAM does not have support for Foreign Keys (FK).
- MyISAM supports fulltext search.
- MyISAM does not support transactions, so there is no support for commit and rollback. If a query on a table is executed, it is executed, and there is no coming back.
- Data compression, replication, query cache, and data encryption are supported.
- Cluster databases are not supported.

InnoDB

A brief list of features that are or are not supported by InnoDB is as follows:

- InnoDB is designed for high reliability and high performance when processing a high volume of data.
- InnoDB supports row-level locking. It is a good feature and is great for performance. Instead of locking the complete table like MyISAM, it locks only the specific rows for SELECT, DELETE, or UPDATE operations; and during these operations, other data in the table can be manipulated.
- InnoDB supports Foreign Keys and enforcing Foreign Key constraints.
- Transactions are supported. Commits and rollbacks are possible; hence, data can be recovered from a specific transaction.
- Data compression, replication, query cache, and data encryption are supported.
- InnoDB can be used in a cluster environment, but it does not have full support. However, InnoDB tables can be converted to the NDB storage engine, which is used in a MySQL cluster, by changing the table engine to NDB.

In the following sections, we will discuss some more performance features that are related to InnoDB. Values for the following configurations are set in the my.cnf file.

innodb_buffer_pool_size

This setting defines how much memory should be used for InnoDB data and indexes loaded into memory. For a dedicated MySQL server, the recommended value is 50-80% of the installed memory on the server. If this value is set too high, then there will be no memory left for the operating system and other subsystems of MySQL, such as transaction logs. So, let's open our my.cnf file, search for innodb_buffer_pool_size, and set the value within the recommended range (50-80%) of our RAM.

innodb_buffer_pool_instances

This feature is not that widely used. It enables multiple buffer pool instances to work together to reduce the chances of memory contention on 64-bit systems with a large value for innodb_buffer_pool_size. There are different ways to calculate the value of innodb_buffer_pool_instances. One way is to use one instance per GB of innodb_buffer_pool_size. So, if the value of innodb_buffer_pool_size is 16 GB, we would set innodb_buffer_pool_instances to 16.

innodb_log_file_size

innodb_log_file_size is the size of the log file that stores information for every query that has been executed. For a dedicated server, a value of up to 4 GB is safe, but the time for crash recovery may increase if the log file size is too big. So, as a best practice, it should be kept between 1 GB and 4 GB.

Percona Server

According to the Percona website, "Percona Server is a free, fully compatible, enhanced, open source drop-in replacement for MySQL that provides superior performance, scalability, and instrumentation." Percona is a fork of MySQL with enhanced features for performance. All the features available in MySQL are available in Percona. Percona uses an enhanced storage engine, which is called XtraDB. According to the Percona website: "Percona XtraDB is an enhanced version of the InnoDB storage engine for MySQL, which has more features, faster performance, and better scalability on modern hardware. Percona XtraDB uses memory more efficiently in high-load environments." As mentioned previously, XtraDB is a fork of InnoDB, so all features available in InnoDB are available in XtraDB.

Installation

Percona is only available for Linux systems. It is not available for Windows as of now. In this book, we will install the Percona server on Debian 8. The process is the same for both Ubuntu and Debian. To install the Percona server on other Linux flavors, check out the Percona installation manual at https://www.percona.com/doc/percona-server/5.5/installation.html. As of now, they provide instructions for Debian, Ubuntu, CentOS, and RHEL. They also provide instructions to install the Percona server from sources and Git.

Now, let's install the Percona server using the following steps:

1. Open your sources list file using the following command in your terminal:

   sudo nano /etc/apt/sources.list

   If prompted for a password, enter your Debian password. The file will be opened.
2. Now, place the following repository information at the end of the sources.list file:

   deb http://repo.percona.com/apt jessie main
   deb-src http://repo.percona.com/apt jessie main

3. Save the file by pressing CTRL + O and close it by pressing CTRL + X.

4. Update your system using the following command in the terminal:

   sudo apt-get update

5. Start the installation by issuing the following command in the terminal:

   sudo apt-get install percona-server-server-5.5

The installation will start. The process is the same as the MySQL server installation. During installation, the root password for the Percona server will be asked for; you just need to enter it. When the installation is completed, you are ready to use the Percona server in the same way as you would use MySQL. Configure the Percona server and optimize it as discussed in the previous sections.

Summary

In this article, we studied the MySQL and Percona servers with Query Caching and other MySQL configuration options for performance. We also compared different storage engines and Percona XtraDB, and saw MySQL Workbench performance monitoring tools as well.

Resources for Article:

Further resources on this subject:
- Building a Web Application with PHP and MariaDB – Introduction to caching [article]
- PHP Magic Features [article]
- Understanding PHP basics [article]


Building a Product Recommendation System

Packt
29 Mar 2016
25 min read
In this article by Raghav Bali and Dipanjan Sarkar, authors of the book R Machine Learning By Example, we will see that collaborative filtering is a simple yet very effective approach for predicting and recommending items to users. If we look closely, the algorithms work on input data which is nothing but a matrix representation of the user ratings for different products. Bringing a mathematical perspective into the picture, matrix factorization is a technique to manipulate matrices and identify latent or hidden features from the data represented in the matrix. Building upon the same concept, let us use matrix factorization as the basis for predicting ratings for items which the user has not yet rated.

(For more resources related to this topic, see here.)

Matrix factorization

Matrix factorization refers to the identification of two or more matrices such that, when these matrices are multiplied, we get the original matrix. Matrix factorization, as mentioned earlier, can be used to discover latent features between two different kinds of entities. We will understand and use the concepts of matrix factorization as we go along preparing our recommender engine for our e-commerce platform.

As our aim for the current project is to personalize the shopping experience and recommend product ratings for an e-commerce platform, our input data contains user ratings for various products on the website. We process the input data and transform it into matrix representation for analyzing it using matrix factorization. The input data looks like this:

User ratings matrix

As you can see, the input data is a matrix with each row representing a particular user's ratings for the different items represented in the columns. For the current case, the columns representing items are different mobile phones such as iPhone 4, iPhone 5s, Nexus 5, and so on. Each row contains ratings for each of these mobile phones as given by eight different users. The ratings range from 1 to 5, with 1 being the lowest and 5 being the highest. A rating of 0 represents unrated items or missing ratings. The task of our recommender engine will be to predict the correct ratings for the missing ones in the input matrix. We could then use the predicted ratings to recommend items most desired by the users.

The premise here is that two users would rate a product similarly if they like similar features of the product or item. Since our current data is related to user ratings for different mobile phones, people might rate the phones based on their hardware configuration, price, OS, and so on. Hence, matrix factorization tries to identify these latent features to predict ratings for a certain user and a certain product. While trying to identify these latent features, we proceed with the basic assumption that the number of such features is less than the total number of items in consideration. This assumption makes sense because otherwise each user would have a specific feature associated with him/her (and similarly for each product). This would in turn make recommendations futile, as none of the users would be interested in items rated by the other users (which is not the case usually).

Now let us get into the mathematical details of matrix factorization and our recommender engine. Since we are dealing with user ratings for different products, let us assume U to be a matrix representing user preferences and, similarly, a matrix P representing the products for which we have the ratings.
Then the ratings matrix R will be defined as R = U x P^T (we take the transpose of P as P^T for matrix multiplication), where |R| = |U| x |P|. Assuming the process helps us identify K latent features, our aim is to find two matrices X and Y such that their product (matrix multiplication) approximates R:

X: a |U| x K matrix
Y: a |P| x K matrix

Here, X is a user-related matrix which represents the associations between the users and the latent features. Y, on the other hand, is the product-related matrix which represents the associations between the products and the latent features. The task of predicting the rating of a product p_j by a user u_i is done by calculating the dot product of the vectors corresponding to u_i (a row of X, that is, the user) and p_j (a row of Y, that is, the product).

Now, to find the matrices X and Y, we utilize a technique called gradient descent. Gradient descent, in simple terms, tries to find the local minimum of a function; it is an optimization technique. We use gradient descent in the current context to iteratively minimize the difference between the predicted ratings and the actual ratings. To begin with, we randomly initialize the matrices X and Y and then calculate how different their product is from the actual ratings matrix R. The difference between the predicted and the actual values is what is termed as the error. For our problem, we will consider the squared error, which is calculated as:

e_{ij}^2 = (r_{ij} - \hat{r}_{ij})^2 = \left( r_{ij} - \sum_{k=1}^{K} x_{ik} y_{kj} \right)^2

Here, r_{ij} is the actual rating by user i for product j and \hat{r}_{ij} is the predicted value of the same. To minimize the error, we need to find the correct direction or gradient to change our values to. To obtain the gradient for each of the variables x and y, we differentiate the squared error with respect to them separately:

\frac{\partial e_{ij}^2}{\partial x_{ik}} = -2 e_{ij} y_{kj} \qquad \frac{\partial e_{ij}^2}{\partial y_{kj}} = -2 e_{ij} x_{ik}

Hence, the equations to update x_{ik} and y_{kj} can be given as:

x'_{ik} = x_{ik} + 2 \alpha e_{ij} y_{kj} \qquad y'_{kj} = y_{kj} + 2 \alpha e_{ij} x_{ik}

Here, α is the constant that denotes the rate of descent or the rate of approaching the minima (also known as the learning rate). The value of α defines the size of the steps we take in either direction to reach the minima. Large values may lead to oscillations as we may overshoot the minima every time. The usual practice is to select very small values for α, of the order of 10^-4. x'_{ik} and y'_{kj} are the updated values of x_{ik} and y_{kj} after each iteration of gradient descent.

To avoid overfitting, along with controlling extreme or large values in the matrices X and Y, we introduce the concept of regularization. Formally, regularization refers to the process of introducing additional information in order to prevent overfitting. Regularization penalizes models with extreme values. To prevent overfitting in our case, we introduce the regularization constant called β. With the introduction of β, the equations are updated as follows:

e_{ij}^2 = \left( r_{ij} - \sum_{k=1}^{K} x_{ik} y_{kj} \right)^2 + \frac{\beta}{2} \sum_{k=1}^{K} \left( x_{ik}^2 + y_{kj}^2 \right)

Also,

x'_{ik} = x_{ik} + \alpha \left( 2 e_{ij} y_{kj} - \beta x_{ik} \right) \qquad y'_{kj} = y_{kj} + \alpha \left( 2 e_{ij} x_{ik} - \beta y_{kj} \right)

As we already have the ratings matrix R and we use it to determine how far our predicted values are from the actual ones, matrix factorization turns into a supervised learning problem. We use some of the rows as our training samples. Let S be our training set, with elements being tuples of the form (u_i, p_j, r_{ij}). Thus, our task is to minimize the error e_{ij} for every tuple (u_i, p_j, r_{ij}) ∈ S. The overall error (say E) can be calculated as:

E = \sum_{(u_i, p_j, r_{ij}) \in S} e_{ij}^2 = \sum_{(u_i, p_j, r_{ij}) \in S} \left( r_{ij} - \sum_{k=1}^{K} x_{ik} y_{kj} \right)^2

Implementation

Now that we have looked into the mathematics of matrix factorization, let us convert the algorithm into code and prepare a recommender engine for the mobile phone ratings input dataset discussed earlier.
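Before moving to the code, a single worked update step may help make the rule concrete; the numbers below are made up purely for illustration and do not come from the ratings data used later. Suppose K = 2, the current factors for a user and a product are x_i = (0.5, 1.0) and y_j = (1.0, 0.5), the true rating is r_{ij} = 4, α = 0.0002, and β = 0.02. Then:

\hat{r}_{ij} = 0.5 \times 1.0 + 1.0 \times 0.5 = 1.0 \qquad e_{ij} = 4 - 1.0 = 3.0

x'_{i1} = 0.5 + 0.0002 \left( 2 \times 3.0 \times 1.0 - 0.02 \times 0.5 \right) = 0.5 + 0.0002 \times 5.99 \approx 0.5012

Each pass over the observed ratings nudges every x_{ik} and y_{kj} by such small amounts, which is why many iterations (the epoch parameter used below) are needed before the predictions settle.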
As shown in the Matrix factorization section, the input dataset is a matrix with each row representing a user's ratings for the products mentioned as columns. The ratings range from 1 to 5, with 0 representing missing values. To transform our algorithm into working code, we need to complete the following tasks:

- Load the input data and transform it into a ratings matrix representation
- Prepare a matrix factorization based recommendation model
- Predict and recommend products to the users
- Interpret and evaluate the model

Loading and transforming input data into matrix representation is simple. As seen earlier, R provides us with easy-to-use utility functions for the same:

# load raw ratings from csv
raw_ratings <- read.csv(<file_name>)

# convert columnar data to sparse ratings matrix
ratings_matrix <- data.matrix(raw_ratings)

Now that we have our data loaded into an R matrix, we proceed to prepare the user-latent features matrix X and the item-latent features matrix Y. We initialize both from uniform distributions using the runif function:

# number of rows in ratings
rows <- nrow(ratings_matrix)

# number of columns in ratings matrix
columns <- ncol(ratings_matrix)

# latent features
K <- 2

# User-Feature Matrix
X <- matrix(runif(rows*K), nrow=rows, byrow=TRUE)

# Item-Feature Matrix
Y <- matrix(runif(columns*K), nrow=columns, byrow=TRUE)

The major component is the matrix factorization function itself. Let us split the task into two parts: calculation of the gradient and, subsequently, the overall error. The calculation of the gradient involves the ratings matrix R and the two factor matrices X and Y, along with the constants α and β. Since we are dealing with matrix manipulations (specifically, multiplication), we transpose Y before we begin with any further calculations. The following lines of code convert the algorithm discussed previously into R syntax. All variables follow a naming convention similar to the algorithm for ease of understanding:

for (i in seq(nrow(ratings_matrix))){
  for (j in seq(length(ratings_matrix[i, ]))){
    if (ratings_matrix[i, j] > 0){

      # error
      eij = ratings_matrix[i, j] - as.numeric(X[i, ] %*% Y[, j])

      # gradient calculation
      for (k in seq(K)){
        X[i, k] = X[i, k] + alpha * (2 * eij * Y[k, j] - beta * X[i, k])
        Y[k, j] = Y[k, j] + alpha * (2 * eij * X[i, k] - beta * Y[k, j])
      }
    }
  }
}

The next part of the algorithm is to calculate the overall error; we again use similar variable names for consistency:

# Overall Squared Error Calculation
e = 0

for (i in seq(nrow(ratings_matrix))){
  for (j in seq(length(ratings_matrix[i, ]))){
    if (ratings_matrix[i, j] > 0){

      e = e + (ratings_matrix[i, j] - as.numeric(X[i, ] %*% Y[, j]))^2

      for (k in seq(K)){
        e = e + (beta/2) * (X[i, k]^2 + Y[k, j]^2)
      }
    }
  }
}

As a final piece, we iterate over these calculations multiple times to mitigate the risks of cold start and sparsity. We term the variable controlling these multiple passes epoch. We also terminate the calculations once the overall error drops below a certain threshold. Moreover, as we initialized X and Y from uniform distributions, the predicted values will be real numbers. We round the final output before returning the predicted matrix. Note that this is a very simplistic implementation and a lot of complexity has been kept out for ease of understanding.
Hence, this may result in the predicted matrix containing values greater than 5. For the current scenario, it is safe to treat values above the maximum scale of 5 as equivalent to 5 (and similarly for values less than 0). We encourage the reader to fine-tune the code to handle such cases.

Setting α to 0.0002, β to 0.02, K (that is, latent features) to 2, and epoch to 1000, let us see a sample run of our code with the overall error threshold set to 0.001:

# load raw ratings from csv
raw_ratings <- read.csv("product_ratings.csv")

# convert columnar data to sparse ratings matrix
ratings_matrix <- data.matrix(raw_ratings)

# number of rows in ratings
rows <- nrow(ratings_matrix)

# number of columns in ratings matrix
columns <- ncol(ratings_matrix)

# latent features
K <- 2

# User-Feature Matrix
X <- matrix(runif(rows*K), nrow=rows, byrow=TRUE)

# Item-Feature Matrix
Y <- matrix(runif(columns*K), nrow=columns, byrow=TRUE)

# iterations
epoch <- 10000

# rate of descent
alpha <- 0.0002

# regularization constant
beta <- 0.02

pred.matrix <- mf_based_ucf(ratings_matrix, X, Y, K, epoch = epoch)

# setting column names
colnames(pred.matrix) <- c("iPhone.4", "iPhone.5s", "Nexus.5", "Moto.X", "Moto.G", "Nexus.6", "One.Plus.One")

The preceding lines of code utilize the functions explained earlier to prepare the recommendation model. The predicted ratings, or output matrix, look like the following:

Predicted ratings matrix

Result interpretation

Let us do a quick visual inspection to see how good or bad our predictions have been. Consider users 1 and 3 as our training samples. From the input dataset, we can clearly see that user 1 has given high ratings to iPhones while user 3 has done the same for Android-based phones. The following side-by-side comparison shows that our algorithm has predicted values close enough to the actual values:

Ratings by user 1

Let us see the ratings of user 3 in the following screenshot:

Ratings by user 3

Now that we have our ratings matrix with updated values, we are ready to recommend products to users. It is common sense to show only the products which the user hasn't rated yet. The right set of recommendations will also enable the seller to pitch the products which have a high probability of being purchased by the user. The usual practice is to return a list of the top N items from the unrated list of products for each user. The user in consideration is usually termed the active user. Let us consider user 6 as our active user. This user has only rated Nexus 6, One Plus One, Nexus 5, and iPhone 4, in that order of rating, that is, Nexus 6 was rated highest and iPhone 4 was rated the least. Getting a list of the top 2 recommended phones for such a customer using our algorithm would result in Moto X and Moto G (very rightly indeed, do you see why?). Thus, we built a recommender engine smart enough to recommend the right mobile phones to an Android fanboy and saved the world from yet another catastrophe! Data to the rescue!

This simple implementation of a recommender engine using matrix factorization gave us a flavor of how such a system actually works. Next, let us get into some real-world action using recommender engines.

Production ready recommender engines

In this article so far, we have learnt about recommender engines in detail and even developed one from scratch (using matrix factorization). Through all this, it is clearly evident how widespread the application of such systems is.
Thus, we built a recommender engine smart enough to recommend the right mobile phones to an Android fanboy and saved the world from yet another catastrophe! Data to the rescue! This simple implementation of a recommender engine using matrix factorization gave us a flavor of how such a system actually works. Next, let us get into some real-world action using recommender engines.

Production ready recommender engines

In this article so far, we have learnt about recommender engines in detail and even developed one from scratch (using matrix factorization). Through all this, it is clearly evident how widespread the application of such systems is.

E-commerce websites (or, for that matter, any popular technology platform) out there today have tonnes of content to offer. Not only that, but the number of users is also huge. In such a scenario, where thousands of users are browsing/buying stuff simultaneously across the globe, providing recommendations to them is a task in itself. To complicate things even further, a good user experience (response times, for example) can create a big difference between two competitors. These are live examples of production systems handling millions of customers day in and day out.

Fun Fact
Amazon.com is one of the biggest names in the e-commerce space with 244 million active customers. Imagine the amount of data being processed to provide recommendations to such a huge customer base browsing through millions of products!
Source: http://www.amazon.com/b?ie=UTF8&node=8445211011

In order to provide a seamless capability for use in such platforms, we need highly optimized libraries and hardware. For a recommender engine to handle thousands of users simultaneously every second, R has a robust and reliable framework called recommenderlab. Recommenderlab is a widely used R extension designed to provide a robust foundation for recommender engines. The focus of this library is to provide efficient handling of data, availability of standard algorithms, and evaluation capabilities.

In this section, we will be using recommenderlab to handle a considerably large dataset for recommending items to users. We will also use the evaluation functions from recommenderlab to see how good or bad our recommendation system is. These capabilities will help us build a production ready recommender system similar (or at least closer) to what many online applications such as Amazon or Netflix use.

The dataset used in this section contains ratings for 100 items as rated by 5,000 users. The data has been anonymised and the product names have been replaced by product IDs. The rating scale used is 1 to 5, with 1 being the worst, 5 being the best, and 0 representing unrated items or missing ratings.

To build a recommender engine using recommenderlab for a production ready system, the following steps are to be performed:

Extract, transform, and analyze the data.
Prepare a recommendation model and generate recommendations.
Evaluate the recommendation model.

We will look at all these steps in the following subsections.

Extract, transform, and analyze

As in the case of any data-intensive (particularly machine learning) application, the first and foremost step is to get the data, understand/explore it, and then transform it into the format required by the algorithm deemed fit for the current application. For our recommender engine using the recommenderlab package, we will first load the data from the csv file described in the previous section and then explore it using various R functions.

# Load recommenderlab library
library("recommenderlab")

# Read dataset from csv file
raw_data <- read.csv("product_ratings_data.csv")

# Create rating matrix from data
ratings_matrix <- as(raw_data, "realRatingMatrix")

# view transformed data
image(ratings_matrix[1:6, 1:10])

The preceding section of code loads the recommenderlab package and then uses the standard utility function to read the product_ratings_data.csv file. For the exploratory as well as further steps, we need the data to be transformed into the user-item ratings matrix format (as described in the Core concepts and definitions section).
The as(<data>,<type>) utility converts the csv data into the required ratings matrix format. The csv file contains data in the format shown in the following screenshot. Each row contains a user's rating for a specific product. The column headers are self-explanatory.

Product ratings data

The realRatingMatrix conversion transforms the data into a matrix as shown in the following image. The users are depicted as rows while the columns represent the products. Ratings are represented using a gradient scale, where white represents a missing/unrated rating while black denotes a rating of 5/best.

Ratings matrix representation of our data

Now that we have the data in our environment, let us explore some of its characteristics and see if we can decipher some key patterns. First of all, we extract a representative sample from our main dataset (refer to the screenshot Product ratings data) and analyse it for:

Average rating score for our user population
Spread/distribution of item ratings across the user population
Number of items rated per user

The following lines of code help us explore our dataset sample and analyse the points mentioned previously:

# Extract a sample from ratings matrix
sample_ratings <- sample(ratings_matrix, 1000)

# Get the mean product ratings as given by the first user
rowMeans(sample_ratings[1, ])

# Get distribution of item ratings
hist(getRatings(sample_ratings), breaks = 100,
     xlab = "Product Ratings", main = "Histogram of Product Ratings")

# Get distribution of normalized item ratings
hist(getRatings(normalize(sample_ratings)), breaks = 100,
     xlab = "Normalized Product Ratings", main = "Histogram of Normalized Product Ratings")

# Number of items rated per user
hist(rowCounts(sample_ratings), breaks = 50,
     xlab = "Number of Products", main = "Histogram of Product Count Distribution")

We extract a sample of 1,000 users from our dataset for exploration purposes. The mean of the product ratings given by the first user in our sample is 2.055. This tells us that this user either hasn't seen/rated many products or usually rates products pretty low. To get a better idea of how the users rate products, we generate a histogram of the item rating distribution. This distribution peaks around the middle, that is, 3. The histogram is shown next:

Histogram for ratings distribution

The histogram shows that the ratings are normally distributed around the mean, with low counts for products with very high or very low ratings. Finally, we check the spread of the number of products rated by the users. We prepare a histogram which shows this spread:

Histogram of number of rated products

The preceding histogram shows that there are many users who have rated 70 or more products, as well as many users who have rated all 100 products. The exploration step helps us get an idea of what our data looks like, the way the users generally rate the products, and how many products are being rated.

Model preparation and prediction

We have the data in our R environment, transformed into the ratings matrix format. In this section, we are interested in preparing a recommender engine based upon user-based collaborative filtering. We will be using similar terminology as described in the previous sections. Recommenderlab provides straightforward utilities to learn and prepare a model for building recommender engines. We prepare our model based upon a sample of just 1,000 users.
This way, we can use this model to predict the missing ratings for the rest of the users in our ratings matrix. The following lines of code utilize the first thousand rows for learning the model:

# Create 'User Based collaborative filtering' model
ubcf_recommender <- Recommender(ratings_matrix[1:1000], "UBCF")

"UBCF" in the preceding code signifies user-based collaborative filtering. Recommenderlab also provides other algorithms, such as IBCF (Item-Based Collaborative Filtering), PCA (Principal Component Analysis), and others as well. After preparing the model, we use it to predict the ratings for the 1,010th and 1,011th users in the system. Recommenderlab also requires us to mention the number of items to be recommended to the users (in order of preference, of course). For the current case, we mention 5 as the number of items to be recommended.

# Predict list of products which can be recommended to given users
recommendations <- predict(ubcf_recommender,
                   ratings_matrix[1010:1011], n=5)

# show recommendations in the form of a list
as(recommendations, "list")

The preceding lines of code generate two lists, one for each of the users. Each element in these lists is a product for recommendation. The model predicted that for user 1,010, product prod_93 should be recommended as the top-most product, followed by prod_79, and so on.

# output generated by the model
[[1]]
[1] "prod_93" "prod_79" "prod_80" "prod_83" "prod_89"

[[2]]
[1] "prod_80" "prod_85" "prod_87" "prod_75" "prod_79"

Recommenderlab is a robust platform which is optimized to handle large datasets. With a few lines of code, we were able to load the data, learn a model, and even recommend products to the users in virtually no time. Compare this with the basic recommender engine we developed using matrix factorization, which involved many more lines of code, apart from the obvious difference in performance.

Model evaluation

We have successfully prepared a model and used it for predicting and recommending products to the users in our system. But what do we know about the accuracy of our model? To evaluate the prepared model, recommenderlab has handy and easy-to-use utilities. Since we need to evaluate our model, we need to split our data into training and test datasets. Also, recommenderlab requires us to mention the number of items to be given to the recommender for each test user (it withholds the rest and uses them for computing the error). For the current case, we will use 500 users to prepare an evaluation model. The model will be based upon a 90-10 training-testing split, with 15 items given per test user.

# Evaluation scheme
eval_scheme <- evaluationScheme(ratings_matrix[1:500],
                      method="split", train=0.9, given=15)

# View the evaluation scheme
eval_scheme

# Training model
training_recommender <- Recommender(getData(eval_scheme,
                       "train"), "UBCF")

# Predictions on the test dataset
test_rating <- predict(training_recommender,
               getData(eval_scheme, "known"), type="ratings")

# Error
error <- calcPredictionAccuracy(test_rating,
                   getData(eval_scheme, "unknown"))

error

We use the evaluation scheme to train our model based upon the UBCF algorithm. The prepared model from the training dataset is used to predict ratings for the given items. We finally use the method calcPredictionAccuracy to calculate the error in predicting the ratings between the known and unknown components of the test set.
For our case, we get an output as follows: the generated output mentions the values for RMSE (root mean squared error), MSE (mean squared error), and MAE (mean absolute error). For RMSE in particular, the predicted ratings deviate from the actual values by 1.162 (note that the values might deviate slightly across runs due to various factors such as sampling, iterations, and so on). This evaluation will make more sense when the outcomes from different CF algorithms are compared. For evaluating UBCF, we use IBCF as a comparator. The following few lines of code help us prepare an IBCF based model and test the ratings, which can then be compared using the calcPredictionAccuracy utility:

# Training model using IBCF
training_recommender_2 <- Recommender(getData(eval_scheme,
                                     "train"), "IBCF")

# Predictions on the test dataset
test_rating_2 <- predict(training_recommender_2,
                  getData(eval_scheme, "known"),
                type="ratings")

error_compare <- rbind(calcPredictionAccuracy(test_rating,
                getData(eval_scheme, "unknown")),
                       calcPredictionAccuracy(test_rating_2,
                getData(eval_scheme, "unknown")))

rownames(error_compare) <- c("User Based CF","Item Based CF")

The comparative output shows that UBCF outperforms IBCF with lower values of RMSE, MSE, and MAE. Similarly, we can use the other algorithms available in recommenderlab to test/evaluate our models. We encourage the user to try out a few more and see which algorithm has the least error in predicted ratings.

Summary

In this article, we continued our pursuit of using machine learning in the field of e-commerce to enhance sales and overall user experience. In this article, we accounted for the human factor and looked into recommendation engines based upon user behavior. We started off by understanding what recommendation systems are and how they are classified into user-based, content-based, and hybrid recommender systems. We touched upon the problems associated with recommender engines in general. Then we dived deep into the specifics of Collaborative Filters and discussed the math around prediction and similarity measures. After getting our basics straight, we moved on to building a recommender engine of our own from scratch. We utilized matrix factorization to build a recommender engine step by step using a small dummy dataset. We then moved on to building a production ready recommender engine using R's popular library called recommenderlab. We used user-based CF as our core algorithm to build a recommendation model upon a bigger dataset containing ratings for 100 products by 5,000 users. We closed our discussion by evaluating our recommendation model using recommenderlab's utility methods.

Resources for Article:

Further resources on this subject:
Machine Learning with R [article]
Introduction to Machine Learning with R [article]
Training and Visualizing a neural network with R [article]

article-image-making-app-react-and-material-design
Soham Kamani
21 Mar 2016
7 min read
Save for later

Making an App with React and Material Design

Soham Kamani
21 Mar 2016
7 min read
There has been much progression in the hybrid app development space, and also in React.js. Currently, almost all hybrid apps use cordova to build and run web applications on their platform of choice. Although learning React can be a bit of a steep curve, the benefit you get is that you are forced to make your code more modular, and this leads to huge long-term gains. This is great for developing applications for the browser, but when it comes to developing mobile apps, most web apps fall short because they fail to create the "native" experience that so many users know and love. Implementing these features on your own (through playing around with CSS and JavaScript) may work, but it's a huge pain for even something as simple as a material-design-oriented button.

Fortunately, there is a library of react components to help us out with getting the look and feel of material design in our web application, which can then be ported to a mobile to get a native look and feel. This post will take you through all the steps required to build a mobile app with react and then port it to your phone using cordova.

Prerequisites and dependencies

Globally, you will require cordova, which can be installed by executing this line:

npm install -g cordova

Now that this is done, you should make a new directory for your project and set up a build environment to use es6 and jsx. Currently, webpack is the most popular build system for react, but if that's not according to your taste, there are many more build systems out there. Once you have your project folder set up, install react as well as all the other libraries you would be needing:

npm init
npm install --save react react-dom material-ui react-tap-event-plugin

Making your app

Once we're done, the app should look something like this:

If you just want to get your hands dirty, you can find the source files here.

Like all web applications, your app will start with an index.html file:

<html>
<head>
  <title>My Mobile App</title>
</head>
<body>
  <div id="app-node">
  </div>
  <script src="bundle.js" ></script>
</body>
</html>

Yup, that's it. If you are using webpack, your CSS will be included in the bundle.js file itself, so there's no need to put "style" tags either. This is the only HTML you will need for your application.

Next, let's take a look at index.js, the entry point to the application code:

//index.js
import React from 'react';
import ReactDOM from 'react-dom';
import App from './app.jsx';

const node = document.getElementById('app-node');

ReactDOM.render(
  <App/>,
  node
);

What this does is grab the main App component and attach it to the app-node DOM node. Drilling down further, let's look at the app.jsx file:

//app.jsx
'use strict';

import React from 'react';
import AppBar from 'material-ui/lib/app-bar';
import MyTabs from './my-tabs.jsx';

let App = React.createClass({
  render : function(){
    return (
      <div>
        <AppBar title="My App" />
        <MyTabs />
      </div>
    );
  }
});

module.exports = App;

Following react's philosophy of structuring our code, we can roughly break our app down into two parts:

The title bar
The tabs below

The title bar is more straightforward and directly fetched from the material-ui library. All we have to do is supply a "title" property to the AppBar component.
MyTabs is another component that we have made, put in a different file because of the complexity:

//my-tabs.jsx
'use strict';

import React from 'react';
import Tabs from 'material-ui/lib/tabs/tabs';
import Tab from 'material-ui/lib/tabs/tab';
import Slider from 'material-ui/lib/slider';
import Checkbox from 'material-ui/lib/checkbox';
import DatePicker from 'material-ui/lib/date-picker/date-picker';
import injectTapEventPlugin from 'react-tap-event-plugin';

injectTapEventPlugin();

const styles = {
  headline: {
    fontSize: 24,
    paddingTop: 16,
    marginBottom: 12,
    fontWeight: 400
  }
};

const TabsSimple = React.createClass({
  render: () => (
    <Tabs>
      <Tab label="Item One">
        <div>
          <h2 style={styles.headline}>Tab One Template Example</h2>
          <p> This is the first tab. </p>
          <p> This is to demonstrate how easy it is to build mobile apps with react </p>
          <Slider name="slider0" defaultValue={0.5}/>
        </div>
      </Tab>
      <Tab label="Item 2">
        <div>
          <h2 style={styles.headline}>Tab Two Template Example</h2>
          <p> This is the second tab </p>
          <Checkbox name="checkboxName1" value="checkboxValue1" label="Installed Cordova"/>
          <Checkbox name="checkboxName2" value="checkboxValue2" label="Installed React"/>
          <Checkbox name="checkboxName3" value="checkboxValue3" label="Built the app"/>
        </div>
      </Tab>
      <Tab label="Item 3">
        <div>
          <h2 style={styles.headline}>Tab Three Template Example</h2>
          <p> Choose a Date:</p>
          <DatePicker hintText="Select date"/>
        </div>
      </Tab>
    </Tabs>
  )
});

module.exports = TabsSimple;

This file has quite a lot going on, so let's break it down step by step:

We import all the components that we're going to use in our app. This includes tabs, sliders, checkboxes, and datepickers.
injectTapEventPlugin is a plugin that we need in order to get tab switching to work.
We decide the style used for our tabs.
Next, we make our Tabs react component, which consists of three tabs:
The first tab has some text along with a slider.
The second tab has a group of checkboxes.
The third tab has a pop-up datepicker.

Each component has a few keys, which are specific to it (such as the initial value of the slider, the value reference of the checkbox, or the placeholder for the datepicker). There are a lot more properties you can assign, which are specific to each component.

Building your App

For building on Android, you will first need to install the Android SDK. Now that we have all the code in place, all that is left is building the app. For this, make a new directory, start a new cordova project, and add the Android platform, by running the following on your terminal:

mkdir my-cordova-project
cd my-cordova-project
cordova create .
cordova platform add android

Once the installation is complete, build the code we just wrote previously. If you are using the same build system as the source code, you will have only two files, that is, index.html and bundle.min.js. Delete all the files that are currently present in the www folder of your cordova project and copy those two files there instead. You can check whether your app is working on your computer by running cordova serve and going to the appropriate address on your browser. If all is well, you can build and deploy your app:

cordova build android
cordova run android

This will build and install the app on your Android device (provided it is in debug mode and connected to your computer). Similarly, you can build and install the same app for iOS or windows (you may need additional tools such as XCode or .NET for iOS or Windows). You can also use any other framework to build your mobile app.
The Angular framework also comes with its own set of material design components.

About the Author

Soham Kamani is a full-stack web developer and electronics hobbyist. He is especially interested in JavaScript, Python, and IoT.
article-image-delegate-pattern-limitations-swift
Anthony Miller
18 Mar 2016
5 min read
Save for later

Delegate Pattern Limitations in Swift

Anthony Miller
18 Mar 2016
5 min read
If you've ever built anything using UIKit, then you are probably familiar with the delegate pattern. The delegate pattern is used frequently throughout Apple's frameworks and many open source libraries you may come in contact with. But many times, it is treated as a one-size-fits-all solution for problems that it is just not suited for. This post will describe the major shortcomings of the delegate pattern.

Note: This article assumes that you have a working knowledge of the delegate pattern. If you would like to learn more about the delegate pattern, see The Swift Programming Language - Delegation.

1. Too Many Lines!

Implementation of the delegate pattern can be cumbersome. Most experienced developers will tell you that less code is better code, and the delegate pattern does not really allow for this. To demonstrate, let's try implementing a new view controller that has a delegate using the least amount of lines possible.

First, we have to create a view controller and give it a property for its delegate:

class MyViewController: UIViewController {
    var delegate: MyViewControllerDelegate?
}

Then, we define the delegate protocol.

protocol MyViewControllerDelegate {
    func foo()
}

Now we have to implement the delegate. Let's make another view controller that presents a MyViewController:

class DelegateViewController: UIViewController {

    func presentMyViewController() {
        let myViewController = MyViewController()
        presentViewController(myViewController, animated: false, completion: nil)
    }
}

Next, our DelegateViewController needs to conform to the delegate protocol:

class DelegateViewController: UIViewController, MyViewControllerDelegate {

    func presentMyViewController() {
        let myViewController = MyViewController()
        presentViewController(myViewController, animated: false, completion: nil)
    }

    func foo() {
        /// Respond to the delegate method.
    }
}

Finally, we can make our DelegateViewController the delegate of MyViewController:

class DelegateViewController: UIViewController, MyViewControllerDelegate {

    func presentMyViewController() {
        let myViewController = MyViewController()
        myViewController.delegate = self
        presentViewController(myViewController, animated: false, completion: nil)
    }

    func foo() {
        /// Respond to the delegate method.
    }
}

That's a lot of boilerplate code that is repeated every time you want to create a new delegate. This opens you up to a lot of room for errors. In fact, the above code has a pretty big error already that we are going to fix now.

2. No Non-Class Type Delegates

Whenever you create a delegate property on an object, you should use the weak keyword. Otherwise, you are likely to create a retain cycle. Retain cycles are one of the most common ways to create memory leaks and can be difficult to track down. Let's fix this by making our delegate weak:

class MyViewController: UIViewController {
    weak var delegate: MyViewControllerDelegate?
}

This causes another problem though. Now we are getting a build error from Xcode!

'weak' cannot be applied to non-class type 'MyViewControllerDelegate'; consider adding a class bound.

This is because you can't make a weak reference to a value type, such as a struct or an enum, so in order to use the weak keyword here, we have to guarantee that our delegate is going to be a class. Let's take Xcode's advice here and add a class bound to our protocol:

protocol MyViewControllerDelegate: class {
    func foo()
}

Well, now everything builds just fine, but we have another issue. Now your delegate must be an object (sorry structs and enums!).
You are now creating more constraints on what can conform to your delegate. The whole point of the delegate pattern is to allow an unknown "something" to respond to the delegate events. We should be putting as few constraints as possible on our delegate object, which brings us to the next issue with the delegate pattern.

3. Optional Delegate Methods

In pure Swift, protocols don't have optional functions. This means your delegate must implement every method in the delegate protocol, even if it is irrelevant in your case. For example, you may not always need to be notified when a user taps a cell in a UITableView. There are ways to get around this though.

In Swift 2.0+, you can make a protocol extension on your delegate protocol that contains a default implementation for protocol methods that you want to make optional. Let's make a new optional method on our delegate protocol using this method:

protocol MyViewControllerDelegate: class {
    func foo()
    func optionalFunction()
}

extension MyViewControllerDelegate {
    func optionalFunction() { }
}

This adds even more unnecessary code. It isn't really clear what the intention of this extension is unless you understand what's going on already, and there is no way to explicitly show that this method is optional.

Alternatively, if you mark your protocol as @objc, you can use the optional keyword in your function declaration. The problem here is that now your delegate must be an Objective-C object. Just like our last example, this is creating additional constraints on your delegate, and this time they are even more restrictive.

4. There Can Be Only One

The delegate pattern only allows for one delegate to respond to events. This may be just fine for some situations, but if you need multiple objects to be notified of an event, the delegate pattern may not work for you. Another common scenario you may come across is when you need different objects to be notified of different delegate events.

The delegate pattern can be a very useful tool, which is why it is so widely used, but recognizing the limitations that it creates is important when you are deciding whether it is the right solution for any given problem.

About the author

Anthony Miller is the lead iOS developer at App-Order in Las Vegas, Nevada, USA. He has written and released numerous apps on the App Store and is an avid open source contributor. When he's not developing, Anthony loves board games, line-dancing, and frequent trips to Disneyland.

article-image-neutron-api-basics
Packt
18 Mar 2016
13 min read
Save for later

Neutron API Basics

Packt
18 Mar 2016
13 min read
In this article by James Denton, the author of the book OpenStack Networking Essentials, you can see that Neutron is a virtual networking service that allows users to define network connectivity and IP addressing for instances and other cloud resources using an application programming interface (API). The Neutron API is made up of core elements that define basic network architectures and extensions that extend base functionality. Neutron accomplishes this by virtue of its data model, which consists of networks, subnets, and ports. These objects help define characteristics of the network in an easily storable format.

(For more resources related to this topic, see here.)

These core elements are used to build a logical network data model using information that corresponds to layers 1 through 3 of the OSI model, shown in the following screenshot:

For more information on the OSI model, check out the Wikipedia article at https://en.wikipedia.org/wiki/OSI_model.

Neutron uses plugins and drivers to identify network features and construct the virtual network infrastructure based on information stored in the database. A core plugin, such as the Modular Layer 2 (ML2) plugin included with Neutron, implements the core Neutron API and is responsible for adapting the logical network described by networks, ports, and subnets into something that can be implemented by the L2 agent and IP address management system running on the hosts. The extension API, provided by service plugins, allows users to manage the following resources, among others:

Security groups
Quotas
Routers
Firewalls
Load balancers
Virtual private networks

Neutron's extensibility means that new features can be implemented in the form of extensions and plugins that extend the API without requiring major changes. This allows vendors to introduce features and functionality that would otherwise not be available with the base API. The following diagram demonstrates at a high level how the Neutron API server interacts with the various plugins and agents responsible for constructing the virtual and physical network across the cloud:

The previous diagram demonstrates the interaction between the Neutron API service, Neutron plugins and drivers, and services such as the L2 and L3 agents. As network actions are performed by users via the API, the Neutron server publishes messages to the message queue that are consumed by agents. L2 agents build and maintain the virtual network infrastructure, while L3 agents are responsible for building and maintaining Neutron routers and associated functionality.

The Neutron API specifications can be found on the OpenStack wiki at https://wiki.openstack.org/wiki/Neutron/APIv2-specification. In the next few sections, we will look at some of the core elements of the API and the data models used to represent those elements.

Networks

A network is the central object of the Neutron v2.0 API data model and describes an isolated L2 segment. In a traditional infrastructure, machines are connected to switch ports that are often grouped together into virtual local area networks (VLANs) identified by unique IDs. Machines in the same network or VLAN can communicate with one another but cannot communicate with machines in other VLANs without the use of a router. The following diagram demonstrates how networks are isolated from one another in a traditional infrastructure:

Neutron network objects have attributes that describe the network type and the physical interface used for traffic.
The attributes also describe the segmentation ID used to differentiate traffic between other networks connected to virtual switches on the underlying host. The following diagram shows how a Neutron network describes various Layer 1 and Layer 2 attributes: Traffic between instances on different hosts requires underlying connectivity between the hosts. This means that the hosts must reside on the same physical switching infrastructure so that VLAN-tagged traffic can pass between them. Traffic between hosts can also be encapsulated using L2-in-L3 technologies such as GRE or VXLAN. Neutron supports multiple L2 methods of segmenting traffic, including using 802.1q VLANs, VXLANs, GRE, and more, depending on the plugin and configured drivers and agents. Devices in the same network are in the same broadcast domain, even though they may reside on different hosts and attach to different virtual switches. Neutron network attributes are very important in defining how traffic between virtual machine instances should be forwarded between hosts. Network attributes The following table describes base attributes associated with network objects, and more details can be found at the Neutron API specifications wiki referenced earlier in this article: Attribute Type Required Default Notes id uuid-str N/A Auto generated The UUID for the network name string no None The human-readable name for the network admin_state_up boolean no True The administrative state of the network status string N/A Null Indicates whether the network is currently operational subnets list no Empty list The subnets associated with the network shared boolean no False Specifies whether the network can be accessed by any tenant tenant_id uuid-str no N/A The owner of the network Networks are typically associated with tenants or projects and are usable by any user that is a member of the same tenant or project. Networks can also be shared with all other projects or a subnet of projects using Neutron's role-based access control (RBAC) functionality. Neutron RBAC first became available in the Liberty release of OpenStack. For more information on using the RBAC features, check out my blog at the following URL: https://developer.rackspace.com/blog/A-First-Look-at-RBAC-in-the-Liberty-Release-of-Neutron/. Provider attributes One of the earliest extensions to the Neutron API is known as the provider extension. The provider network extension maps virtual networks to physical networks by adding additional network attributes that describe the network type, segmentation ID, and physical interface. The following table shows various provider attributes and their associated values: Attribute Type Required Options Default Notes provider:network_type string yes vlan,flat,local, vxlan,gre Based on the configuration   provider:segmentation_id int optional Depends on the network type Based on the configuration The segmentation ID range varies among L2 technologies provider:physical_network string optional Provider label Based on the configuration This specifies the physical interface used for traffic (flat or VLAN-only) All networks have provider attributes. However, because provider attributes specify particular network configuration settings and mappings, only users with the admin role can specify them when creating networks. Users without the admin role can still create networks, but the Neutron server, not the user, will determine the type of network created and any corresponding interface or segmentation ID. 
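To make the attribute tables more concrete, the following is a small, hedged example of creating a network directly against the Neutron v2.0 API using Python's requests library. The endpoint URL, token, and provider values are placeholders you would replace with values from your own environment, and only a user with the admin role could include the provider attributes shown:

import json
import requests

# Placeholders - substitute your own Neutron endpoint and Keystone token
NEUTRON_URL = "http://controller:9696/v2.0"
TOKEN = "<auth-token>"

headers = {"X-Auth-Token": TOKEN, "Content-Type": "application/json"}

# The request body mirrors the network attributes described above; the
# provider attributes are optional and admin-only, so non-admin users
# would simply omit them
body = {
    "network": {
        "name": "web-net",
        "admin_state_up": True,
        "shared": False,
        "provider:network_type": "vlan",          # assumed ML2/VLAN deployment
        "provider:physical_network": "physnet1",  # provider label from config
        "provider:segmentation_id": 101           # VLAN ID
    }
}

response = requests.post(NEUTRON_URL + "/networks",
                         headers=headers, data=json.dumps(body))
network = response.json()["network"]
print(network["id"], network["status"])

The same operation is usually performed through the neutron or openstack command-line clients or the Horizon dashboard, both of which issue equivalent API calls under the hood.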
Additional attributes The external-net extension adds an attribute to networks that is used to determine whether or not the network can be used as the external, or gateway, network for a Neutron router. When set to true, the network becomes eligible for use as a floating IP pool when attached to routers. Using the Neutron router-gateway-set command, routers can be attached to external networks. The following table shows the external network attribute and its associated values: Attribute Type Required Default Notes router:external Boolean no false When true, the network is eligible for use as a floating IP pool when attached to a router Subnets In the Neutron data model, a subnet is an IPv4 or IPv6 address block from which IP addresses can be assigned to virtual machine instances and other network resources. Each subnet must have a subnet mask represented by a classless inter-domain routing (CIDR) address and must be associated with a network, as shown here: In the preceding diagram, three isolated VLAN networks each have a corresponding subnet. Instances and other devices cannot be attached to networks without an associated subnet. Instances connected to a network can communicate among one another but are unable to connect to other networks or subnets without the use of a router. The following diagram shows how a Neutron subnet describes various Layer 3 attributes in the OSI model: When creating subnets, users can specify IP allocation pools that limit which addresses in the subnet are available for allocation. Users can also define a custom gateway address, a list of DNS servers, and individual host routes that can be pushed to virtual machine instances using DHCP. The following table describes attributes associated with subnet objects: Attribute Type Required Default Notes id uuid-str n/a Auto Generated The UUID for the subnet network_id uuid-str Yes N/A The UUID of the associated network name string no None The human-readable name for the subnet ip_version int Yes 4 IP version 4 or 6 cidr string Yes N/A The CIDR address representing the IP address range for the subnet gateway_ip string or null no First address in CIDR The default gateway used by devices in the subnet dns_nameservers list(str) no None The DNS name servers used by hosts in the subnet allocation_pools list(dict) no Every address in the CIDR (excluding the gateway) The subranges of the CIDR available for dynamic allocation. tenant_id uuid-str no N/A The owner of the subnet enable_dhcp boolean no True This indicates whether or not DHCP is enabled for the subnet host_routes list(dict) no N/A Additional static routes Ports In the Neutron data model, a port represents a switch port on a logical switch that spans the entire cloud and contains information about the connected device. Virtual machine interfaces (VMIFs) and other network objects, such as router and DHCP server interfaces, are mapped to Neutron ports. The ports define both the MAC address and the IP address to be assigned to the device associated with them. Each port must be associated with a Neutron network. 
The following diagram shows how a port describes various Layer 2 attributes in the OSI model: The following table describes attributes associated with port objects: Attribute Type Required Default Notes id uuid-str n/a Auto generated The UUID for the subnet network_id uuid-str Yes N/A The UUID of the associated network name string no None The human-readable name for the subnet admin_state_up Boolean no True The administrative state of the port status string N/A N/A The current status of the port (for example, ACTIVE, BUILD, or DOWN) mac_address string no Auto generated The MAC address of the port fixed_ips list(dict) no Auto allocated The IP address(es) associated with the port device_id string no None The instance ID or other resource associated with the port device_owner string no None   tenant_id uuid-str no ID of tenant adding resource The owner of the port When Neutron is first installed, no ports exist in the database. As networks and subnets are created, ports may be created for each of the DHCP servers reflected by the logical switch model, seen here: As instances are created, a single port is created for each network interface attached to the instance, as shown here: A port can only be associated with a single network. Therefore, if an instance is connected to multiple networks, it will be associated with multiple ports. As instances and other cloud resources are created, the logical switch may scale to hundreds or thousands of ports over time, as shown in the following diagram: There is no limit to the number of ports that can be created in Neutron. However, quotas exist that limit tenants to a small number of ports that can be created. As the number of Neutron ports scale out, the performance of the Neutron API server and the implementation of networking across the cloud may degrade over time. It's a good idea to keep quotas in place to ensure a high-performing cloud, but the defaults and subsequent quota increases should be kept reasonable. The Neutron workflow In the standard Neutron workflow, networks must be created first, followed by subnets and then ports. The following sections describe the workflows involved with booting and deleting instances. Booting an instance Before an instance can be created, it must be associated with a network that has a corresponding subnet or a precreated port that is associated with a network. The following process documents the steps involved in booting an instance and attaching it to a network: The user creates a network. The user creates a subnet and associates it with the network. The user boots a virtual machine instance and specifies the network. Nova interfaces with Neutron to create a port on the network. Neutron assigns a MAC address and IP address to the newly created port using attributes defined by the subnet. Nova builds the instance's libvirt XML file containing local network bridge and MAC address information and starts the instance. The instance sends a DHCP request during boot, at which point the DHCP server responds with the IP address corresponding to the MAC address of the instance. If multiple network interfaces are attached to an instance, each network interface will be associated with a unique Neutron port and may send out DHCP requests to retrieve their respective network information. 
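The boot workflow can also be driven programmatically. The following sketch uses the openstacksdk Python library to perform steps 1 through 3 of the sequence above; the cloud name, image, and flavor identifiers are assumptions standing in for values from your own clouds.yaml and Glance/Nova inventory:

import openstack

# Connect using credentials defined for the "mycloud" entry in clouds.yaml (assumed name)
conn = openstack.connect(cloud="mycloud")

# Step 1: the user creates a network
network = conn.network.create_network(name="app-net")

# Step 2: the user creates a subnet and associates it with the network
subnet = conn.network.create_subnet(
    network_id=network.id,
    ip_version=4,
    cidr="192.168.10.0/24",
    name="app-subnet")

# Step 3: the user boots an instance and specifies the network; Nova and
# Neutron then handle port creation and address assignment (steps 4 to 7)
server = conn.compute.create_server(
    name="web01",
    image_id="<image-uuid>",     # placeholder
    flavor_id="<flavor-uuid>",   # placeholder
    networks=[{"uuid": network.id}])
server = conn.compute.wait_for_server(server)

# The port created on behalf of the instance is now visible via the API
for port in conn.network.ports(device_id=server.id):
    print(port.mac_address, port.fixed_ips)

Listing the ports by device_id at the end shows the MAC and fixed IP information that Neutron assigned on behalf of the instance, which is the same data the DHCP server hands back during boot.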
How the logical model is implemented Neutron agents are services that run on network and compute nodes and are responsible for taking information described by networks, subnets, and ports and using it to implement the virtual and physical network infrastructure. In the Neutron database, the relationship between networks, subnets, and ports can be seen in the following diagram: This information is then implemented on the compute node by way of virtual network interfaces, virtual switches or bridges, and IP addresses, as shown in the following diagram: In the preceding example, the instance was connected to a network bridge on a compute node that provides connectivity from the instance to the physical network. For now, it's only necessary to know how the data model is implemented into something that is usable. Deleting an instance The following process documents the steps involved in deleting an instance: The user destroys virtual machine instance. Nova interfaces with Neutron to destroy the ports associated with the instances. Nova deletes local instance data. The allocated IP and MAC addresses are returned to the pool. When instances are deleted, Neutron removes all virtual network connections from the respective compute node and removes corresponding port information from the database. Summary In this article, we looked at the basics of the Neutron API and its data model made up of networks, subnets, and ports. These objects were used to describe in a logical way how the virtual network is architected and implemented across the cloud. Resources for Article: Further resources on this subject: Introducing OpenStack Trove[article] Concepts for OpenStack[article] Monitoring OpenStack Networks[article]