Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Events
Videos
Audiobooks
Packt Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds

How-To Tutorials - Data

1229 Articles
article-image-creating-simple-report-using-birt
Packt
23 Oct 2009
5 min read
Save for later

Creating a Simple Report using BIRT

Packt
23 Oct 2009
5 min read
Setting up a Simple Project The first thing we want to do when setting up our simple report project is to define what the project is going to be, and what our first simple report will be. Our first report will be a simple dump of the employees who work for Classic Cars. So, the first thing we need to do is set up a project. To do this, we will use the Navigator. Make sure you have the BIRT report perspective open. Use the following steps to create our project: Open up the Navigator by single-clicking on the Navigator tab. Right-click anywhere in the white-space in the Navigator. Select New from the menu, and under New select Project. From the Dialog screen, select Business Intelligence and Reporting Tools from the list of folders; expand that view, and select Report Project. Then click on the Next button. For the Project name, enter Class_Cars_BIRT_Reports. You can either leave the Use Default Location checkbox checked, or uncheck it and enter a location on your local drive to store this report project. Now, we have a very simple report project in which to store our BIRT reports starting with the first that we are about to create. Creating a Simple Report Now that we have our first project open, we will look at creating our first report. As mentioned earlier, we will create a basic listing report that will display all the information in the employees table. In order to do this, we will use the following steps: Right-click on the Class_Cars_BIRT_Reports project under the Navigator, and choose New and Report. Make sure the Class_Cars_BIRT_Reports project is highlighted in the new report Dialog, and enter in the name as EmployeeList.rptdesign. I chose this name as it is somewhat descriptive of the purpose of the report, which is to display a list of employees. As a rule of thumb, always try to name your reports after the expected output, such as QuarterlyEarningReport.rptdesign, weeklyPayStub.rptdesign, or accountsPayable.rptdesign. On the next screen is a list of different report templates that we can use. We will select  Simple Listing and then click on the Finish button. Go to the Data Explorer, right-click on Data Sources, and choose New Data Source. From the New Data Source Dialog box, select Classic Models Inc. Sample Database and click on the Next button. On the next screen, it will inform you of the driver information. You can ignore this for now and click Finish. Under the Data Explorer, right-click on Data Sets and choose New Data Set. On the next screen, enter the Data Set Name as dsetEmployees, and make sure that our created Data Source is selected in the list of Data sources. You can click Next when this is finished. On the Query Dialog, enter the following query and click Finish: On the next screen, just click OK. This screen is used to edit information about Data Sets, and we will ignore it for now. Now, from the Outline select Data Sets and expand it to show all of the fields. Drag the EMPLOYEENUMBER element over to the Report Designer, and drop it on the cell with the label of Detail Row. This will be the second row and the first column. You will notice that when you do this, the header row also gets an element placed in it called EMPLOYEENUMBER. This is the Header label. Double- click on this cell and it will become highlighted. We can now edit it. Type in "Employee ID". Drag and drop the LASTNAME, FIRSTNAME, and JOBTITLE to the detail cells to the right of the EMPLOYEENUMBER cell. Now, we want to put the header row in bold. Under the Outline, select the Row element located under Body/Table/Header. This will change the Property Editor. Click on Font, and then click on the Bold button. That's it! We have created our first basic report. To see what this report looks like, under the Report Designer pane, click on the Preview tab. This will allow you to get a good idea of what this report will look like. Alternatively you can actually Run the report and get an idea what this report will look like in the BIRT Report Viewer application, by going up to File/View Report/View Report in Web Viewer. This option is also available by right-click on the report design file under the Navigator, and choosing Report followed by Run. Although it may be a simple report, this exercise demonstrated how a report developer can get through the BIRT environment, and how the different elements of the BIRT perspective work together. Summary For a very simple report design, we utilized all of the major areas of the BIRT perspective. We used the Navigator to create a new report project and a new report design, the Data explorer to create out data connection and Data Set, dragged elements from the Outline to the Report Designer to get the data elements into the right place, and used the Property Editor and Outline cooperatively to bold the text in the table header.
Read more
  • 0
  • 0
  • 4572

Packt
04 Apr 2016
20 min read
Save for later

Morphology – Getting Our Feet Wet

Packt
04 Apr 2016
20 min read
In this article by Deepti Chopra, Nisheeth Joshi, and Iti Mathur authors of the book Mastering Natural Language Processing with Python, morphology may be defined as the study of the composition of words using morphemes. A morpheme is the smallest unit of the language that has a meaning. In this article, we will discuss stemming and lemmatizing, creating a stemmer and lemmatizer for non-English languages, developing a morphological analyzer and morphological generator using machine learning tools, creating a search engine, and many other concepts. In brief, this article will include the following topics: Introducing morphology Creating a stemmer and lemmatizer Developing a stemmer for non-English languages Creating a morphological analyzer Creating a morphological generator Creating a search engine (For more resources related to this topic, see here.) Introducing morphology Morphology may be defined as the study of the production of tokens with the help of morphemes. A morpheme is the basic unit of language, which carries a meaning. There are two types of morphemes: stems and affixes (suffixes, prefixes, infixes, and circumfixes). Stems are also referred to as free morphemes since they can even exist without adding affixes. Affixes are referred to as bound morphemes since they cannot exist in a free form, and they always exist along with free morphemes. Consider the word "unbelievable". Here, "believe" is a stem or free morpheme. It can even exist on its own. The morphemes "un" and "able" are affixes or bound morphemes. They cannot exist in s free form but exist together with a stem. There are three kinds of languages, namely isolating languages, agglutinative languages, and inflecting languages. Morphology has different meanings in all these languages. Isolating languages are those languages in which words are merely free morphemes, and they do not carry any tense (past, present, and future) or number (singular or plural) information. Mandarin Chinese is an example of an isolating language. Agglutinative languages are those languages in which small words combine together to convey compound information. Turkish is an example of an agglutinative language. Inflecting languages are languages in which words are broken down into simpler units, but all these simpler units exhibit different meanings. Latin is an example of an inflecting language. There are morphological processes such as inflections, derivations, semi-affixes, combining forms, and cliticization. An inflection refers to transforming a word into a form so that it represents a person, number, tense, gender, case, aspect, and mood. Here, the syntactic category of the token remains the same. In derivation, the syntactic category of word is also changed. Semi-affixes are bound morphemes that exhibit a word-like quality, for example, noteworthy, antisocial, anticlockwise, and so on. Understanding stemmers Stemming may be defined as the process of obtaining a stem from a word by eliminating the affixes from it. For example, in the word "raining", a stemmer would return the root word or the stem word "rain" by removing the affix "ing" from "raining". In order to increase the accuracy of information retrieval, search engines mostly use stemming to get a stem and store it as an index word. Search engines call words with the same meaning synonyms, which may be a kind of query expansion known as conflation. Martin Porter has designed a well-known stemming algorithm known as the Porter Stemming Algorithm. This algorithm is basically designed to replace and eliminate some well-known suffices present in English words. To perform stemming in NLTK, we can simply perform the instantiation of the PorterStemmer class, and then perform stemming by calling the stem method. Let's take a look at the code for stemming using the PorterStemmer class in NLTK: >>> import nltk>>> from nltk.stem import PorterStemmer>>> stemmerporter = PorterStemmer()>>> stemmerporter.stem('working')'work'>>> stemmerporter.stem('happiness')'happi' The PorterStemmer class is trained and has the knowledge of many stems and word forms in the English language. The process of stemming takes place in a series of steps and transforms a word into a shorter word or this word may similar meaning to the root word. The stemmer I interface defines the stem() method, and all stemmers are inherited from this interface. The inheritance diagram is depicted here: Another Stemming algorithm, known as the Lancaster Stemming algorithm, was introduced in Lancaster University. Similar to the PorterStemmer class, the LancasterStemmer class is used in NLTK to implement Lancaster Stemming. Let's consider the following code, which depicts Lancaster stemming in NLTK: >>> import nltk >>> from nltk.stem import LancasterStemmer >>> stemmerlan=LancasterStemmer() >>> stemmerlan.stem('working') 'work' >>> stemmerlan.stem('happiness') 'happy' We can also build our own stemmer in NLTK using RegexpStemmer. This works by accepting a string and eliminates it from the prefix or suffix of a word when a match is found. Let's consider an example of stemming using RegexpStemmer in NLTK: >>> import nltk >>> from nltk.stem import RegexpStemmer >>> stemmerregexp=RegexpStemmer('ing') >>> stemmerregexp.stem('working') 'work' >>> stemmerregexp.stem('happiness') 'happiness' >>> stemmerregexp.stem('pairing') 'pair' We can use RegexpStemmer in cases where stemming cannot be performed using PorterStemmer and LancasterStemmer. The SnowballStemmer class is used to perform stemming in 13 languages other than English. In order to perform stemming using SnowballStemmer, firstly, an instance is created in the language where stemming needs to be performed, and then using the stem() method, stepping is performed. Consider the following example to perform stemming in Spanish and French in NLTK using SnowballStemmer: >>> import nltk >>> from nltk.stem import SnowballStemmer >>> SnowballStemmer.languages ('danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish') >>> spanishstemmer=SnowballStemmer('spanish') >>> spanishstemmer.stem('comiendo') 'com' >>> frenchstemmer=SnowballStemmer('french') >>> frenchstemmer.stem('manger') 'mang' Nltk.stem.api consists of the stemmer I class in which the stem function is performed. Consider the following code present in NLTK, which enables stemming to be performed: Class StemmerI(object): """ It is an interface that helps to eliminate morphological affixes from the tokens and the process is known as stemming. """ def stem(self, token): """ Eliminate affixes from token and stem is returned. """ raise NotImplementedError() Here's the code used to perform stemming using multiple stemmers: >>> import nltk >>> from nltk.stem.porter import PorterStemmer >>> from nltk.stem.lancaster import LancasterStemmer >>> from nltk.stem import SnowballStemmer >>> def obtain_tokens(): With open('/home/p/NLTK/sample1.txt') as stem: tok = nltk.word_tokenize(stem.read()) return tokens >>> def stemming(filtered): stem=[] for x in filtered: stem.append(PorterStemmer().stem(x)) return stem >>> if_name_=="_main_": tok= obtain_tokens() >>> print("tokens is %s")%(tok) >>> stem_tokens= stemming(tok) >>> print("After stemming is %s")%stem_tokens >>> res=dict(zip(tok,stem_tokens)) >>> print("{tok:stemmed}=%s")%(result) Understanding lemmatization Lemmatization is the process in which we transform a word into a form that has a different word category. The word formed after lemmatization is entirely different from what it was initially. Consider an example of lemmatization in NLTK: >>> import nltk >>> from nltk.stem import WordNetLemmatizer >>> lemmatizer_output=WordNetLemmatizer() >>> lemmatizer_output.lemmatize('working') 'working' >>> lemmatizer_output.lemmatize('working',pos='v') 'work' >>> lemmatizer_output.lemmatize('works') 'work' WordNetLemmatizer may be defined as a wrapper around the so-called WordNet corpus, and it makes use of the morphy() function present in WordNetCorpusReader to extract a lemma. If no lemma is extracted, then the word is only returned in its original form. For example, for 'works', the lemma that is returned is in the singular form 'work'. This code snippet illustrates the difference between stemming and lemmatization: >>> import nltk >>> from nltk.stem import PorterStemmer >>> stemmer_output=PorterStemmer() >>> stemmer_output.stem('happiness') 'happi' >>> from nltk.stem import WordNetLemmatizer >>> lemmatizer_output.lemmatize('happiness') 'happiness' In the preceding code, 'happiness' is converted to 'happi' by stemming it. Lemmatization can't find the root word for 'happiness', so it returns the word "happiness". Developing a stemmer for non-English languages Polyglot is a software that is used to provide models called morfessor models, which are used to obtain morphemes from tokens. The Morpho project's goal is to create unsupervised data-driven processes. Its focuses on the creation of morphemes, which are the smallest units of syntax. Morphemes play an important role in natural language processing. They are useful in automatic recognition and the creation of language. With the help of the vocabulary dictionaries of polyglot, morfessor models on 50,000 tokens of different languages was used. Here's the code to obtain a language table using a polyglot: from polyglot.downloader import downloader print(downloader.supported_languages_table("morph2")) The output obtained from the preceding code is in the form of these languages listed as follows: 1. Piedmontese language 2. Lombard language 3. Gan Chinese 4. Sicilian 5. Scots 6. Kirghiz, Kyrgyz 7. Pashto, Pushto 8. Kurdish 9. Portuguese 10. Kannada 11. Korean 12. Khmer 13. Kazakh 14. Ilokano 15. Polish 16. Panjabi, Punjabi 17. Georgian 18. Chuvash 19. Alemannic 20. Czech 21. Welsh 22. Chechen 23. Catalan; Valencian 24. Northern Sami 25. Sanskrit (Saṁskṛta) 26. Slovene 27. Javanese 28. Slovak 29. Bosnian-Croatian-Serbian 30. Bavarian 31. Swedish 32. Swahili 33. Sundanese 34. Serbian 35. Albanian 36. Japanese 37. Western Frisian 38. French 39. Finnish 40. Upper Sorbian 41. Faroese 42. Persian 43. Sinhala, Sinhalese 44. Italian 45. Amharic 46. Aragonese 47. Volapük 48. Icelandic 49. Sakha 50. Afrikaans 51. Indonesian 52. Interlingua 53. Azerbaijani 54. Ido 55. Arabic 56. Assamese 57. Yoruba 58. Yiddish 59. Waray-Waray 60. Croatian 61. Hungarian 62. Haitian; Haitian Creole 63. Quechua 64. Armenian 65. Hebrew (modern) 66. Silesian 67. Hindi 68. Divehi; Dhivehi; Mald... 69. German 70. Danish 71. Occitan 72. Tagalog 73. Turkmen 74. Thai 75. Tajik 76. Greek, Modern 77. Telugu 78. Tamil 79. Oriya 80. Ossetian, Ossetic 81. Tatar 82. Turkish 83. Kapampangan 84. Venetian 85. Manx 86. Gujarati 87. Galician 88. Irish 89. Scottish Gaelic; Gaelic 90. Nepali 91. Cebuano 92. Zazaki 93. Walloon 94. Dutch 95. Norwegian 96. Norwegian Nynorsk 97. West Flemish 98. Chinese 99. Bosnian 100. Breton 101. Belarusian 102. Bulgarian 103. Bashkir 104. Egyptian Arabic 105. Tibetan Standard, Tib... 106. Bengali 107. Burmese 108. Romansh 109. Marathi (Marāthī) 110. Malay 111. Maltese 112. Russian 113. Macedonian 114. Malayalam 115. Mongolian 116. Malagasy 117. Vietnamese 118. Spanish; Castilian 119. Estonian 120. Basque 121. Bishnupriya Manipuri 122. Asturian 123. English 124. Esperanto 125. Luxembourgish, Letzeb... 126. Latin 127. Uighur, Uyghur 128. Ukrainian 129. Limburgish, Limburgan... 130. Latvian 131. Urdu 132. Lithuanian 133. Fiji Hindi 134. Uzbek 135. Romanian, Moldavian, ... The necessary models can be downloaded using the following code: %%bash polyglot download morph2.en morph2.ar [polyglot_data] Downloading package morph2.en to [polyglot_data] /home/rmyeid/polyglot_data... [polyglot_data] Package morph2.en is already up-to-date! [polyglot_data] Downloading package morph2.ar to [polyglot_data] /home/rmyeid/polyglot_data... [polyglot_data] Package morph2.ar is already up-to-date! Consider this example that obtains output from a polyglot: from polyglot.text import Text, Word tokens =["unconditional" ,"precooked", "impossible", "painful", "entered"] for s in tokens: s=Word(s, language="en") print("{:<20}{}".format(s,s.morphemes)) unconditional ['un','conditional'] precooked ['pre','cook','ed'] impossible ['im','possible'] painful ['pain','ful'] entered ['enter','ed'] If tokenization is not performed properly, then we can perform morphological analysis for the process of splitting text into its original constituents: sent="Ihopeyoufindthebookinteresting" para=Text(sent) para.language="en" para.morphemes WordList(['I','hope','you','find','the','book','interesting']) A morphological analyzers Morphological analysis may be defined as the process of obtaining grammatical information about a token given its suffix information. Morphological analysis can be performed in three ways: Morpheme-based morphology (or the item and arrangement approach), Lexeme-based morphology (or the item and process approach), and Word-based morphology (or the word and paradigm approach). A morphological analyzer may be defined as a program that is responsible for the analysis of the morphology of a given input token. It analyzes a given token and generates morphological information, such as gender, number, class, and so on, as an output. In order to perform morphological analysis on a given non-whitespace token, pyEnchant dictionary is used. Consider the following code that performs morphological analysis: >>> import enchant >>> s = enchant.Dict("en_US") >>> tok=[] >>> def tokenize(st1): if not st1:return for j in xrange(len(st1),-1,-1): if s.check(st1[0:j]): tok.append(st1[0:i]) st1=st[j:] tokenize(st1) break >>> tokenize("itismyfavouritebook") >>> tok ['it', 'is', 'my','favourite','book'] >>> tok=[ ] >>> tokenize("ihopeyoufindthebookinteresting") >>> tok ['i','hope','you','find','the','book','interesting'] We can determine the category of a word as follows: Morphological hints: Suffix information helps us to detect the category of a word. For example, -ness and –ment suffixes exist with nouns. Syntactic hints: Contextual information is conducive in determining the category of a word. For example, if we have found a word that has a noun category, then syntactic hints will be useful in determining whether an adjective will appear before the noun or after the noun category. Semantic hints: A semantic hint is also useful in determining the category of a word. For example, if we already know that a word represents the name of a location, then it will fall under the noun category. Open class: This refers to the class of words that are not fixed and each day, their number keeps on increasing whenever a new word is added to the list. Words in an open class are usually in the form of nouns. Prepositions are mostly a closed class. Morphology captured by the part of speech tagset: Part of Speech tagset capture information that helps us to perform morphology. For example, the word 'plays' would appear with the third person and singular noun. Omorfi (the open morphology of Finnish) is a package that has been licensed by version 3 of GNU GPL. It is used for the purpose of performing numerous tasks such as language modeling, morphological analysis, rule-based machine translations, information retrieval, statistical machine translations, morphological segmentation, ontologies, and spell checking and correction. A morphological generator A morphological generator is a program that performs the task of morphological generations. Morphological generation may be considered the opposite of morphological analysis. Here, given the description of a word in terms of its number, category, stem, and so on, the original word is retrieved. For example, if root = go, Part of Speech = verb, tense= present, and if it occurs along with a third person and singular subject, then the morphological generator would generate its surface form, that is, goes. There are many Python-based software that perform morphological analysis and generation. Some of them are as follows: ParaMorfo: This is used to perform the morphological generation and analysis of Spanish and Guarani nouns, adjectives, and verbs HornMorpho: This is used for the morphological generation and analysis of Oromo and Amharic nouns and verbs as well as Tigrinya verbs AntiMorfo: This is used for the morphological generation and analysis of Quechua adjectives, verbs, and nouns as well as Spanish verbs MorfoMelayu: This is used for the morphological analysis of Malay words Other examples of software that is used to perform morphological analysis and generation are as follows: Morph is a morphological generator and analyzer for the English language and the RASP system Morphy is a morphological generator, analyzer, and POS tagger for German Morphisto is a morphological generator and analyzer for German Morfette performs supervised learning (inflectional morphology) for Spanish and French Search engines PyStemmer 1.0.1 consists of Snowball stemming algorithms that are conducive for performing information retrieval tasks and the construction of a search engine. It consists of the Porter stemming algorithm and many other stemming algorithms that are useful for the purpose of performing stemming and information retrieval tasks in many languages, including many European languages. We can construct a vector space search engine by converting the texts into vectors. Here are the steps needed to construct a vector space search engine: Stemming and elimination of stop words. A stemmer is a program that accepts words and converts them into stems. Tokens that have same stem almost have the same meanings. Stop words are also eliminated from text. Consider the following code for the removal of stop words and tokenization: def eliminatestopwords(self,list): " " " Eliminate words which occur often and have not much significance from context point of view. " " " return[ word for word in list if word not in self.stopwords ] def tokenize(self,string): " " " Perform the task of splitting text into stop words and tokens " " " Str=self.clean(str) Words=str.split(" ") return [self.stemmer.stem(word,0,len(word)-1) for word in words] Mapping keywords into vector dimensions.Here's the code required to perform the mapping of keywords into vector dimensions: def obtainvectorkeywordindex(self, documentList): " " " In the document vectors, generate the keyword for the given position of element " " " #Perform mapping of text into strings vocabstring = " ".join(documentList) vocablist = self.parser.tokenise(vocabstring) #Eliminate common words that have no search significance vocablist = self.parser.eliminatestopwords(vocablist) uniqueVocablist = util.removeDuplicates(vocablist) vectorIndex={} offset=0 #Attach a position to keywords that performs mapping with dimension that is used to depict this token for word in uniqueVocablist: vectorIndex[word]=offset offset+=1 return vectorIndex #(keyword:position) Mapping of text strings to vectorsHere, a simple term count model is used. The code to convert text strings into vectors is as follows: def constructVector(self, wordString): # Initialise the vector with 0's Vector_val = [0] * len(self.vectorKeywordIndex) tokList = self.parser.tokenize(tokString) tokList = self.parser.eliminatestopwords(tokList) for word in toklist: vector[self.vectorKeywordIndex[word]] += 1; # simple Term Count Model is used return vector Searching similar documents By finding the cosine of an angle between the vectors of a document, we can prove whether two given documents are similar or not. If the cosine value is 1, then the angle value is 0 degrees and vectors are said to be parallel (this means that documents are related). If the cosine value is 0 and the value of the angle is 90 degrees, then vectors are said to be perpendicular (this means that documents are not related). This is the code to compute the cosine between the text vector using scipy: def cosine(vec1, vec2): """ cosine = ( X * Y ) / ||X|| x ||Y|| """ return float(dot(vec1,vec2) / (norm(vec1) * norm(vec2))) Search keywords We perform the mapping of keywords to a vector space. We construct a temporary text that represents items to be searched and then compare it with document vectors with the help of a cosine measurement. Here is the following code needed to search for the vector space: def searching(self,searchinglist): """ search for text that are matched on the basis of list of items """ askVector = self.buildQueryVector(searchinglist) ratings = [util.cosine(askVector, textVector) for textVector in self.documentVectors] ratings.sort(reverse=True) return ratings The following code can be used to detect languages from a source text: >>> import nltk >>> import sys >>> try: from nltk import wordpunct_tokenize from nltk.corpus import stopwords except ImportError: print( 'Error has occured') #---------------------------------------------------------------------- >>> def _calculate_languages_ratios(text): """ Compute probability of given document that can be written in different languages and give a dictionary that appears like {'german': 2, 'french': 4, 'english': 1} """ languages_ratios = {} ''' nltk.wordpunct_tokenize() splits all punctuations into separate tokens wordpunct_tokenize("I hope you like the book interesting .") [' I',' hope ','you ','like ','the ','book' ,'interesting ','.'] ''' tok = wordpunct_tokenize(text) wor = [word.lower() for word in tok] # Compute occurence of unique stopwords in a text for language in stopwords.fileids(): stopwords_set = set(stopwords.words(language)) words_set = set(words) common_elements = words_set.intersection(stopwords_set) languages_ratios[language] = len(common_elements) # language "score" return languages_ratios #---------------------------------------------------------------- >>> def detect_language(text): """ Compute the probability of given text that is written in different languages and obtain the one that is highest scored. It makes use of stopwords calculation approach, finds out unique stopwords present in a analyzed text. """ ratios = _calculate_languages_ratios(text) most_rated_language = max(ratios, key=ratios.get) return most_rated_language if __name__=='__main__': text = ''' All over this cosmos, most of the people believe that there is an invisible supreme power that is the creator and the runner of this world. Human being is supposed to be the most intelligent and loved creation by that power and that is being searched by human beings in different ways into different things. As a result people reveal His assumed form as per their own perceptions and beliefs. It has given birth to different religions and people are divided on the name of religion viz. Hindu, Muslim, Sikhs, Christian etc. People do not stop at this. They debate the superiority of one over the other and fight to establish their views. Shrewd people like politicians oppose and support them at their own convenience to divide them and control them. It has intensified to the extent that even parents of a new born baby teach it about religious differences and recommend their own religion superior to that of others and let the child learn to hate other people just because of religion. Jonathan Swift, an eighteenth century novelist, observes that we have just enough religion to make us hate, but not enough to make us love one another. The word 'religion' does not have a derogatory meaning - A literal meaning of religion is 'A personal or institutionalized system grounded in belief in a God or Gods and the activities connected with this'. At its basic level, 'religion is just a set of teachings that tells people how to lead a good life'. It has never been the purpose of religion to divide people into groups of isolated followers that cannot live in harmony together. No religion claims to teach intolerance or even instructs its believers to segregate a certain religious group or even take the fundamental rights of an individual solely based on their religious choices. It is also said that 'Majhab nhi sikhata aaps mai bair krna'. But this very majhab or religion takes a very heinous form when it is misused by the shrewd politicians and the fanatics e.g. in Ayodhya on 6th December, 1992 some right wing political parties and communal organizations incited the Hindus to demolish the 16th century Babri Masjid in the name of religion to polarize Hindus votes. Muslim fanatics in Bangladesh retaliated and destroyed a number of temples, assassinated innocent Hindus and raped Hindu girls who had nothing to do with the demolition of Babri Masjid. This very inhuman act has been presented by Taslima Nasrin, a Banglsdeshi Doctor-cum-Writer in her controversial novel 'Lajja' (1993) in which, she seems to utilizes fiction's mass emotional appeal, rather than its potential for nuance and universality. ''' >>> language = detect_language(text) >>> print(language) The preceding code will search for stop words and detect the language of the text, which is English. Summary In this article, we discussed stemming, lemmatization, and morphological analysis and generation. Resources for Article: Further resources on this subject: How is Python code organized[article] Machine learning and Python – the Dream Team[article] Putting the Fun in Functional Python[article]
Read more
  • 0
  • 0
  • 4569

article-image-hbase-administration-performance-tuning
Packt
21 Aug 2012
8 min read
Save for later

HBase Administration, Performance Tuning

Packt
21 Aug 2012
8 min read
Setting up Hadoop to spread disk I/O Modern servers usually have multiple disk devices to provide large storage capacities. These disks are usually configured as RAID arrays, as their factory settings. This is good for many cases but not for Hadoop. The Hadoop slave node stores HDFS data blocks and MapReduce temporary files on its local disks. These local disk operations benefit from using multiple independent disks to spread disk I/O. In this recipe, we will describe how to set up Hadoop to use multiple disks to spread its disk I/O. Getting ready We assume you have multiple disks for each DataNode node. These disks are in a JBOD (Just a Bunch Of Disks) or RAID0 configuration. Assume that the disks are mounted at /mnt/d0, /mnt/d1, …, /mnt/dn, and the user who starts HDFS has write permission on each mount point. How to do it... In order to set up Hadoop to spread disk I/O, follow these instructions: On each DataNode node, create directories on each disk for HDFS to store its data blocks: hadoop$ mkdir -p /mnt/d0/dfs/datahadoop$ mkdir -p /mnt/d1/dfs/data…hadoop$ mkdir -p /mnt/dn/dfs/data Add the following code to the HDFS configuration file (hdfs-site.xml): hadoop@master1$ vi $HADOOP_HOME/conf/hdfs-site.xml <property> <name>dfs.data.dir</name> <value>/mnt/d0/dfs/data,/mnt/d1/dfs/data,...,/mnt/dn/dfs/data</value> </property> Sync the modified hdfs-site.xml file across the cluster: hadoop@master1$ for slave in `cat $HADOOP_HOME/conf/slaves`do rsync -avz $HADOOP_HOME/conf/ $slave:$HADOOP_HOME/conf/done Restart HDFS: hadoop@master1$ $HADOOP_HOME/bin/stop-dfs.shhadoop@master1$ $HADOOP_HOME/bin/start-dfs.sh How it works... We recommend JBOD or RAID0 for the DataNode disks, because you don't need the redundancy of RAID, as HDFS ensures its data redundancy using replication between nodes. So, there is no data loss when a single disk fails. Which one to choose, J BOD or RAID0? You will theoretically get better performance from a JBOD configuration than from a RAID configuration. This is because, in a RAID configuration, you have to wait for the slowest disk in the array to complete before the entire write operation can complete, which makes the average I/O time equivalent to the slowest disk's I/O time. In a JBOD configuration, operations on a faster disk will complete independently of the slower ones, which makes the average I/O time faster than the slowest one. However, enterprise-class RAID cards might make big differences. You might want to benchmark your JBOD and RAID0 configurations before deciding which one to go with. For both JBOD and RAID0 configurations, you will have the disks mounted at different paths. The key point here is to set the dfs.data.dirproperty to all the directories created on each disk. The dfs.data.dirproperty specifies where the DataNode should store its local blocks. By setting it to comma-separated multiple directories, DataNode stores its blocks across all the disks in round robin fashion. This causes Hadoop to efficiently spread disk I/O to all the disks. Warning Do not leave blanks between the directory paths in the dfs.data.dir property value, or it won't work as expected. You will need to sync the changes across the cluster and restart HDFS to apply them. There's more... If you run MapReduce, as MapReduce stores its temporary files on TaskTracker's local file system, you might also like to set up MapReduce to spread its disk I/O: On each TaskTracker node, create directories on each disk for MapReduce to store its intermediate data files: hadoop$ mkdir -p /mnt/d0/mapred/localhadoop$ mkdir -p /mnt/d1/mapred/local…hadoop$ mkdir -p /mnt/dn/mapred/local Add the following to MapReduce's configuration file (mapred-site.xml): hadoop@master1$ vi $HADOOP_HOME/conf/mapred-site.xml <property> <name>mapred.local.dir</name> <value>/mnt/d0/mapred/local,/mnt/d1/mapred/local,...,/mnt/dn/mapred/local</value> </property> Sync the modified mapred-site.xml file across the cluster and restart MapReduce. MapReduce generates a lot of temporary files on TaskTrackers' local disks during its execution. Like HDFS, setting up multiple directories on different disks helps spread MapReduce disk I/O significantly. Using network topology script to make Hadoop rack-aware Hadoop has the concept of "Rack Awareness ". Administrators are able to define the rack of each DataNode in the cluster. Making Hadoop rack-aware is extremely important because: Rack awareness prevents data loss Rack awareness improves network performance In this recipe, we will describe how to make Hadoop rack-aware and why it is important. Getting ready You will need to know the rack to which each of your slave nodes belongs. Log in to the master node as the user who started Hadoop. How to do it... The following steps describe how to make Hadoop rack-aware: Create a topology.sh script and store it under the Hadoop configuration directory. Change the path for topology.data, in line 3, to fit your environment: hadoop@master1$ vi $HADOOP_HOME/conf/topology.sh while [ $# -gt 0 ] ; do nodeArg=$1 exec< /usr/local/hadoop/current/conf/topology.data result="" while read line ; do ar=( $line ) if [ "${ar[0]}" = "$nodeArg" ] ; then result="${ar[1]}" fi done shift if [ -z "$result" ] ; then echo -n "/default/rack " else echo -n "$result " fi done Don't forget to set the execute permission on the script file: hadoop@master1$ chmod +x $HADOOP_HOME/conf/topology.sh Create a topology.data file, as shown in the following snippet; change the IP addresses and racks to fit your environment: hadoop@master1$ vi $HADOOP_HOME/conf/topology.data10.161.30.108 /dc1/rack110.166.221.198 /dc1/rack210.160.19.149 /dc1/rack3 Add the following to the Hadoop core configuration file (core-site.xml): hadoop@master1$ vi $HADOOP_HOME/conf/core-site.xml <property> <name>topology.script.file.name</name> <value>/usr/local/hadoop/current/conf/topology.sh</value> </property> Sync the modified files across the cluster and restart HDFS and MapReduce. Make sure HDFS is now rack-aware. If everything works well, you should be able to find something like the following snippet in your NameNode log file: 2012-03-10 13:43:17,284 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /dc1/rack3/10.160.19.149:50010 2012-03-10 13:43:17,297 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /dc1/rack1/10.161.30.108:50010 2012-03-10 13:43:17,429 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /dc1/rack2/10.166.221.198:50010 Make sure MapReduce is now rack-aware. If everything works well, you should be able to find something like the following snippet in your JobTracker log file: 2012-03-10 13:50:38,341 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /dc1/rack3/ip-10-160-19-149.us-west-1.compute.internal 2012-03-10 13:50:38,485 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /dc1/rack1/ip-10-161-30-108.us-west-1.compute.internal 2012-03-10 13:50:38,569 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /dc1/rack2/ip-10-166-221-198.us-west-1.compute.internal How it works... The following diagram shows the concept of Hadoop rack awareness: Each block of the HDFS files will be replicated to multiple DataNodes, to prevent loss of all the data copies due to failure of one machine. However, if all copies of data happen to be replicated on DataNodes in the same rack, and that rack fails, all the data copies will be lost. So to avoid this, the NameNode needs to know the network topology in order to use that information to make intelligent data replication. As shown in the previous diagram, with the default replication factor of three, two data copies will be placed on the machines in the same rack, and another one will be put on a machine in a different rack. This ensures that a single rack failure won't result in the loss of all data copies. Normally, two machines in the same rack have more bandwidth and lower latency between them than two machines in different racks. With the network topology information, Hadoop is able to maximize network performance by reading data from proper DataNodes. If data is available on the local machine, Hadoop will read data from it. If not, Hadoop will try reading data from a machine in the same rack, and if it is available on neither, data will be read from machines in different racks. In step 1, we create a topology.sh script. The script takes DNS names as arguments and returns network topology (rack) names as the output. The mapping of DNS names to network topology is provided by the topology.data file, which was created in step 2. If an entry is not found in the topology.data file, the script returns /default/rack as a default rack name. Note that we use IP addresses, and not hostnames in the topology. data file. There is a known bug that Hadoop does not correctly process hostnames that start with letters "a" to "f". Check HADOOP-6682 for more details. In step 3, we set the topology.script.file.name property in core-site.xml, telling Hadoop to invoke topology.sh to resolve DNS names to network topology names. After restarting Hadoop, as shown in the logs of steps 5 and 6, HDFS and MapReduce add the correct rack name as a prefix to the DNS name of slave nodes. This indicates that the HDFS and MapReduce rack awareness work well with the aforementioned settings.
Read more
  • 0
  • 0
  • 4569

article-image-customizing-heat-maps-intermediate
Packt
22 Feb 2016
11 min read
Save for later

Customizing heat maps (Intermediate)

Packt
22 Feb 2016
11 min read
This article will help you explore more advanced functions to customize the layout of the heat maps. The main focus lies on the usage of different color palettes, but we will also cover other useful features, such as cell notes that will be used in this recipe. (For more resources related to this topic, see here.) To ensure that our heat maps look good in any situation, we will make use of different color palettes in this recipe, and we will even learn how to create our own. Further, we will add some more extras to our heat maps including visual aids such as cell note labels, which will make them even more useful and accessible as a tool for visual data analysis. The following image shows a heat map with cell notes and an alternative color palette created from the arabidopsis_genes.csv data set: Getting ready Download the 5644OS_03_01.r script and the Arabidopsis_genes.csv data set from your account at http://www.packtpub.com and save it to your hard drive. I recommend that you save the script and data file to the same folder on your hard drive. If you execute the script from a different location to the data file, you will have to change the current R working directory accordingly. The script will check automatically if any additional packages need to be installed in R. How to do it... Execute the following code in R via the 5644OS_03_01.r script and take a look at the PDF file custom_heatmaps.pdf that will be created in the current working directory: ### loading packages if (!require("gplots")) { install.packages("gplots", dependencies = TRUE) library(RColorBrewer) } if (!require("RColorBrewer")) { install.packages("RColorBrewer", dependencies = TRUE) library(RColorBrewer) } ### reading in data gene_data <- read.csv("arabidopsis_genes.csv") row_names <- gene_data[,1] gene_data <- data.matrix(gene_data[,2:ncol(gene_data)]) rownames(gene_data) <- row_names ### setting heatmap.2() default parameters heat2 <- function(...) heatmap.2(gene_data, tracecol = "black", dendrogram = "column", Rowv = NA, trace = "none", margins = c(8,10), density.info = "density", ...) pdf("custom_heatmaps.pdf") ### 1) customizing colors # 1.1) in-built color palettes heat2(col = terrain.colors(n = 1000), main = "1.1) Terrain Colors") # 1.2) RColorBrewer palettes heat2(col = brewer.pal(n = 9, "YlOrRd"), main = "1.2) Brewer Palette") # 1.3) creating own color palettes my_colors <- c(y1 = "#F7F7D0", y2 = "#FCFC3A", y3 = "#D4D40D", b1 = "#40EDEA", b2 = "#18B3F0", b3 = "#186BF0", r1 = "#FA8E8E", r2 = "#F26666", r1 = "#C70404") heat2(col = my_colors, main = "1.3) Own Color Palette") my_palette <- colorRampPalette(c("blue", "yellow", "red"))(n = 1000) heat2(col = my_palette, main = "1.3) ColorRampPalette") # 1.4) gray scale heat2(col = gray(level = (0:100)/100), main ="1.4) Gray Scale") ### 2) adding cell notes fold_change <- 2^gene_data rounded_fold_changes <- round(rounded_fold_changes, 2) heat2(cellnote = rounded, notecex = 0.5, notecol = "black", col = my_palette, main = "2) Cell Notes") ### 3) adding column side colors heat2(ColSideColors = c("red", "gray", "red", rep("green",13)), main = "3) ColSideColors") dev.off() How it works... Primarily, we will be using read.csv() and heatmap.2() to read in data into R and construct our heat maps. In this recipe, however, we will focus on advanced features to enhance our heat maps, such as customizing color and other visual elements: Inspecting the arabidopsis_genes.csv data set: The arabidopsis_genes.csv file contains a compilation of gene expression data from the model plant Arabidopsis thaliana. I obtained the freely available data of 16 different genes as log 2 ratios of target and reference gene from the Arabidopsis eFP Browser (http://bar.utoronto.ca/efp_arabidopsis/). For each gene, expression data of 47 different areas of the plant is available in this data file. Reading the data and converting it into a numeric matrix: We have to convert the data table into a numeric matrix first before we can construct our heat maps: gene_data <- read.csv("arabidopsis_genes.csv") row_names <- gene_data[,1] gene_data <- data.matrix(gene_data[,2:ncol(gene_data)]) rownames(gene_data) <- row_names Creating a customized heatmap.2() function: To reduce typing efforts, we are defining our own version of the heatmap.2() function now, where we will include some arguments that we are planning to keep using throughout this recipe: heat2 <- function(...) heatmap.2(gene_data, tracecol = "black", dendrogram = "column", Rowv = NA, trace = "none", margins = c(8,10), density.info = "density", ...) So, each time we call our newly defined heat2() function, it will behave similar to the heatmap.2() function, except for the additional arguments that we will pass along. We also include a new argument, black, for the tracecol parameter, to better distinguish the density plot in the color key from the background. The built-in color palettes: There are four more color palettes available in the base R that we could use instead of the heat.colors palette: rainbow, terrain.colors, topo.colors, and cm.colors. So let us make use of the terrain.colors color palette now, which will give us a nice color transition from green over yellow to rose: heat2(col = terrain.colors(n = 1000), main = "1.1) Terrain Colors") Every number for the parameter n that is larger than the default value 12 will add additional colors, which will make the transition smoother. A value of 1000 for the n parameter should be more than sufficient to make the transition between the individual colors indistinguishable to the human eye. The following image shows a side-by-side comparison of the heat.colors and terrain.colors color palettes using a different number of color shades: Further, it is also possible to reverse the direction of the color transition. For example, if we want to have a heat.color transition from yellow to red instead of red to yellow in our heat map, we could simply define a reverse function: rev_heat.colors <- function(x) rev(heat.colors(x)) heat2(col = rev_heat.colors(500)) RColorBrewer palettes: A lot of color palettes are available from the RColorBrewer package. To see how they look like, you can type display.brewer.all() into the R command-line after loading the RColorBrewer package. However, in contrast to the dynamic range color palettes that we have seen previously, the RColorBrewer palettes have a distinct number of different colors. So to select all nine colors from the YlOrRd palette, a gradient from yellow to red, we use the following command: heat2(col = brewer.pal(n = 9, "YlOrRd"), main = "1.2) Brewer Palette") The following image gives you a good overview of all the different color palettes that are available from the RColorBrewer package: Creating our own color palettes: Next, we will see how we can create our own color palettes. A whole bunch of different colors are already defined in R. An overview of those colors can be seen by typing colors() into the command line of R. The most convenient way to assign new colors to a color palette is using hex colors (hexadecimal colors). Many different online tools are freely available that allow us to obtain the necessary hex codes. A great example is color picker (http://www.colorpicker.com), which allows us to choose from a rich color table and provides us with the corresponding hex codes. Once we gather all the hexadecimal codes for the colors that we want to use for our color palette, we can assign them to a variable as we have done before with the explicit color names: my_colors <- c(y1 = "#F7F7D0", y2 = "#FCFC3A", y3 = "#D4D40D", b1 = "#40EDEA", b2 = "#18B3F0", b3 = "#186BF0", r1 = "#FA8E8E", r2 = "#F26666", r1 = "#C70404") heat2(col = my_colors, main = "1.3) Own Color Palette") This is a very handy approach for creating a color key with very distinct colors. However, the downside of this method is that we have to provide a lot of different colors if we want to create a smooth color gradient; we have used 1000 different colors for the terrain.color() palette to get a smooth transition in the color key! Using colorRampPalette for smoother color gradients: A convenient approach to create a smoother color gradient is to use the colorRampPalette() function, so we don't have to insert all the different colors manually. The function takes a vector of different colors as an argument. Here, we provide three colors: blue for the lower end of the color key, yellow for the middle range, and red for the higher end. As we did it for the in-built color palettes, such as heat.color, we assign the value 1000 to the n parameter: my_palette <- colorRampPalette(c("blue", "yellow", "red"))(n = 1000) heat2(col = my_palette, main = "1.3) ColorRampPalette") In this case, it is more convenient to use discrete color names over hex colors, since we are using the colorRampPalette() function to create a gradient and do not need all the different shades of a particular color. Grayscales: It might happen that the medium or device that we use to display our heat maps does not support colors. Under these circumstances, we can use the gray palette to create a heat map that is optimized for those conditions. The level parameter of the gray() function takes a vector with values between 0 and 1 as an argument, where 0 represents black and 1 represents white, respectively. For a smooth gradient, we use a vector with 100 equally spaced shades of gray ranging from 0 to 1. heat2(col = gray(level = (0:200)/200), main ="1.4) Gray Scale") We can make use of the same color palettes for the levelplot() function too. It works in a similar way as it did for the heatmap.2() function that we are using in this recipe. However, inside the levelplot() function call, we must use col.regions instead of the simple col, so that we can include a color palette argument. Adding cell notes to our heat map: Sometimes, we want to show a data set along with our heat map. A neat way is to use so-called cell notes to display data values inside the individual heat map cells. The underlying data matrix for the cell notes does not necessarily have to be the same numeric matrix we used to construct our heat map, as long as it has the same number of rows and columns. As we recall, the data we read from arabidopsis_genes.csv resembles log 2 ratios of sample and reference gene expression levels. Let us calculate the fold changes of the gene expression levels now and display them—rounded to two digits after the decimal point—as cell notes on our heat map: fold_change <- 2^gene_data rounded_fold_changes <- round(fold_change, 2) heat2(cellnote = rounded_fold_changes, notecex = 0.5, notecol = "black", col = rev_heat.colors, main = "Cell Notes") The notecex parameter controls the size of the cell notes. Its default size is 1, and every argument between 0 and 1 will make the font smaller, whereas values larger than 1 will make the font larger. Here, we decreased the font size of the cell notes by 50 percent to fit it into the cell boundaries. Also, we want to display the cell notes in black to have a nice contrast to the colored background; this is controlled by the notecol parameter. Row and column side colors: Another approach to pronounce certain regions, that is, rows or columns on the heat map is to make use of row and column side colors. The ColSideColors argument will place a colored box between the dendrogram and heat map that can be used to annotate certain columns. We pass our vector with colors to ColSideColors, where its length must be equal to the number of columns of the heat map. Here, we want to color the first and third column red, the second one gray, and all the remaining 13 columns green: heat2(ColSideColors = c("red", "gray", "red", rep("green", 13)), main = "ColSideColors") You can see in the following image how the column side colors look like when we include the ColSideColors argument as shown previously: Attentive readers may have noticed that the order of colors in the column color box slightly differs from the order of colors we passed as a vector to ColSideColors. We see red two times next to each other, followed by a green and a gray box. This is due to the fact that the columns of our heat map have been reordered by the hierarchical clustering algorithm. Summary To learn more about the similar technology, the following books published by Packt Publishing (https://www.packtpub.com/) are recommended: Instant R Starter (https://www.packtpub.com/big-data-and-business-intelligence/instant-r-starter-instant) Machine Learning with R - Second Edition (https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-r-second-edition) Mastering RStudio – Develop, Communicate, and Collaborate with R (https://www.packtpub.com/application-development/mastering-rstudio-%E2%80%93-develop-communicate-and-collaborate-r) Resources for Article: Further resources on this subject: Data Analysis Using R[article] Big Data Analysis[article] Big Data Analysis (R and Hadoop)[article]
Read more
  • 0
  • 0
  • 4540

article-image-analyzing-social-networks-facebook
Packt
20 Jun 2017
15 min read
Save for later

Analyzing Social Networks with Facebook

Packt
20 Jun 2017
15 min read
In this article by Raghav Bali, Dipanjan Sarkar and Tushar Sharma, the authors of the book Learning Social Media Analytics with R, we got a good flavor of the various aspects related to the most popular social micro-blogging platform, Twitter. In this article, we will look more closely at the most popular social networking platform, Facebook. With more than 1.8 billion monthly active users, over 18 billion dollars annual revenue and record breaking acquisitions for popular products including Oculus, WhatsApp and Instagram have truly made Facebook the core of the social media network today. (For more resources related to this topic, see here.) Before we put Facebook data under the microscope, let us briefly look at Facebook’s interesting origins! Like many popular products, businesses and organizations, Facebook too had a humble beginning. Originally starting off as Mark Zuckerberg’s brainchild in 2004, it was initially known as “Thefacebook” located at thefacebook.com, which was branded as an online social network, connecting university and college students. While this social network was only open to Harvard students in the beginning, it soon expanded within a month by including students from other popular universities. In 2005, the domain facebook.com was finally purchased and “Facebook” extended its membership to employees of companies and organizations for the first time. Finally in 2006, Facebook was finally opened to everyone above 13 years of age and having a valid email address. The following snapshot shows us how the look and feel of the Facebook platform has evolved over the years! Facebook’s evolving look over time While Facebook has a primary website, also known as a web application, it has also launched mobile applications for the major operating systems on handheld devices. In short, Facebook is not just a social network website but an entire platform including a huge social network of connected people and organizations through friends, followers and pages. We will leverage Facebook’s social “Graph API” to access actual Facebook data to perform various analyses. Users, brands, business, news channels, media houses, retail stores and many more are using Facebook actively on a daily basis for producing and consuming content. This generates vast amount of data and a substantial amount of this is available to users through its APIs.  From a social media analytics perspective, this is really exciting because this treasure trove of data with easy to access APIs and powerful open source libraries from R, gives us enormous potential and opportunity to get valuable information from analyzing this data in various ways. We will follow a structured path in this article and cover the following major topics sequentially to ensure that you do not get overwhelmed with too much content at once. Accessing Facebook data Analyzing your personal social network Analyzing an English football social network Analyzing English football clubs’ brand page engagements We will use libraries like Rfacebook, igraph and ggplot2 to retrieve, analyze and visualize data from Facebook. All the following sections of the book assume that you have a Facebook account which is necessary to access data from the APIs and analyze it. In case you do not have an account, do not despair. You can use the data and code files for this article to follow along with the hands-on examples to gain a better understanding of the concepts of social network and engagement analysis.    Accessing Facebook data You will find a lot of content in several books and on the web about various techniques to access and retrieve data from Facebook. There are several official ways of doing this which include using the Facebook Graph API either directly through low level HTTP based calls or indirectly through higher level abstract interfaces belonging to libraries like Rfacebook. Some alternate ways of retrieving Facebook data would be to use registered applications on Facebook like Netvizz or the GetNet application built by Lada Adamic, used in her very popular “Social Network Analysis” course (Unfortunately http://snacourse.com/getnet is not working since Facebook completely changed its API access permissions and privacy settings). Unofficial ways include techniques like web scraping and crawling to extract data. Do note though that Facebook considers this to be a violation of its terms and conditions of accessing data and you should try and avoid crawling Facebook for data especially if you plan to use it for commercial purposes. In this section, we will take a closer look at the Graph API and the Rfacebook package in R. The main focus will be on how you can extract data from Facebook using both of them. Understanding the Graph API To start using the Graph API, you would need to have an account on Facebook to be able to use the API. You can access the API in various ways. You can create an application on Facebook by going to https://developers.facebook.com/apps/ and then create a long-lived OAuth access token using the fbOAuth(…)function from the Rfacebook package. This enables R to make calls to the Graph API and you can also store this token on the disk and load it for future use. An easier way is to create a short-lived token which would let you access the API data for about two hours by going to the Facebook Graph API Explorer page which is available at https://developers.facebook.com/tools/explorer and get a temporary access token from there. The following snapshot depicts how to get an access token for the Graph API from Facebook. Facebook’s Graph API explorer On clicking “Get User Access Token” in the above snapshot, it will present a list of checkboxes with various permissions which you might need for accessing data including user data permissions, events, groups and pages and other miscellaneous permissions. You can select the ones you need and click on the “Get Access Token” button in the prompt. This will generate a new access token the field depicted in the above snapshot and you can directly copy and use it to retrieve data in R. Before going into that, we will take a closer look at the Graph API explorer which directly allows you to access the API from your web browser itself and helps if you want to do some quick exploratory analysis. A part of it is depicted in the above snapshot. The current version of the API when writing this book is v2.8 which you can see in the snapshot beside the GET resource call. Interestingly, the Graph API is so named because Facebook by itself can be considered as a huge social graph where all the information can be classified into the following three categories. Nodes: These are basically users, pages, photos and so on. Nodes indicate a focal point of interest which is connected to other points. Edges: These connect various nodes together forming the core social graph and these connections are based on various relations like friends, followers and so on. Fields: These are specific attributes or properties about nodes, an example would be a user’s address, birthday, name and so on. Like we mentioned before, the API is HTTP based and you can make HTTPGET requests to nodes or edges and all requests are passed to graph.facebook.com to get data. Each node usually has a specific identifier and you can use it for querying information about a node as depicted in the following snippet. GET graph.facebook.com /{node-id} You can also use edge names in addition to the identifier to get information about the edges of the node. The following snippet depicts how you can do the same. GET graph.facebook.com /{node-id}/{edge-name} The following snapshot shows us how we can get information about our own profile. Querying your details in the Graph API explorer Now suppose, I wanted to retrieve information about a Facebook page,“Premier League” which represents the top tier competition in English Football using its identifier and also take a look at its liked pages. I can do the same using the following request. Querying information about a Facebook Page using the Graph API explorer Thus from the above figure, you can clearly see the node identifier, page name and likes for the page, “Premier League”. It must be clear by now that all API responses are returned in the very popular JSON format which is easy to parse and format as needed for analysis. Besides this, there also used to be another way of querying the social graph in Facebook, which was known as FQL or Facebook Query Language, an SQL like interface for querying and retrieving data. Unfortunately, Facebook seems to have deprecated its use and hence covering it would be out of our present scope. Now that you have a firm grasp on the syntax of the Graph API and have also seen a few examples of how to retrieve data from Facebook, we will take a closer look at the Rfacebook package. Understanding Rfacebook Since we will be accessing and analyzing data from Facebook using R, it makes sense to have some robust mechanism to directly query Facebook and retrieve data instead of going to the browser every time like we did in the earlier section. Fortunately, there is an excellent package in R called Rfacebook which has been developed by Pablo Barberá. You can either install it from CRAN or get its most updated version from GitHub. The following snippet depicts how you can do the same. Remember you might need to install the devtools package if you don’t have it already, to download and install the latest version of the Rfacebook package from GitHub. install.packages("Rfacebook") # install from CRAN # install from GitHub library(devtools) install_github("pablobarbera/Rfacebook/Rfacebook") Once you install the package, you can load up the package using load(Rfacebook) and start using it to retrieve data from Facebook by using the access token you generated earlier. The following snippet shows us how you can access your own details like we had mentioned in the previous section, but this time by using R. > token = 'XXXXXX' > me <- getUsers("me", token=token) > me$name [1] "Dipanjan Sarkar" > me$id [1] "1026544" The beauty of this package is that you directly get the results in curated and neatly formatted data frames and you do not need to spend extra time trying to parse the raw JSON response objects from the Graph API. The package is well documented and has high level functions for accessing personal profile data on Facebook as well as page and group level data points. We will now take a quick look at Netvizz a Facebook application, which can also be used to extract data easily from Facebook. Understanding Netvizz The Netvizz application was developed by Bernhard Rieder and is a tool which can be used to extract data from Facebook pages, groups, get statistics about links and also extract social networks from Facebook pages based on liked pages from each connected page in the network. You can access Netvizz at https://apps.facebook.com/netvizz/ and on registering the application on your profile, you will be able to see the following screen. The Netvizz application interface From the above app snapshot, you can see that there are various links based on the type of operation you want to execute to extract data. Feel free to play around with this tool and we will be using its “page like network” capability later on in one of our analyses in a future section. Data Access Challenges There are several challenges with regards to accessing data from Facebook. Some of the major issues and caveats have been mentioned in the following points: Facebook will keep evolving and updating its data access APIs and this can and will lead to changes and deprecation of older APIs and access patterns just like FQL was deprecated. Scope of data available keeps changing with time and evolving of Facebook’s API and privacy settings. For instance we can no longer get details of all our friends from the API any longer. Libraries and Tools built on top of the API can tend to break with changes to Facebook’s APIs and this has happened before with Rfacebook as well as Netvizz. Besides this, Lada Adamic’s GetNet application has stopped working permanently ever since Facebook changed the way apps are created and the permissions they require. You can get more information about it here http://thepoliticsofsystems.net/2015/01/the-end-of-netvizz/ Thus what was used in the book today for data retrieval might not be working completely tomorrow if there are any changes in the APIs though it is expected it will be working fine for at least the next couple of years. However to prevent any hindrance on analyzing Facebook data, we have provided the datasets we used in most of our analyses except personal networks so that you can still follow along with each example and use-case. Personal names have been anonymized wherever possible to protect their privacy. Now that we have a good idea about Facebook’s Graph API and how to access data, let’s analyze some social networks! Analyzing your personal social network Like we had mentioned before, Facebook by itself is a massive social graph, connecting billions of users, brands and organization. Consider your own Facebook account if you have one. You will have several friends which are your immediate connections, they in turn will be having their own set of friends including you and you might be friends with some of them and so on. Thus you and your friends form the nodes of the network and edges determine the connections. In this section we will analyze a small network of you and your immediate friends and also look at how we can extract and analyze some properties from the network. Before we jump into our analysis, we will start by loading the necessary packages needed which are mentioned in the following snippet and storing the Facebook Graph API access token in a variable. library(Rfacebook) library(gridExtra) library(dplyr) # get the Graph API access token token = ‘XXXXXXXXXXX’ You can refer to the file fb_personal_network_analysis.R for code snippets used in the examples depicted in this section. Basic descriptive statistics In this section, we will try to get some basic information and descriptive statistics on the same from our personal social network on Facebook. To start with let us look at some details of our own profile on Facebook using the following code. # get my personal information me <- getUsers("me", token=token, private_info = TRUE) > View(me[c('name', 'id', 'gender', 'birthday')]) This shows us a few fields from the data frame containing our personal details retrieved from Facebook. We use the View function which basically invokes a spreadsheet-style data viewer on R objects like data frames. Now, let us get information about our friends in our personal network. Do note that Facebook currently only lets you access information about those friends who have allowed access to the Graph API and hence you may not be able to get information pertaining to all friends in your friend list. We have anonymized their names below for privacy reasons. anonymous_names <- c('Johnny Juel', 'Houston Tancredi',..., 'Julius Henrichs', 'Yong Sprayberry') # getting friends information friends <- getFriends(token, simplify=TRUE) friends$name <- anonymous_names # view top few rows > View(head(friends)) This gives us a peek at some people from our list of friends which we just retrieved from Facebook. Let’s now analyze some descriptive statistics based on personal information regarding our friends like where they are from, their gender and so on. # get personal information friends_info <- getUsers(friends$id, token, private_info = TRUE) # get the gender of your friends >View(table(friends_info$gender)) This gives us the gender of my friends, looks like more male friends have authorized access to the Graph API in my network! # get the location of your friends >View(table(friends_info$location)) This depicts the location of my friends (wherever available) in the following data frame. # get relationship status of your friends > View(table(friends_info$relationship_status)) From the statistics in the following table I can see that a lot of my friends have gotten married over the past couple of years. Boy that does make me feel old! Suppose I want to look at the relationship status of my friends grouped by gender, we can do the same using the following snippet. # get relationship status of friends grouped by gender View(table(friends_info$relationship_status, friends_info$gender)) The following table gives us the desired results and you can see the distribution of friends by their gender and relationship status. Summary This article has been proven very beneficial to know some basic analytics of social networks with the help of R. Moreover, you will also get to know the information regarding the packages that R use. Resources for Article: Further resources on this subject: How to integrate social media with your WordPress website [article] Social Media Insight Using Naive Bayes [article] Social Media in Magento [article]
Read more
  • 0
  • 0
  • 4530

article-image-the-us-department-of-commerce-wants-to-regulate-export-of-ai-and-related-products
Prasad Ramesh
21 Nov 2018
4 min read
Save for later

The US Department of Commerce wants to regulate export of AI and related products

Prasad Ramesh
21 Nov 2018
4 min read
This Monday the Department of Commerce, Bureau of Industry and Security (BIS) published a proposal to control the export of AI from USA. This move seems to lean towards restricting AI tech going out of the country to protect the national security of USA. The areas that come under the licensing proposal Artificial intelligence, as we’ve seen in recent years has great potential for both good and harm. The DoC in the United States of America is not taking any chances with it. The proposal lists many areas of AI that could potentially require a license to be exported to certain countries. Other than computer vision, natural language processing, military-specific products like adaptive camouflage and faceprint for surveillance is also listed in the proposal to restrict the export of AI. The areas major areas listed in the proposal are: Biotechnology including genomic and genetic engineering Artificial intelligence (AI) and machine learning including neural networks, computer vision, and natural language processing Position, Navigation, and Timing (PNT) technology Microprocessor technology like stacked memory on chip Advanced computing technology like memory-centric logic Data analytics technology like data analytics by visualization and analysis algorithms Quantum information and sensing technology like quantum computing, encryption, and sensing Logistics technology like mobile electric power Additive manufacturing like 3D printing Robotics like micro drones and molecular robotics Brain-computer interfaces like mind-machine interfaces Hypersonics like flight control algorithms Advanced Materials like adaptive camouflage Advanced surveillance technologies faceprint and voiceprint technologies David Edelman, a former adviser to ex-US president Barack Obama said: “This is intended to be a shot across the bow, directed specifically at Beijing, in an attempt to flex their muscles on just how broad these restrictions could be”. Countries that could be affected with regulation on export of AI To determine the level of export controls, the department will consider the potential end-uses and end-users of the technology. The list of countries is not clear but ones to which exports are restricted like embargoed countries will be considered. Also, China could be one of them. What does this mean for companies? If your organization creates products in ‘emerging technologies’ then there will be restrictions on the countries you can export to and also on disclosure of technology to foreign nationals in United States. Depending on the criteria, non-US citizens might even need licenses to participate in research and development of such technology. This will restrict non-US citizens to participate and take back anything from, say an advanced AI research project. If the new regulations go into effect, it will affect the security review of foreign investments across these areas. When the list of technologies is finalized, many types of foreign investments will be subject to a review and deals could be halted or undone. Public views on academic research In addition to commercial applications and products, this regulation could also be bad news for academic research. https://twitter.com/jordanbharrod/status/1065047269282627584 https://twitter.com/BryanAlexander/status/1064941028795400193 Even Google Home, Amazon Alexa, iRobot Roomba could be affected. https://twitter.com/R_D/status/1064511113956655105 But it does not look like research papers will be really affected. The document states that the commerce does not intend to expand jurisdiction on ‘fundamental research’ for ‘emerging technologies’ that is intended to be published and not currently subject to EAR as per § 734.8. But will this affect open-source technologies? We really hope not. Deadline for comments is less than 30 days away BIS has invited comments to the proposal for defining and categorizing emerging technologies, the impact of the controls in US technology leadership among other topics. However the short deadline of December 19, 2018 indicates their haste to implement licensing export of AI quickly. For more details, and to know where you can submit your comments, read the proposal. The US Air Force lays groundwork towards artificial general intelligence based on hierarchical model of intelligence Google open sources BERT, an NLP pre-training technique Teaching AI ethics – Trick or Treat?
Read more
  • 0
  • 0
  • 4523
Unlock access to the largest independent learning library in Tech for FREE!
Get unlimited access to 7500+ expert-authored eBooks and video courses covering every tech area you can think of.
Renews at $19.99/month. Cancel anytime
article-image-moving-further-numpy-modules
Packt
23 Jun 2015
23 min read
Save for later

Moving Further with NumPy Modules

Packt
23 Jun 2015
23 min read
NumPy has a number of modules inherited from its predecessor, Numeric. Some of these packages have a SciPy counterpart, which may have fuller functionality. In this article by Ivan Idris author of the book NumPy: Beginner's Guide - Third Edition we will cover the following topics: The linalg package The fft package Random numbers Continuous and discrete distributions (For more resources related to this topic, see here.) Linear algebra Linear algebra is an important branch of mathematics. The numpy.linalg package contains linear algebra functions. With this module, you can invert matrices, calculate eigenvalues, solve linear equations, and determine determinants, among other things (see http://docs.scipy.org/doc/numpy/reference/routines.linalg.html). Time for action – inverting matrices The inverse of a matrix A in linear algebra is the matrix A-1, which, when multiplied with the original matrix, is equal to the identity matrix I. This can be written as follows: A A-1 = I The inv() function in the numpy.linalg package can invert an example matrix with the following steps: Create the example matrix with the mat() function: A = np.mat("0 1 2;1 0 3;4 -3 8") print("An", A) The A matrix appears as follows: A [[ 0 1 2] [ 1 0 3] [ 4 -3 8]] Invert the matrix with the inv() function: inverse = np.linalg.inv(A) print("inverse of An", inverse) The inverse matrix appears as follows: inverse of A [[-4.5 7. -1.5] [-2.   4. -1. ] [ 1.5 -2.   0.5]] If the matrix is singular, or not square, a LinAlgError is raised. If you want, you can check the result manually with a pen and paper. This is left as an exercise for the reader. Check the result by multiplying the original matrix with the result of the inv() function: print("Checkn", A * inverse) The result is the identity matrix, as expected: Check [[ 1. 0. 0.] [ 0. 1. 0.] [ 0. 0. 1.]] What just happened? We calculated the inverse of a matrix with the inv() function of the numpy.linalg package. We checked, with matrix multiplication, whether this is indeed the inverse matrix (see inversion.py): from __future__ import print_function import numpy as np   A = np.mat("0 1 2;1 0 3;4 -3 8") print("An", A)   inverse = np.linalg.inv(A) print("inverse of An", inverse)   print("Checkn", A * inverse) Pop quiz – creating a matrix Q1. Which function can create matrices? array create_matrix mat vector Have a go hero – inverting your own matrix Create your own matrix and invert it. The inverse is only defined for square matrices. The matrix must be square and invertible; otherwise, a LinAlgError exception is raised. Solving linear systems A matrix transforms a vector into another vector in a linear way. This transformation mathematically corresponds to a system of linear equations. The numpy.linalg function solve() solves systems of linear equations of the form Ax = b, where A is a matrix, b can be a one-dimensional or two-dimensional array, and x is an unknown variable. We will see the dot() function in action. This function returns the dot product of two floating-point arrays. The dot() function calculates the dot product (see https://www.khanacademy.org/math/linear-algebra/vectors_and_spaces/dot_cross_products/v/vector-dot-product-and-vector-length). For a matrix A and vector b, the dot product is equal to the following sum: Time for action – solving a linear system Solve an example of a linear system with the following steps: Create A and b: A = np.mat("1 -2 1;0 2 -8;-4 5 9") print("An", A) b = np.array([0, 8, -9]) print("bn", b) A and b appear as follows: Solve this linear system with the solve() function: x = np.linalg.solve(A, b) print("Solution", x) The solution of the linear system is as follows: Solution [ 29. 16.   3.] Check whether the solution is correct with the dot() function: print("Checkn", np.dot(A , x)) The result is as expected: Check [[ 0. 8. -9.]] What just happened? We solved a linear system using the solve() function from the NumPy linalg module and checked the solution with the dot() function: from __future__ import print_function import numpy as np   A = np.mat("1 -2 1;0 2 -8;-4 5 9") print("An", A)   b = np.array([0, 8, -9]) print("bn", b)   x = np.linalg.solve(A, b) print("Solution", x)   print("Checkn", np.dot(A , x)) Finding eigenvalues and eigenvectors Eigenvalues are scalar solutions to the equation Ax = ax, where A is a two-dimensional matrix and x is a one-dimensional vector. Eigenvectors are vectors corresponding to eigenvalues (see https://www.khanacademy.org/math/linear-algebra/alternate_bases/eigen_everything/v/linear-algebra-introduction-to-eigenvalues-and-eigenvectors). The eigvals() function in the numpy.linalg package calculates eigenvalues. The eig() function returns a tuple containing eigenvalues and eigenvectors. Time for action – determining eigenvalues and eigenvectors Let's calculate the eigenvalues of a matrix: Create a matrix as shown in the following: A = np.mat("3 -2;1 0") print("An", A) The matrix we created looks like the following: A [[ 3 -2] [ 1 0]] Call the eigvals() function: print("Eigenvalues", np.linalg.eigvals(A)) The eigenvalues of the matrix are as follows: Eigenvalues [ 2. 1.] Determine eigenvalues and eigenvectors with the eig() function. This function returns a tuple, where the first element contains eigenvalues and the second element contains corresponding eigenvectors, arranged column-wise: eigenvalues, eigenvectors = np.linalg.eig(A) print("First tuple of eig", eigenvalues) print("Second tuple of eign", eigenvectors) The eigenvalues and eigenvectors appear as follows: First tuple of eig [ 2. 1.] Second tuple of eig [[ 0.89442719 0.70710678] [ 0.4472136   0.70710678]] Check the result with the dot() function by calculating the right and left side of the eigenvalues equation Ax = ax: for i, eigenvalue in enumerate(eigenvalues):      print("Left", np.dot(A, eigenvectors[:,i]))      print("Right", eigenvalue * eigenvectors[:,i])      print() The output is as follows: Left [[ 1.78885438] [ 0.89442719]] Right [[ 1.78885438] [ 0.89442719]] What just happened? We found the eigenvalues and eigenvectors of a matrix with the eigvals() and eig() functions of the numpy.linalg module. We checked the result using the dot() function (see eigenvalues.py): from __future__ import print_function import numpy as np   A = np.mat("3 -2;1 0") print("An", A)   print("Eigenvalues", np.linalg.eigvals(A) )   eigenvalues, eigenvectors = np.linalg.eig(A) print("First tuple of eig", eigenvalues) print("Second tuple of eign", eigenvectors)   for i, eigenvalue in enumerate(eigenvalues):      print("Left", np.dot(A, eigenvectors[:,i]))      print("Right", eigenvalue * eigenvectors[:,i])      print() Singular value decomposition Singular value decomposition (SVD) is a type of factorization that decomposes a matrix into a product of three matrices. The SVD is a generalization of the previously discussed eigenvalue decomposition. SVD is very useful for algorithms such as the pseudo inverse, which we will discuss in the next section. The svd() function in the numpy.linalg package can perform this decomposition. This function returns three matrices U, ?, and V such that U and V are unitary and ? contains the singular values of the input matrix: The asterisk denotes the Hermitian conjugate or the conjugate transpose. The complex conjugate changes the sign of the imaginary part of a complex number and is therefore not relevant for real numbers. A complex square matrix A is unitary if A*A = AA* = I (the identity matrix). We can interpret SVD as a sequence of three operations—rotation, scaling, and another rotation. We already transposed matrices in this article. The transpose flips matrices, turning rows into columns, and columns into rows. Time for action – decomposing a matrix It's time to decompose a matrix with the SVD using the following steps: First, create a matrix as shown in the following: A = np.mat("4 11 14;8 7 -2") print("An", A) The matrix we created looks like the following: A [[ 4 11 14] [ 8 7 -2]] Decompose the matrix with the svd() function: U, Sigma, V = np.linalg.svd(A, full_matrices=False) print("U") print(U) print("Sigma") print(Sigma) print("V") print(V) Because of the full_matrices=False specification, NumPy performs a reduced SVD decomposition, which is faster to compute. The result is a tuple containing the two unitary matrices U and V on the left and right, respectively, and the singular values of the middle matrix: U [[-0.9486833 -0.31622777]   [-0.31622777 0.9486833 ]] Sigma [ 18.97366596   9.48683298] V [[-0.33333333 -0.66666667 -0.66666667] [ 0.66666667 0.33333333 -0.66666667]] We do not actually have the middle matrix—we only have the diagonal values. The other values are all 0. Form the middle matrix with the diag() function. Multiply the three matrices as follows: print("Productn", U * np.diag(Sigma) * V) The product of the three matrices is equal to the matrix we created in the first step: Product [[ 4. 11. 14.] [ 8.   7. -2.]] What just happened? We decomposed a matrix and checked the result by matrix multiplication. We used the svd() function from the NumPy linalg module (see decomposition.py): from __future__ import print_function import numpy as np   A = np.mat("4 11 14;8 7 -2") print("An", A)   U, Sigma, V = np.linalg.svd(A, full_matrices=False)   print("U") print(U)   print("Sigma") print(Sigma)   print("V") print(V)   print("Productn", U * np.diag(Sigma) * V) Pseudo inverse The Moore-Penrose pseudo inverse of a matrix can be computed with the pinv() function of the numpy.linalg module (see http://en.wikipedia.org/wiki/Moore%E2%80%93Penrose_pseudoinverse). The pseudo inverse is calculated using the SVD (see previous example). The inv() function only accepts square matrices; the pinv() function does not have this restriction and is therefore considered a generalization of the inverse. Time for action – computing the pseudo inverse of a matrix Let's compute the pseudo inverse of a matrix: First, create a matrix: A = np.mat("4 11 14;8 7 -2") print("An", A) The matrix we created looks like the following: A [[ 4 11 14] [ 8 7 -2]] Calculate the pseudo inverse matrix with the pinv() function: pseudoinv = np.linalg.pinv(A) print("Pseudo inversen", pseudoinv) The pseudo inverse result is as follows: Pseudo inverse [[-0.00555556 0.07222222] [ 0.02222222 0.04444444] [ 0.05555556 -0.05555556]] Multiply the original and pseudo inverse matrices: print("Check", A * pseudoinv) What we get is not an identity matrix, but it comes close to it: Check [[ 1.00000000e+00   0.00000000e+00] [ 8.32667268e-17   1.00000000e+00]] What just happened? We computed the pseudo inverse of a matrix with the pinv() function of the numpy.linalg module. The check by matrix multiplication resulted in a matrix that is approximately an identity matrix (see pseudoinversion.py): from __future__ import print_function import numpy as np   A = np.mat("4 11 14;8 7 -2") print("An", A)   pseudoinv = np.linalg.pinv(A) print("Pseudo inversen", pseudoinv)   print("Check", A * pseudoinv) Determinants The determinant is a value associated with a square matrix. It is used throughout mathematics; for more details, please refer to http://en.wikipedia.org/wiki/Determinant. For a n x n real value matrix, the determinant corresponds to the scaling a n-dimensional volume undergoes when transformed by the matrix. The positive sign of the determinant means the volume preserves its orientation (clockwise or anticlockwise), while a negative sign means reversed orientation. The numpy.linalg module has a det() function that returns the determinant of a matrix. Time for action – calculating the determinant of a matrix To calculate the determinant of a matrix, follow these steps: Create the matrix: A = np.mat("3 4;5 6") print("An", A) The matrix we created appears as follows: A [[ 3. 4.] [ 5. 6.]] Compute the determinant with the det() function: print("Determinant", np.linalg.det(A)) The determinant appears as follows: Determinant -2.0 What just happened? We calculated the determinant of a matrix with the det() function from the numpy.linalg module (see determinant.py): from __future__ import print_function import numpy as np   A = np.mat("3 4;5 6") print("An", A)   print("Determinant", np.linalg.det(A)) Fast Fourier transform The Fast Fourier transform (FFT) is an efficient algorithm to calculate the discrete Fourier transform (DFT). The Fourier series represents a signal as a sum of sine and cosine terms. FFT improves on more naïve algorithms and is of order O(N log N). DFT has applications in signal processing, image processing, solving partial differential equations, and more. NumPy has a module called fft that offers FFT functionality. Many functions in this module are paired; for those functions, another function does the inverse operation. For instance, the fft() and ifft() function form such a pair. Time for action – calculating the Fourier transform First, we will create a signal to transform. Calculate the Fourier transform with the following steps: Create a cosine wave with 30 points as follows: x = np.linspace(0, 2 * np.pi, 30) wave = np.cos(x) Transform the cosine wave with the fft() function: transformed = np.fft.fft(wave) Apply the inverse transform with the ifft() function. It should approximately return the original signal. Check with the following line: print(np.all(np.abs(np.fft.ifft(transformed) - wave)   < 10 ** -9)) The result appears as follows: True Plot the transformed signal with matplotlib: plt.plot(transformed) plt.title('Transformed cosine') plt.xlabel('Frequency') plt.ylabel('Amplitude') plt.grid() plt.show() The following resulting diagram shows the FFT result: What just happened? We applied the fft() function to a cosine wave. After applying the ifft() function, we got our signal back (see fourier.py): from __future__ import print_function import numpy as np import matplotlib.pyplot as plt     x = np.linspace(0, 2 * np.pi, 30) wave = np.cos(x) transformed = np.fft.fft(wave) print(np.all(np.abs(np.fft.ifft(transformed) - wave) < 10 ** -9))   plt.plot(transformed) plt.title('Transformed cosine') plt.xlabel('Frequency') plt.ylabel('Amplitude') plt.grid() plt.show() Shifting The fftshift() function of the numpy.linalg module shifts zero-frequency components to the center of a spectrum. The zero-frequency component corresponds to the mean of the signal. The ifftshift() function reverses this operation. Time for action – shifting frequencies We will create a signal, transform it, and then shift the signal. Shift the frequencies with the following steps: Create a cosine wave with 30 points: x = np.linspace(0, 2 * np.pi, 30) wave = np.cos(x) Transform the cosine wave with the fft() function: transformed = np.fft.fft(wave) Shift the signal with the fftshift() function: shifted = np.fft.fftshift(transformed) Reverse the shift with the ifftshift() function. This should undo the shift. Check with the following code snippet: print(np.all((np.fft.ifftshift(shifted) - transformed)   < 10 ** -9)) The result appears as follows: True Plot the signal and transform it with matplotlib: plt.plot(transformed, lw=2, label="Transformed") plt.plot(shifted, '--', lw=3, label="Shifted") plt.title('Shifted and transformed cosine wave') plt.xlabel('Frequency') plt.ylabel('Amplitude') plt.grid() plt.legend(loc='best') plt.show() The following diagram shows the effect of the shift and the FFT: What just happened? We applied the fftshift() function to a cosine wave. After applying the ifftshift() function, we got our signal back (see fouriershift.py): import numpy as np import matplotlib.pyplot as plt     x = np.linspace(0, 2 * np.pi, 30) wave = np.cos(x) transformed = np.fft.fft(wave) shifted = np.fft.fftshift(transformed) print(np.all(np.abs(np.fft.ifftshift(shifted) - transformed) < 10 ** -9))   plt.plot(transformed, lw=2, label="Transformed") plt.plot(shifted, '--', lw=3, label="Shifted") plt.title('Shifted and transformed cosine wave') plt.xlabel('Frequency') plt.ylabel('Amplitude') plt.grid() plt.legend(loc='best') plt.show() Random numbers Random numbers are used in Monte Carlo methods, stochastic calculus, and more. Real random numbers are hard to generate, so, in practice, we use pseudo random numbers, which are random enough for most intents and purposes, except for some very special cases. These numbers appear random, but if you analyze them more closely, you will realize that they follow a certain pattern. The random numbers-related functions are in the NumPy random module. The core random number generator is based on the Mersenne Twister algorithm—a standard and well-known algorithm (see https://en.wikipedia.org/wiki/Mersenne_Twister). We can generate random numbers from discrete or continuous distributions. The distribution functions have an optional size parameter, which tells NumPy how many numbers to generate. You can specify either an integer or a tuple as size. This will result in an array filled with random numbers of appropriate shape. Discrete distributions include the geometric, hypergeometric, and binomial distributions. Time for action – gambling with the binomial The binomial distribution models the number of successes in an integer number of independent trials of an experiment, where the probability of success in each experiment is a fixed number (see https://www.khanacademy.org/math/probability/random-variables-topic/binomial_distribution). Imagine a 17th century gambling house where you can bet on flipping pieces of eight. Nine coins are flipped. If less than five are heads, then you lose one piece of eight, otherwise you win one. Let's simulate this, starting with 1,000 coins in our possession. Use the binomial() function from the random module for that purpose. To understand the binomial() function, look at the following section: Initialize an array, which represents the cash balance, to zeros. Call the binomial() function with a size of 10000. This represents 10,000 coin flips in our casino: cash = np.zeros(10000) cash[0] = 1000 outcome = np.random.binomial(9, 0.5, size=len(cash)) Go through the outcomes of the coin flips and update the cash array. Print the minimum and maximum of the outcome, just to make sure we don't have any strange outliers: for i in range(1, len(cash)):    if outcome[i] < 5:      cash[i] = cash[i - 1] - 1    elif outcome[i] < 10:      cash[i] = cash[i - 1] + 1    else:      raise AssertionError("Unexpected outcome " + outcome)   print(outcome.min(), outcome.max()) As expected, the values are between 0 and 9. In the following diagram, you can see the cash balance performing a random walk: What just happened? We did a random walk experiment using the binomial() function from the NumPy random module (see headortail.py): from __future__ import print_function import numpy as np import matplotlib.pyplot as plt     cash = np.zeros(10000) cash[0] = 1000 np.random.seed(73) outcome = np.random.binomial(9, 0.5, size=len(cash))   for i in range(1, len(cash)):    if outcome[i] < 5:      cash[i] = cash[i - 1] - 1    elif outcome[i] < 10:      cash[i] = cash[i - 1] + 1    else:      raise AssertionError("Unexpected outcome " + outcome)   print(outcome.min(), outcome.max())   plt.plot(np.arange(len(cash)), cash) plt.title('Binomial simulation') plt.xlabel('# Bets') plt.ylabel('Cash') plt.grid() plt.show() Hypergeometric distribution The hypergeometricdistribution models a jar with two types of objects in it. The model tells us how many objects of one type we can get if we take a specified number of items out of the jar without replacing them (see https://en.wikipedia.org/wiki/Hypergeometric_distribution). The NumPy random module has a hypergeometric() function that simulates this situation. Time for action – simulating a game show Imagine a game show where every time the contestants answer a question correctly, they get to pull three balls from a jar and then put them back. Now, there is a catch, one ball in the jar is bad. Every time it is pulled out, the contestants lose six points. If, however, they manage to get out 3 of the 25 normal balls, they get one point. So, what is going to happen if we have 100 questions in total? Look at the following section for the solution: Initialize the outcome of the game with the hypergeometric() function. The first parameter of this function is the number of ways to make a good selection, the second parameter is the number of ways to make a bad selection, and the third parameter is the number of items sampled: points = np.zeros(100) outcomes = np.random.hypergeometric(25, 1, 3, size=len(points)) Set the scores based on the outcomes from the previous step: for i in range(len(points)):    if outcomes[i] == 3:      points[i] = points[i - 1] + 1    elif outcomes[i] == 2:      points[i] = points[i - 1] - 6    else:     print(outcomes[i]) The following diagram shows how the scoring evolved: What just happened? We simulated a game show using the hypergeometric() function from the NumPy random module. The game scoring depends on how many good and how many bad balls the contestants pulled out of a jar in each session (see urn.py): from __future__ import print_function import numpy as np import matplotlib.pyplot as plt     points = np.zeros(100) np.random.seed(16) outcomes = np.random.hypergeometric(25, 1, 3, size=len(points))   for i in range(len(points)):    if outcomes[i] == 3:      points[i] = points[i - 1] + 1    elif outcomes[i] == 2:      points[i] = points[i - 1] - 6    else:      print(outcomes[i])   plt.plot(np.arange(len(points)), points) plt.title('Game show simulation') plt.xlabel('# Rounds') plt.ylabel('Score') plt.grid() plt.show() Continuous distributions We usually model continuous distributions with probability density functions (PDF). The probability that a value is in a certain interval is determined by integration of the PDF (see https://www.khanacademy.org/math/probability/random-variables-topic/random_variables_prob_dist/v/probability-density-functions). The NumPy random module has functions that represent continuous distributions—beta(), chisquare(), exponential(), f(), gamma(), gumbel(), laplace(), lognormal(), logistic(), multivariate_normal(), noncentral_chisquare(), noncentral_f(), normal(), and others. Time for action – drawing a normal distribution We can generate random numbers from a normal distribution and visualize their distribution with a histogram (see https://www.khanacademy.org/math/probability/statistics-inferential/normal_distribution/v/introduction-to-the-normal-distribution). Draw a normal distribution with the following steps: Generate random numbers for a given sample size using the normal() function from the random NumPy module: N=10000 normal_values = np.random.normal(size=N) Draw the histogram and theoretical PDF with a center value of 0 and standard deviation of 1. Use matplotlib for this purpose: _, bins, _ = plt.hist(normal_values,   np.sqrt(N), normed=True, lw=1) sigma = 1 mu = 0 plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi))   * np.exp( - (bins - mu)**2 / (2 * sigma**2) ),lw=2) plt.show() In the following diagram, we see the familiar bell curve: What just happened? We visualized the normal distribution using the normal() function from the random NumPy module. We did this by drawing the bell curve and a histogram of randomly generated values (see normaldist.py): import numpy as np import matplotlib.pyplot as plt   N=10000   np.random.seed(27) normal_values = np.random.normal(size=N) _, bins, _ = plt.hist(normal_values, np.sqrt(N), normed=True, lw=1, label="Histogram") sigma = 1 mu = 0 plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) * np.exp( - (bins - mu)**2 / (2 * sigma**2) ), '--', lw=3, label="PDF") plt.title('Normal distribution') plt.xlabel('Value') plt.ylabel('Normalized Frequency') plt.grid() plt.legend(loc='best') plt.show() Lognormal distribution A lognormal distribution is a distribution of a random variable whose natural logarithm is normally distributed. The lognormal() function of the random NumPy module models this distribution. Time for action – drawing the lognormal distribution Let's visualize the lognormal distribution and its PDF with a histogram: Generate random numbers using the normal() function from the random NumPy module: N=10000 lognormal_values = np.random.lognormal(size=N) Draw the histogram and theoretical PDF with a center value of 0 and standard deviation of 1: _, bins, _ = plt.hist(lognormal_values,   np.sqrt(N), normed=True, lw=1) sigma = 1 mu = 0 x = np.linspace(min(bins), max(bins), len(bins)) pdf = np.exp(-(numpy.log(x) - mu)**2 / (2 * sigma**2))/ (x *   sigma * np.sqrt(2 * np.pi)) plt.plot(x, pdf,lw=3) plt.show() The fit of the histogram and theoretical PDF is excellent, as you can see in the following diagram: What just happened? We visualized the lognormal distribution using the lognormal() function from the random NumPy module. We did this by drawing the curve of the theoretical PDF and a histogram of randomly generated values (see lognormaldist.py): import numpy as np import matplotlib.pyplot as plt   N=10000 np.random.seed(34) lognormal_values = np.random.lognormal(size=N) _, bins, _ = plt.hist(lognormal_values,   np.sqrt(N), normed=True, lw=1, label="Histogram") sigma = 1 mu = 0 x = np.linspace(min(bins), max(bins), len(bins)) pdf = np.exp(-(np.log(x) - mu)**2 / (2 * sigma**2))/ (x * sigma * np.sqrt(2 * np.pi)) plt.xlim([0, 15]) plt.plot(x, pdf,'--', lw=3, label="PDF") plt.title('Lognormal distribution') plt.xlabel('Value') plt.ylabel('Normalized frequency') plt.grid() plt.legend(loc='best') plt.show() Bootstrapping in statistics Bootstrapping is a method used to estimate variance, accuracy, and other metrics of sample estimates, such as the arithmetic mean. The simplest bootstrapping procedure consists of the following steps: Generate a large number of samples from the original data sample having the same size N. You can think of the original data as a jar containing numbers. We create the new samples by N times randomly picking a number from the jar. Each time we return the number into the jar, so a number can occur multiple times in a generated sample. With the new samples, we calculate the statistical estimate under investigation for each sample (for example, the arithmetic mean). This gives us a sample of possible values for the estimator. Time for action – sampling with numpy.random.choice() We will use the numpy.random.choice() function to perform bootstrapping. Start the IPython or Python shell and import NumPy: $ ipython In [1]: import numpy as np Generate a data sample following the normal distribution: In [2]: N = 500   In [3]: np.random.seed(52)   In [4]: data = np.random.normal(size=N)   Calculate the mean of the data: In [5]: data.mean() Out[5]: 0.07253250605445645 Generate 100 samples from the original data and calculate their means (of course, more samples may lead to a more accurate result): In [6]: bootstrapped = np.random.choice(data, size=(N, 100))   In [7]: means = bootstrapped.mean(axis=0)   In [8]: means.shape Out[8]: (100,) Calculate the mean, variance, and standard deviation of the arithmetic means we obtained: In [9]: means.mean() Out[9]: 0.067866373318115278   In [10]: means.var() Out[10]: 0.001762807104774598   In [11]: means.std() Out[11]: 0.041985796464692651 If we are assuming a normal distribution for the means, it may be relevant to know the z-score, which is defined as follows: In [12]: (data.mean() - means.mean())/means.std() Out[12]: 0.11113598238549766 From the z-score value, we get an idea of how probable the actual mean is. What just happened? We bootstrapped a data sample by generating samples and calculating the means of each sample. Then we computed the mean, standard deviation, variance, and z-score of the means. We used the numpy.random.choice() function for bootstrapping. Summary You learned a lot in this article about NumPy modules. We covered linear algebra, the Fast Fourier transform, continuous and discrete distributions, and random numbers. Resources for Article: Further resources on this subject: SciPy for Signal Processing [article] Visualization [article] The plot function [article]
Read more
  • 0
  • 0
  • 4499

article-image-getting-started-with-h2o-for-machine-learning
Sugandha Lahoti
01 Dec 2017
7 min read
Save for later

Getting started with Machine Learning in H2O

Sugandha Lahoti
01 Dec 2017
7 min read
[box type="note" align="" class="" width=""]We present to you an excerpt from our book by Dr. Uday Kamath and Krishna Choppella titled Mastering Java Machine Learning. This book aims to give you an array of advanced techniques on Machine Learning. [/box] Our article given below talks about using H2O as a Machine Learning Platform for Big Data applications. H2O is a leading open source platform for Machine Learning at Big Data scale, with a focus on bringing AI to the enterprise. The company counts several leading lights in statistical learning theory and optimization among its scientific advisors. It supports programming environments in multiple languages. H2O architecture The following figure gives a high-level architecture of H2O with important components. H2O can access data from various data stores such as HDFS, SQL, NoSQL, and Amazon S3, to name a few. The most popular deployment of H2O is to use one of the deployment stacks with Spark or to run it in a H2O cluster itself. The core of H2O is an optimized way of handling Big Data in memory, so that iterative algorithms that go through the same data can be handled efficiently and achieve good performance. Important Machine Learning algorithms in supervised and unsupervised learning are implemented specially to handle horizontal scalability across multiple nodes and JVMs. H2O provides not only its own user interface, known as flow, to manage and run modeling tasks, but also has different language bindings and connector APIs to Java, R, Python, and Scala. Most Machine Learning algorithms, optimization algorithms, and utilities use the concept of fork-join or MapReduce. As shown in the figure below, the entire dataset is considered as a Data Frame in H2O, and comprises vectors, which are features or columns in the dataset. The rows or instances are made up of one element from each Vector arranged side-by-side. The rows are grouped together to form a processing unit known as a Chunk. Several chunks are combined in one JVM. Any algorithmic or optimization work begins by sending the information from the topmost JVM to fork on to the next JVM, then on to the next, and so on, similar to the map operation in MapReduce. Each JVM works on the rows in the chunks to establish the task and finally the results flow back in the reduce operation: Machine learning in H2O The following figure shows all the Machine Learning algorithms supported in H2O v3 for supervised and unsupervised learning: Tools and usage H2O Flow is an interactive web application that helps data scientists to perform various tasks from importing data to running complex models using point and click and wizard-based concepts. H2O is run in local mode as: java –Xmx6g –jar h2o.jar The default way to start Flow is to point your browser and go to the following URL: http://192.168.1.7:54321/. The right-side of Flow captures every user action performed under the tab OUTLINE. The actions taken can be edited and saved as named flows for reuse and collaboration, as shown in the figure below: The figure below shows the interface for importing files from the local filesystem or HDFS and displays detailed summary statistics as well as next actions that can be performed on the dataset. Once the data is imported, it gets a data frame reference in the H2O framework with the extension of .hex. The summary statistics are useful in understanding the characteristics of data such as missing, mean, max, min, and so on. It also has an easy way to transform the features from one type to another, for example, numeric features with a few unique values to categorical/nominal types known as enum in H2O. The actions that can be performed on the datasets are: Visualize the data. Split the data into different sets such as training, validation, and testing. Build supervised and unsupervised models. Use the models to predict. Download and export the files in various formats. Building supervised or unsupervised models in H2O is done through an interactive screen. Every modeling algorithm has its parameters classified into three sections: basic, advanced, and expert. Any parameter that supports hyper-parameter searches for tuning the model has a checkbox grid next to it, and more than one parameter value can be used. Some basic parameters such as training_frame, validation_frame, and response_ column, are common to every supervised algorithm; others are specific to model types, such as the choice of solver for GLM, the activation function for deep learning, and so on. All such common parameters are available in the basic section. Advanced parameters are settings that afford greater flexibility and control to the modeler if the default behavior must be overridden. Several of these parameters are also common across some algorithms—two examples are the choice of method for assigning the fold index (if cross-validation was selected in the basic section), and selecting the column containing weights (if each example is weighted separately), and so on. Expert parameters define more complex elements such as how to handle the missing values, model-specific parameters that need more than a basic understanding of the algorithms, and other esoteric variables. In the figure below, GLM, a supervised learning algorithm, is being configured with 10-fold cross-validation, binomial (two-class) classification, efficient LBFGS optimization algorithm, and stratified sampling for cross-validation split: The model results screen contains a detailed analysis of the results using important evaluation charts, depending on the validation method that was used. At the top of the screen are possible actions that can be taken, such as to run the model on unseen data for prediction, download the model as POJO format, export the results, and so on. Some of the charts are algorithm-specific, like the scoring history that shows how the training loss or the objective function changes over the iterations in GLM—this gives the user insight into the speed of convergence as well as into the tuning of the iterations parameter. We see the ROC curves and the Area Under Curve metric on the validation data in addition to the gains and lift charts, which give the cumulative capture rate and cumulative lift over the validation sample respectively. The figure below shows SCORING HISTORY, ROC CURVE, and GAINS/LIFT charts for GLM on 10-fold cross-validation on the CoverType dataset: The output of validation gives detailed evaluation measures such as accuracy, AUC, err, errors, f1 measure, MCC (Mathews Correlation Coefficient), precision, and recall for each validation fold in the case of cross-validation as well as the mean and standard deviation computed across all. The prediction action runs the model using unseen held-out data to estimate the out-of-sample performance. Important measures such as errors, accuracy, area under curve, ROC plots, and so on, are given as the output of predictions that can be saved or exported. H2O is a rich visualization and analysis framework that is accessible from multiple programming environments( HDFS, SQL, NoSQL, S3, and others). It can also support a number of Machine Learning algorithms that can be run in a cluster. All these factors make it one of the major Machine Learning framework on Big Data. If you think this post is useful, do not miss to check our book Mastering Java Machine Learning  to know more on predictive models for batch- and stream-based big data learning using the latest tools and methodologies.  
Read more
  • 0
  • 0
  • 4471

article-image-integration-spark-sql
Packt
24 Sep 2015
11 min read
Save for later

Integration with Spark SQL

Packt
24 Sep 2015
11 min read
 In this article by Sumit Gupta, the author of the book Learning Real-time Processing with Spark Streaming, we will discuss the integration of Spark Streaming with various other advance Spark libraries such as Spark SQL. (For more resources related to this topic, see here.) No single software in today's world can fulfill the varied, versatile, and complex demands/needs of the enterprises, and to be honest, neither should it! Software are made to fulfill specific needs arising out of the enterprises at a particular point in time, which may change in future due to many other factors. These factors may or may not be controlled like government policies, business/market dynamics, and many more. Considering all these factors integration and interoperability of any software system with internal/external systems/software's is pivotal in fulfilling the enterprise needs. Integration and interoperability are categorized as nonfunctional requirements, which are always implicit and may or may not be explicitly stated by the end users. Over the period of time, architects have realized the importance of these implicit requirements in modern enterprises, and now, all enterprise architectures provide support due diligence and provisions in fulfillment of these requirements. Even the enterprise architecture frameworks such as The Open Group Architecture Framework (TOGAF) defines the specific set of procedures and guidelines for defining and establishing interoperability and integration requirements of modern enterprises. Spark community realized the importance of both these factors and provided a versatile and scalable framework with certain hooks for integration and interoperability with the different systems/libraries; for example; data consumed and processed via Spark streams can also be loaded into the structured (table: rows/columns) format and can be further queried using SQL. Even the data can be stored in the form of Hive tables in HDFS as persistent tables, which will exist even after our Spark program has restarted. In this article, we will discuss querying streaming data in real time using Spark SQL. Querying streaming data in real time Spark Streaming is developed on the principle of integration and interoperability where it not only provides a framework for consuming data in near real time from varied data sources, but at the same time, it also provides the integration with Spark SQL where existing DStreams can be converted into structured data format for querying using standard SQL constructs. There are many such use cases where SQL on streaming data is a much needed feature; for example, in our distributed log analysis use case, we may need to combine the precomputed datasets with the streaming data for performing exploratory analysis using interactive SQL queries, which is difficult to implement only with streaming operators as they are not designed for introducing new datasets and perform ad hoc queries. Moreover SQL's success at expressing complex data transformations derives from the fact that it is based on a set of very powerful data processing primitives that do filtering, merging, correlation, and aggregation, which is not available in the low-level programming languages such as Java/ C++ and may result in long development cycles and high maintenance costs. Let's move forward and first understand few things about Spark SQL, and then, we will also see the process of converting existing DStreams into the Structured formats. Understanding Spark SQL Spark SQL is one of the modules developed over the Spark framework for processing structured data, which is stored in the form of rows and columns. At a very high level, it is similar to the data residing in RDBMS in the form rows and columns, and then SQL queries are executed for performing analysis, but Spark SQL is much more versatile and flexible as compared to RDBMS. Spark SQL provides distributed processing of SQL queries and can be compared to frameworks Hive/Impala or Drill. Here are the few notable features of Spark SQL: Spark SQL is capable of loading data from variety of data sources such as text files, JSON, Hive, HDFS, Parquet format, and of course RDBMS too so that we can consume/join and process datasets from different and varied data sources. It supports static and dynamic schema definition for the data loaded from various sources, which helps in defining schema for known data structures/types, and also for those datasets where the columns and their types are not known until runtime. It can work as a distributed query engine using the thrift JDBC/ODBC server or command-line interface where end users or applications can interact with Spark SQL directly to run SQL queries. Spark SQL provides integration with Spark Streaming where DStreams can be transformed into the structured format and further SQL Queries can be executed. It is capable of caching tables using an in-memory columnar format for faster reads and in-memory data processing. It supports Schema evolution so that new columns can be added/deleted to the existing schema, and Spark SQL still maintains the compatibility between all versions of the schema. Spark SQL defines the higher level of programming abstraction called DataFrames, which is also an extension to the existing RDD API. Data frames are the distributed collection of the objects in the form of rows and named columns, which is similar to tables in the RDBMS, but with much richer functionality containing all the previously defined features. The DataFrame API is inspired by the concepts of data frames in R (http://www.r-tutor.com/r-introduction/data-frame) and Python (http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe). Let's move ahead and understand how Spark SQL works with the help of an example: As a first step, let's create sample JSON data about the basic information about the company's departments such as Name, Employees, and so on, and save this data into the file company.json. The JSON file would look like this: [ { "Name":"DEPT_A", "No_Of_Emp":10, "No_Of_Supervisors":2 }, { "Name":"DEPT_B", "No_Of_Emp":12, "No_Of_Supervisors":2 }, { "Name":"DEPT_C", "No_Of_Emp":14, "No_Of_Supervisors":3 }, { "Name":"DEPT_D", "No_Of_Emp":10, "No_Of_Supervisors":1 }, { "Name":"DEPT_E", "No_Of_Emp":20, "No_Of_Supervisors":5 } ] You can use any online JSON editor such as http://codebeautify.org/online-json-editor to see and edit data defined in the preceding JSON code. Next, let's extend our Spark-Examples project and create a new package by the name chapter.six, and within this new package, create a new Scala object and name it as ScalaFirstSparkSQL.scala. Next, add the following import statements just below the package declaration: import org.apache.spark.SparkConf import org.apache.spark.SparkContext import org.apache.spark.sql._ import org.apache.spark.sql.functions._ Further, in your main method, add following set of statements to create SQLContext from SparkContext: //Creating Spark Configuration val conf = new SparkConf() //Setting Application/ Job Name conf.setAppName("My First Spark SQL") // Define Spark Context which we will use to initialize our SQL Context val sparkCtx = new SparkContext(conf) //Creating SQL Context val sqlCtx = new SQLContext(sparkCtx) SQLContext or any of its descendants such as HiveContext—for working with Hive tables or CassandraSQLContext—for working with Cassandra tables is the main entry point for accessing all functionalities of Spark SQL. It allows the creation of data frames, and also provides functionality to fire SQL queries over data frames. Next, we will define the following code to load the JSON file (company.json) using the SQLContext, and further, we will also create a data frame: //Define path of your JSON File (company.json) which needs to be processed val path = "/home/softwares/spark/data/company.json"; //Use SQLCOntext and Load the JSON file. //This will return the DataFrame which can be further Queried using SQL queries. val dataFrame = sqlCtx.jsonFile(path) In the preceding piece of code, we used the jsonFile(…) method for loading the JSON data. There are other utility method defined by SQLContext for reading raw data from filesystem or creating data frames from the existing RDD and many more. Spark SQL supports two different methods for converting the existing RDDs into data frames. The first method uses reflection to infer the schema of an RDD from the given data. This approach leads to more concise code and helps in instances where we already know the schema while writing Spark application. We have used the same approach in our example. The second method is through a programmatic interface that allows to construct a schema. Then, apply it to an existing RDD and finally generate a data frame. This method is more verbose, but provides flexibility and helps in those instances where columns and data types are not known until the data is received at runtime. Refer to https://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.sql.SQLContext for a complete list of methods exposed by SQLContext. Once the DataFrame is created, we need to register DataFrame as a temporary table within the SQL context so that we can execute SQL queries over the registered table. Let's add the following piece of code for registering our DataFrame with our SQL context and name it company: //Register the data as a temporary table within SQL Context //Temporary table is destroyed as soon as SQL Context is destroyed. dataFrame.registerTempTable("company"); And we are done… Our JSON data is automatically organized into the table (rows/column) and is ready to accept the SQL queries. Even the data types is inferred from the type of data entered within the JSON file itself. Now, we will start executing the SQL queries on our table, but before that let's see the schema being created/defined by SQLContext: //Printing the Schema of the Data loaded in the Data Frame dataFrame.printSchema(); The execution of the preceding statement will provide results similar to mentioned illustration: The preceding illustration shows the schema of the JSON data loaded by Spark SQL. Pretty simple and straight, isn't it? Spark SQL has automatically created our schema based on the data defined in our company.json file. It has also defined the data type of each of the columns. We can also define the schema using reflection (https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#inferring-the-schema-using-reflection) or can also programmatically define the schema (https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#inferring-the-schema-using-reflection). Next, let's execute some SQL queries to see the data stored in DataFrame, so the first SQL would be to print all records: //Executing SQL Queries to Print all records in the DataFrame println("Printing All records") sqlCtx.sql("Select * from company").collect().foreach(print) The execution of the preceding statement will produce the following results on the console where the driver is executed: Next, let's also select only few columns instead of all records and print the same on console: //Executing SQL Queries to Print Name and Employees //in each Department println("n Printing Number of Employees in All Departments") sqlCtx.sql("Select Name, No_Of_Emp from company").collect().foreach(println) The execution of the preceding statement will produce the following results on the Console where the driver is executed: Now, finally let's do some aggregation and count the total number of all employees across the departments: //Using the aggregate function (agg) to print the //total number of employees in the Company println("n Printing Total Number of Employees in Company_X") val allRec = sqlCtx.sql("Select * from company").agg(Map("No_Of_Emp"->"sum")) allRec.collect.foreach ( println ) In the preceding piece of code, we used the agg(…) function and performed the sum of all employees across the departments, where sum can be replaced by avg, max, min, or count. The execution of the preceding statement will produce the following results on the console where the driver is executed: The preceding images shows the results of executing the aggregation on our company.json data. Refer to the Data Frame API at https://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.sql.DataFrame for further information on the available functions for performing aggregation. As a last step, we will stop our Spark SQL context by invoking the stop() function on SparkContext—sparkCtx.stop(). This is required so that your application can notify master or resource manager to release all resources allocated to the Spark job. It also ensures the graceful shutdown of the job and avoids any resource leakage, which may happen otherwise. Also, as of now, there can be only one Spark context active per JVM, and we need to stop() the active SparkContext class before creating a new one. Summary In this article, we have seen the step-by-step process of using Spark SQL as a standalone program. Though we have considered JSON files as an example, but we can also leverage Spark SQL with Cassandra (https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md) or MongoDB (https://github.com/Stratio/spark-mongodb) or Elasticsearch (http://chapeau.freevariable.com/2015/04/elasticsearch-and-spark-1-dot-3.html). Resources for Article: Further resources on this subject: Getting Started with Apache Spark DataFrames[article] Sabermetrics with Apache Spark[article] Getting Started with Apache Spark [article]
Read more
  • 0
  • 0
  • 4455

article-image-integrating-elasticsearch-hadoop-ecosystem
Packt
07 Oct 2015
14 min read
Save for later

Integrating Elasticsearch with the Hadoop ecosystem

Packt
07 Oct 2015
14 min read
In this article by Vishal Shukla, author of the book Elasticsearch for Hadoop, we will take a look at how ES-Hadoop can integrate with Pig and Spark with ease. Elasticsearch is great in getting insights into the indexed data. The Hadoop ecosystem does a great job in making Hadoop easily usable for different users by providing a comfortable interface. Some of the examples are Hive and Pig. Apart from these, Hadoop integrates well with other computing engines and platforms, such as Spark and Cascading. (For more resources related to this topic, see here.) Pigging out Elasticsearch For many use cases, Pig is one of the easiest ways to fiddle around with the data in the Hadoop ecosystem. Pig wins when it comes to ease of use and simple syntax for designing data flow pipelines without getting into complex programming. Assuming that you know Pig, we will cover how to move the data to and from Elasticsearch. If you don't know Pig yet, never mind. You can still carry on with the steps, and by the end of the article, you will at least know how to use Pig to perform data ingestion and reading with Elasticsearch. Setting up Apache Pig for Elasticsearch Let's start by setting up Apache Pig. At the time of writing this article, the latest Pig version available is 0.15.0. You can use the following steps to set up the same version: First, download the Pig distribution using the following command: $ sudo wget –O /usr/local/pig.tar.gz http://mirrors.sonic.net/apache/pig/pig-0.15.0/pig-0.15.0.tar.gz Then, extract Pig to the desired location and rename it to a convenient name: $ cd /userusr/local $ sudo tar –xvf pig.tar.gz $ sudo mv pig-0.15.0 pig Now, export the required environment variables by appending the following two lines in the /home/eshadoop/.bashrc file: export PIG_HOME=/usr/local/pig export PATH=$PATH:$PIG_HOME/bin You can either log out and relogin to see the newly set environment variables or source the environment configuration with the following command: $ source ~/.bashrc Now, start the job history server daemon with the following command: $ mr-jobhistory-daemon.sh start historyserver You should see the Pig console with the following command: $ pig grunt> It's easy to forget to start the job history daemon once you restart your machine or VM. You may make this daemon run on start up, or you need to ensure this manually. Now, we have Pig up and running. In order to use Pig with Elasticsearch, we must ensure that the ES-Hadoop JAR file is available in the Pig classpath. Let's take the ES-Hadoop JAR file and and import it to HDFS using the following steps: First, download the ES-Hadoop JAR used to develop the examples in this article, as shown in the following command: $ wget http://central.maven.org/maven2/org/elasticsearch/elasticsearch-hadoop/2.1.1/elasticsearch-hadoop-2.1.1.jar Then, move the downloaded JAR to a convenient name as follows: $ sudo mkdir /opt/lib Now, import the JAR to HDFS: $ hadoop fs –mkdir /lib $ hadoop fs –put elasticsearch-hadoop-2.1.1.jar /lib/elasticsearch-hadoop-2.1.1.jar Throughout this article, we will use a crime dataset that is tailored from the open dataset provided at https://data.cityofchicago.org/. This tailored dataset can be downloaded from http://www.packtpub.com/support, where all the code files required for this article are available. Once you have downloaded the dataset, import it to HDFS at /ch07/crime_data.csv. Importing data to Elasticsearch Let's import the crime dataset to Elasticsearch using Pig with ES-Hadoop. This provides the EsStorage class as Pig Storage. In order to use the EsStorage class, you need to have a registered ES-Hadoop JAR with Pig. You can register the JAR located in the local filesystem, HDFS, or other shared filesystems. The REGISTER command registers a JAR file that contains UDFs (User-defined functions) with Pig, as shown in the following code: grunt> REGISTER hdfs://localhost:9000/lib/elasticsearch-hadoop-2.1.1.jar; Then, load the CSV data file as a relation with the following code: grunt> SOURCE = load '/ch07/crimes_dataset.csv' using PigStorage(',') as (id:chararray, caseNumber:chararray, date:datetime, block:chararray, iucr:chararray, primaryType:chararray, description:chararray, location:chararray, arrest:boolean, domestic:boolean, lat:double,lon:double); This command reads the CSV fields and maps each token in the data to the respective field in the preceding command. The resulting relation, SOURCE, represents a relation with the Bag data structure that contains multiple Tuples. Now, generate the target Pig relation that has the structure that matches closely to the target Elasticsearch index mapping, as shown in the following code: grunt> TARGET = foreach SOURCE generate id, caseNumber, date, block, iucr, primaryType, description, location, arrest, domestic, TOTUPLE(lon, lat) AS geoLocation; Here, we need the nested object with the geoLocation name in the target Elasticsearch document. We can achieve this with a Tuple to represent the lat and lon fields. TOTUPLE() helps us to create this tuple. We then assigned the geoLocation alias for this tuple. Let's store the TARGET relationto the Elasticsearch index with the following code: grunt> STORE TARGET INTO 'esh_pig/crimes' USING org.elasticsearch.hadoop.pig.EsStorage('es.http.timeout = 5m', 'es.index.auto.create = true', 'es.mapping.names=arrest:isArrest, domestic:isDomestic', 'es.mapping.id=id'); We can specify the target index and type to store indexed documents. The EsStorage class can accept multiple Elasticsearch configurations.es.mapping.names maps the Pig field name to Elasticsearch document's field name. You can use Pig's field id to assign a custom _id value for the Elasticsearch document using the es.mapping.id option. Similarly, you can set the _ttl and _timestamp metadata fields as well. Pig uses just one reducer in the default configuration. It is recommended to change this behavior to have a parallelism that matches the number of shards available, as shown in the following command: grunt> SET default_parallel 5; Pig also combines the input splits, irrespective of its size. This makes it efficient for small files by reducing the number of mappers. However, this will give performance issues for large files. You can disable this behavior in the Pig script, as shown in the following command: grunt> SET pig.splitCombination FALSE; Executing the preceding commands will create the Elasticsearch index and import crime data documents. If you observe the created documents in Elasticsearch, you can see the geoLocation value isan array in the [-87.74274476, 41.87404405]format. This is because by default, ES-Hadoop ignores the tuple field names and simply converts them as an ordered array. If you wish to make your geoLocation field look similar to the key/value-based object with the lat/lon keys, you can do so by including the following configuration in EsStorage: es.mapping.pig.tuple.use.field.names=true Writing from the JSON source If you have inputs as a well-formed JSON file, you can avoid conversion and transformations and directly pass the JSON document to Elasticsearch for indexing purposes. You may have the JSON data in Pig as chararray, bytearray, or in any other form that translates to well-formed JSON by calling the toString() method, as shown in the following code: grunt> JSON_DATA = LOAD '/ch07/crimes.json' USING PigStorage() AS (json:chararray); grunt> STORE JSON_DATA INTO 'esh_pig/crimes_json' USING org.elasticsearch.hadoop.pig.EsStorage('es.input.json=true'); Type conversions Take a look at the the type mapping of the esh_pig index in Elasticsearch. It maps the geoLocation type to double. This is done because Elasticsearch inferred the double type based on the field type we specified in Pig. To map geoLocation to geo_point, you must create the Elasticsearch mapping for it manually before executing the script. Although Elasticsearch provides a data type detection based on the type of field in the incoming document, it is always good to create the type mapping beforehand in Elasticsearch. This is a one-time activity that you should do. Then, you can run the MapReduce, Pig, Hive, Cascading, or Spark jobs multiple times. This will avoid any surprises in the type detection. For your reference, here is a list of some of the field types of Pig and Elasticsearch that map to each other. The table doesn't list no-brainer and absolutely intuitive type mappings: Pig type Elasticsearch type chararray This specifies string bytearray This indicates binary tuple This denotes an array(default) or object bag This specifies an array map This denotes an object bigdecimal This indicates Not supported biginteger This denotes Not supported Reading data from Elasticsearch Reading data from Elasticsearch using Pig is as simple as writing a single command with the Elasticsearch query. Here is a snippet of how to print tuples that has crimes related to theft: grunt> REGISTER hdfs://localhost:9000/lib/elasticsearch-hadoop-2.1.1.jar grunt> ES = LOAD 'esh_pig/crimes' using org.elasticsearch.hadoop.pig.EsStorage('{"query" : { "term" : { "primaryType" : "theft" } } }'); grunt> dump ES; Executing the preceding commands will print the tuples Pig console. Giving Spark to Elasticsearch Spark is a distributed computing system that provides huge performance boost compared to Hadoop MapReduce. It works on an abstraction of RDD (Resilient-distributed Datasets). This can be created for any data residing in Hadoop. Without any surprises, ES-Hadoop provides easy integration with Spark by enabling the creation of RDD from the data in Elasticsearch. Spark's increasing support of integrating with various data sources, such as HDFS, Parquet, Avro, S3, Cassandra, relational databases, and streaming data makes it special when it comes to data integration. This means that when you use ES-Hadoop with Spark, you can make all these sources integrate with Elasticsearch easily. Setting up Spark In order to set up Apache Spark in order to execute a job, you can perform the following steps: First, download the Apache Spark distribution with the following command: $ sudo wget –O /usr/local/spark.tgzhttp://www.apache.org/dyn/closer.cgi/spark/spark-1.4.1/spark-1.4.1-bin-hadoop2.4.tgz Then, extract Spark to the desired location and rename it to a convenient name, as shown in the following command: $ cd /user/local $ sudo tar –xvf spark.tgz $ sudo mv spark-1.4.1-bin-hadoop2.4 spark Importing data to Elasticsearch To import the crime dataset to Elasticsearch with Spark, let's see how we can write a Spark job. We will continue using Java to write Spark jobs for consistency. Here are the driver program's snippets: SparkConf conf = new SparkConf().setAppName("esh-spark").setMaster("local[4]"); conf.set("es.index.auto.create", "true"); JavaSparkContext context = new JavaSparkContext(conf); Set up the SparkConf object to configure the spark job. As always, you can also set most options (such as es.index.auto.create) and other configurations that we have seen throughout the article. Using this configuration, we created the JavaSparkContext object as follows: JavaRDD<String> textFile = context.textFile("hdfs://localhost:9000/ch07/crimes_dataset.csv"); Read the crime data CSV file as JavaRDD. Here, RDD is still of the type String that represents each line: JavaRDD<Crime> dataSplits = textFile.map(new Function<String, Crime>() { @Override public Crime call(String line) throws Exception { CSVParser parser = CSVParser.parse(line, CSVFormat.RFC4180); Crime c = new Crime(); CSVRecord record = parser.getRecords().get(0); c.setId(record.get(0)); .. .. String lat = record.get(10); String lon = record.get(11); Map<String, Double> geoLocation = new HashMap<>(); geoLocation.put("lat", StringUtils.isEmpty(lat)? null:Double.parseDouble(lat)); geoLocation.put("lon",StringUtils.isEmpty(lon)?null:Double. parseDouble(lon)); c.setGeoLocation(geoLocation); return c; } }); In the preceding snippet, we called the map() method on JavaRDD to map each of the input line to the Crime object. Note that we created a simple JavaBean class called Crime that implements the Serializable interface and maps to the Elasticsearch document structure. Using CSVParser, we parsed each field into the Crime object. We mapped nested the geoLocation object by embedding Map in the Crime object. This map is populated with the lat and lon fields. This map() method returns another JavaRDD that contains the Crime objects, as shown in the following code: JavaEsSpark.saveToEs(dataSplits, "esh_spark/crimes"); Save JavaRDD<Crime> to Elasticsearch with the JavaEsSpark class provided by Elasticsearch. For all the ES-Hadoop integrations, such as Pig, Hive, Cascading, Apache Storm, and Spark, you can use all the standard ES-Hadoop configurations and techniques. This includes dynamic/multiresource writes with a pattern similar to esh_spark/{primaryType} and use JSON strings to directly import the data to Elasticsearch as well. To control the Elasticsearch document metadata from being indexed, you can use the saveToEsWithMeta() method of JavaEsSpark. You can pass an instance of JavaPairRDD that contains Tuple2<Metadata, Object>, where Metadata represents a map that has the key/value pairs of the document metadata fields, such as id, ttl, timestamp, and version. Using SparkSQL ES-Hadoop also bridges Elasticsearch with the SparkSQL module. SparkSQL 1.3+ versions provide the DataFrame abstraction that represent a collection of Row. We will not discuss the details of DataFrame here. ES-Hadoop lets you persist your DataFrame instance to Elasticsearch transparently. Let's see how we can do this with the following code: SQLContext sqlContext = new SQLContext(context); DataFrame df = sqlContext.createDataFrame(dataSplits, Crime.class); Create an SQLContext instance using the JavaSparkContext instance. Using the SqlContextSqlContext instance, you can create DataFrame by calling the createDataFrame() method and passing the existing JavaRDD<T> and Class<T>, where T is a JavaBean class that implements the Serializable interface. Note that the passing class instance is required to infer a schema for DataFrame. If you wish to use nonJavaBean-based RDD, you can create the schema manually. The article source code contains the implementations of both the approaches for your reference. Take a look at the following code: JavaEsSparkSQL.saveToEs(df, "esh_sparksql/crimes_reflection"); Once you have the DataFrame instance, you can save it to Elasticsearch with the JavaEsSparkSQL class, as shown in the preceding code. Reading data from Elasticsearch Here is the snippet of SparkEsReader that finds crimes related to theft: JavaRDD<Map<String, Object>> esRDD = JavaEsSpark.esRDD(context, "esh_spark/crimes", "{"query" : { "term" : { "primaryType" : "theft" } } }").values(); for(Map<String,Object> item: esRDD.collect()){ System.out.println(item); } We used the same JavaEsSpark class to create RDD with documents that match the Elasticsearch query. Using SparkSQL ES-Hadoop provides a org.elasticsearch.spark.sql data source provider to read the data from Elasticsearch using SparkSQL, as shown in the following code: Map<String, String> options = new HashMap<>(); options.put("pushdown","true"); options.put("es.nodes","localhost"); DataFrame df = sqlContext.read() .options(options) .format("org.elasticsearch.spark.sql") .load("esh_sparksql/crimes_reflection"); The preceding code snippet uses the org.elasticsearch.spark.sql data source to load data from Elasticsearch. You can set the pushdown option to true to push the query execution down to Elasticsearch. This greatly increases its efficiency as the query execution is collocated where the data resides, as shown in the following code: df.registerTempTable("crimes"); DataFrame theftCrimes = sqlContext.sql("SELECT * FROM crimes WHERE primaryType='THEFT'"); for(Row row: theftCrimes.javaRDD().collect()){ System.out.println(row); } We registered table with the data frame and executed the SQL query on SqlContext. Note that we need to collect the final results locally to print in a driver class. Summary In this article, we looked at the various Hadoop ecosystem technologies. We set up Pig with ES-Hadoop and developed the script to interact with Elasticsearch. You also learned how to use ES-Hadoop to integrate Elasticsearch with Spark and empower it with powerful SQL engine SparkSQL. Resources for Article: Further resources on this subject: Extending ElasticSearch with Scripting [Article] Elasticsearch Administration [Article] Downloading and Setting Up ElasticSearch [Article]
Read more
  • 0
  • 0
  • 4453
article-image-what-are-ssas-2012-dimensions-and-cube
Packt
29 Aug 2013
7 min read
Save for later

What are SSAS 2012 dimensions and cube?

Packt
29 Aug 2013
7 min read
(For more resources related to this topic, see here.) What is SSAS? SQL Server Analysis Services is an online analytical processing tool that highly boosts the different types of SQL queries and calculations that are accepted in the business intelligence environment. It looks like a relation database, but it has differences. SSAS does not replace the requirement of relational databases, but if you combine the two, it would help to develop the business intelligence solutions. Why do we need SSAS? SSAS provide a very clear graphical interface for the end users to build queries. It is a kind of cache that we can use to speed up reporting. In most real scenarios where SSAS is used, there is a full copy of the data in the data warehouse. All reporting and analytic queries are run against SSAS rather than against the relational database. Today's modern relational databases include many features specifically aimed at BI reporting. SSAS are database services specifically designed for this type of workload, and in most cases it has achieved much better query performance. SSAS 2012 architecture In this article we will explain about the architecture of SSAS. The first and most important point to make about SSAS 2012 is that it is really two products in one package. It has had a few advancements relating to performance, scalability, and manageability. This new version of SSAS that closely resembles PowerPivot uses the tabular model. When installing SSAS, we must select either the tabular model or multidimensional model for installing an instance that runs inside the server; both data models are developed under the same code but sometimes both are treated separately. The concepts included in designing both data models are different, and we can't turn a tabular database into a multidimensional database, or vice versa without rebuilding everything from the start. The main point of view of the end users is that both data models do almost the same things and appear almost equally when used through a client tool such as Excel. The tabular model A concept of building a database using the tabular model is very similar to building it in a relational database. An instance of Analysis Services can hold many databases, and each database can be looked upon as a self-contained collection of objects and data relating to a single business solution. If we are writing reports or analyzing data and we find that we need to run queries on multiple databases, we probably have made a design mistake somewhere because everything we need should be contained within an individual database. Tabular models are designed by using SQL Server Data Tools (SSDT), and a data project in SSDT mapping onto a database in Analysis Services. The multidimensional model This data model is very similar to the tabular model. Data is managed in databases, and databases are designed in SSDT, which are in turn managed by using SQL Server Management Studio. The differences may become similar below the database level, where the multidimensional data model rather than relational concepts are accepted. In the multidimensional model, data is modeled as a series of cubes and dimensions and not tables. The future of Analysis Services We have two data models inside SSAS, along with two query and calculation languages; it is clearly not an ideal state of affairs. It means we have to select a data model to use at the start of our project, when we might not even know enough about our need to gauge which one is appropriate. It also means that anyone who decides to specialize in SSAS has to learn two technologies. Microsoft has very clearly said that the multidimensional model is not scrapped and that the tabular model is not its replacement. It is just like saying that the new advanced features for the multidimensional data model will be released in future versions of SSAS. The fact that the tabular and multidimensional data models share some of the same code suggests that some new features could easily be developed for both models simultaneously. What's new in SSAS 2012? As we know, there is no easy way of transferring a multidimensional data model into a tabular data model. We may have many tools in the market that claim to make this transition with a few mouse clicks, but such tools could only ever work for very simple multidimensional data models and would not save much development time. Therefore, if we already have a mature multidimensional implementation and the in-house skills to develop and maintain it, we may find the following improvements in SSAS 2012 useful. Ease of use If we are starting an SSAS 2012 project with no previous multidimensional or OLAP experience, it is very likely that we will find a tabular model much easier to learn than a multidimensional one. Not only are the concepts much easier to understand, especially if we are used to working with relational databases, but also the development process is much more straightforward and there are far fewer features to learn. Compatibility with PowerPivot The tabular data model and PowerPivot are the same in the way their models are designed. The user interfaces used are practically the same, as both the interfaces use DAX. PowerPivot models can be imported into SQL Server Data Tools to generate a tabular model, although the process does not work the other way around, and a tabular model cannot be converted to a PowerPivot model. Processing performance characteristics If we compare the processing performance of the multidimensional and tabular data models, that will become difficult. It may be slower to process a large table following the tabular data model than the equivalent measure group in a multidimensional one because a tabular data model can't process partitions in the same table at the same time, whereas a multidimensional model can process partitions in the same measure group at the same time. What is SSAS dimension? A database dimension is a collection of related objects; in other words, attributes; they provide the information about fact data in one or more cubes. Typical attributes in a product dimension are product name, product category, line, size, and price. Attributes can be organized into user-defined hierarchies that provide the paths to assist users when they browse through the data in a cube. By default these attributes are visible as attribute hierarchies, and they can be used to understand fact data in a cube. What is SSAS cube? A cube is a multidimensional structure that contains information for analytical purposes; the main constituents of a cube are dimensions and measures. Dimensions define the structure of a cube that you use to slice and dice over, and measures provide the aggregated numerical values of interest to the end user. As a logical structure, a cube allows a client application to retrieve values—of measures—as if they are contained in cells in the cube. The cells are defined for every possible summarized value. A cell, in the cube, is defined by the intersection of dimension members and contains the aggregated values of the measures at that specific intersection. Summary We talked about the special new features and services present, what you can do with them, and why they’re so great. Resources for Article: Further resources on this subject: Creating an Analysis Services Cube with Visual Studio 2008 - Part 1 [Article] Performing Common MDX-related Tasks [Article] How to Perform Iteration on Sets in MDX [Article]
Read more
  • 0
  • 0
  • 4453

article-image-dimensionality-reduction
Packt
03 Jan 2017
15 min read
Save for later

Dimensionality Reduction

Packt
03 Jan 2017
15 min read
In this article by Ashish Kumar and Avinash Paul the authors of the book Mastering Text Mining with R, we will Data volume and high dimensions pose an astounding challenge in text mining tasks. Inherent noise and the computational cost of processing huge amount of datasets make it even more sarduous. The science of dimensionality reduction lies in the art of losing out on only a commensurately small amount of information and still being able to reduce the high dimension space into a manageable proportion. (For more resources related to this topic, see here.) For classification and clustering techniques to be applied to text data, for different natural language processing activities, we need to reduce the dimensions and noise in the data so that each document can be represented using fewer dimensions, thus significantly reducing the noise that can hinder the performance. The curse of dimensionality Topic modeling and document clustering are common text mining activities, but the text data can be very high-dimensional, which can cause a phenomenon called curse of dimensionality. Some literature also calls it concentration of measure: Distance is attributed to all the dimensions and assumes each of them to have the same effect on the distance. The higher the dimensions, the more similar things appear to each other. The similarity measures do not take into account the association of attributes, which may result in inaccurate distance estimation. The number of samples required per attribute increases exponentially with the increase in dimensions. A lot of dimensions might be highly correlated with each other, thus causing multi-collinearity. Extra dimensions cause a rapid volume increase that can result in high sparsity, which is a major issue in any method that requires statistical significance. Also, it causes huge variance in estimates, near duplicates, and poor predictors. Distance concentration and computational infeasibility Distance concentration is a phenomenon associated with high-dimensional space wherein pairwise distances or dissimilarity between points appear indistinguishable. All the vectors in high dimensions appear to be orthogonal to each other. The distances between each data point to its neighbors, farthest or nearest, become equal. This totally jeopardizes the utility of methods that use distance based measures. Let's consider that the number of samples is n and the number of dimensions is d. If d is very large, the number of samples may prove to be insufficient to accurately estimate the parameters. For the datasets with number of dimensions d, the number of parameters in the covariance matrix will be d^2. In an ideal scenario, n should be much larger than d^2, to avoid overfitting. In general, there is an optimal number of dimensions to use for a given fixed number of samples. While it may feel like a good idea to engineer more features, if we are not able to solve a problem with less number of features. But the computational cost and model complexity increases with the rise in number of dimensions. For instance, if n number of samples look to be dense enough for a one-dimensional feature space. For a k-dimensional feature space, n^k samples would be required. Dimensionality reduction Complex and noisy characteristics of textual data with high dimensions can be handled by dimensionality reduction techniques. These techniques reduce the dimension of the textual data while still preserving its underlying statistics. Though the dimensions are reduced, it is important to preserve the inter-document relationships. The idea is to have minimum number of dimensions, which can preserve the intrinsic dimensionality of the data. A textual collection is mostly represented in form of a term document matrix wherein we have the importance of each term in a document. The dimensionality of such a collection increases with the number of unique terms. If we were to suggest the simplest possible dimensionality reduction method, that would be to specify the limit or boundary on the distribution of different terms in the collection. Any term that occurs with a significantly high frequency is not going to be informative for us, and the barely present terms can undoubtedly be ignored and considered as noise. Some examples of stop words are is, was, then, and the. Words that generally occur with high frequency and have no particular meaning are referred to as stop words. Words that occur just once or twice are more likely to be spelling errors or complicated words, and hence both these and stop words should not be considered for modeling the document in the Term Document Matrix (TDM). We will discuss a few dimensionality reduction techniques in brief and dive into their implementation using R. Principal component analysis Principal component analysis (PCA) reveals the internal structure of a dataset in a way that best explains the variance within the data. PCA identifies patterns to reduce the dimensions of the dataset without significant loss of information. The main aim of PCA is to project a high-dimensional feature space into a smaller subset to decrease computational cost. PCA helps in computing new features, which are called principal components; these principal components are uncorrelated linear combinations of the original features projected in the direction of higher variability. The important point is to map the set of features into a matrix, M, and compute the eigenvalues and eigenvectors. Eigenvectors provide simpler solutions to problems that can be modeled using linear transformations along axes by stretching, compressing, or flipping. Eigenvalues provide the length and magnitude of eigenvectors where such transformations occur. Eigenvectors with greater eigenvalues are selected in the new feature space because they enclose more information than eigenvectors with lower eigenvalues for a data distribution. The first principle component has the greatest possible variance, that is, the largest eigenvalues compared with the next principal component uncorrelated, relative to the first PC. The nth PC is the linear combination of the maximum variance that is uncorrelated with all previous PCs. PCA comprises the following steps: Compute the n-dimensional mean of the given dataset. Compute the covariance matrix of the features. Compute the eigenvectors and eigenvalues of the covariance matrix. Rank/sort the eigenvectors by descending eigenvalue. Choose x eigenvectors with the largest eigenvalues. Eigenvector values represent the contribution of each variable to the principal component axis. Principal components are oriented in the direction of maximum variance in m-dimensional space. PCA is one of the most widely used multivariate methods for discovering meaningful, new, informative, and uncorrelated features. This methodology also reduces dimensionality by rejecting low-variance features and is useful in reducing the computational requirements for classification and regression analysis. Using R for PCA R also has two inbuilt functions for accomplishing PCA: prcomp() and princomp(). These two functions expect the dataset to be organized with variables in columns and observations in rows and has a structure like a data frame. They also return the new data in the form of a data frame, and the principal components are given in columns. prcomp() and princomp() are similar functions used for accomplishing PCA; they have a slightly different implementation for computing PCA. Internally, the princomp() function performs PCA using eigenvectors. The prcomp() function uses a similar technique known as singular value decomposition (SVD). SVD has slightly better numerical accuracy, so prcomp() is generally the preferred function. princomp() fails in situations if the number of variables is larger than the number of observations. Each function returns a list whose class is prcomp() or princomp(). The information returned and terminology is summarized in the following table: prcomp() princomp() Explanation sdev sdev Standard deviation of each column Rotations Loading Principle components Center Center Subtracted value of each row or column to get the center data Scale Scale Scale factors used X Score The rotated data   n.obs Number of observations of each variable   Call The call to function that created the object Here's a list of the functions available in different R packages for performing PCA: PCA(): FactoMineR package acp(): amap package prcomp(): stats package princomp(): stats package dudi.pca(): ade4 package pcaMethods: This package from Bioconductor has various convenient  methods to compute PCA Understanding the FactoMineR package FactomineR is a R package that provides multiple functions for multivariate data analysis and dimensionality reduction. The functions provided in the package not only deals with quantitative data but also categorical data. Apart from PCA, correspondence and multiple correspondence analyses can also be performed using this package: library(FactoMineR) data<-replicate(10,rnorm(1000)) result.pca = PCA(data[,1:9], scale.unit=TRUE, graph=T) print(result.pca) Results for the principal component analysis (PCA). The analysis was performed on 1,000 individuals, described by nine variables. The results are available in the following objects: Name Description $eig Eigenvalues $var Results for the variables $var$coord coord. for the variables $var$cor Correlations variables - dimensions $var$cos2 cos2 for the variables $var$contrib Contributions of the variables $ind Results for the individuals $ind$coord coord. for the individuals $ind$cos2 cos2 for the individuals $ind$contrib Contributions of the individuals $call Summary statistics $call$centre Mean of the variables $call$ecart.type Standard error of the variables $call$row.w Weights for the individuals $call$col.w Weights for the variables Eigenvalue percentage of variance cumulative percentage of variance: comp 1 1.1573559 12.859510 12.85951 comp 2 1.0991481 12.212757 25.07227 comp 3 1.0553160 11.725734 36.79800 comp 4 1.0076069 11.195632 47.99363 comp 5 0.9841510 10.935011 58.92864 comp 6 0.9782554 10.869505 69.79815 comp 7 0.9466867 10.518741 80.31689 comp 8 0.9172075 10.191194 90.50808 comp 9 0.8542724 9.491916 100.00000 Amap package Amap is another package in the R environment that provides tools for clustering and PCA. It is an acronym for Another Multidimensional Analysis Package. One of the most widely used functions in this package is acp(), which does PCA on a data frame. This function is akin to princomp() and prcomp(), except that it has slightly different graphic represention. For more intricate details, refer to the CRAN-R resource page: https://cran.r-project.org/web/packages/lLibrary(amap/amap.pdf Library(amap acp(data,center=TRUE,reduce=TRUE) Additionally, weight vectors can also be provided as an argument. We can perform a robust PCA by using the acpgen function in the amap package: acpgen(data,h1,h2,center=TRUE,reduce=TRUE,kernel="gaussien") K(u,kernel="gaussien") W(x,h,D=NULL,kernel="gaussien") acprob(x,h,center=TRUE,reduce=TRUE,kernel="gaussien") Proportion of variance We look to construct components and to choose from them, the minimum number of components, which explains the variance of data with high confidence. R has a prcomp() function in the base package to estimate principal components. Let's learn how to use this function to estimate the proportion of variance, eigen facts, and digits: pca_base<-prcomp(data) print(pca_base) The pca_base object contains the standard deviation and rotations of the vectors. Rotations are also known as the principal components of the data. Let's find out the proportion of variance each component explains: pr_variance<- (pca_base$sdev^2/sum(pca_base$sdev^2))*100 pr_variance [1] 11.678126 11.301480 10.846161 10.482861 10.176036 9.605907 9.498072 [8] 9.218186 8.762572 8.430598 pr_variance signifies the proportion of variance explained by each component in descending order of magnitude. Let's calculate the cumulative proportion of variance for the components: cumsum(pr_variance) [1] 11.67813 22.97961 33.82577 44.30863 54.48467 64.09057 73.58864 [8] 82.80683 91.56940 100.00000 Components 1-8 explain the 82% variance in the data. Singular vector decomposition Singular vector decomposition (SVD) is a dimensionality reduction technique that gained a lot of popularity in recent times after the famous Netflix Movie Recommendation challenge. Since its inception, it has found its usage in many applications in statistics, mathematics, and signal processing. It is primarily a technique to factorize any matrix; it can be real or a complex matrix. A rectangular matrix can be factorized into two orthonormal matrices and a diagonal matrix of positive real values. An m*n matrix is considered as m points in n-dimensional space; SVD attempts to find the best k dimensional subspace that fits the data: SVD in R is used to compute approximations of singular values and singular vectors of large-scale data matrices. These approximations are made using different types of memory-efficient algorithm, and IRLBA is one of them (named after Lanczos bi-diagonalization (IRLBA) algorithm). We shall be using the irlba package here in order to implement SVD. Implementation of SVD using R The following code will show the implementation of SVD using R: # List of packages for the session packages = c("foreach", "doParallel", "irlba") # Install CRAN packages (if not already installed) inst <- packages %in% installed.packages() if(length(packages[!inst]) > 0) install.packages(packages[!inst]) # Load packages into session lapply(packages, require, character.only=TRUE) # register the parallel session for registerDoParallel(cores=detectCores(all.tests=TRUE)) std_svd <- function(x, k, p=25, iter=0 1 ) { m1 <- as.matrix(x) r <- nrow(m1) c <- ncol(m1) p <- min( min(r,c)-k,p) z <- k+p m2 <- matrix ( rnorm(z*c), nrow=c, ncol=z) y <- m1 %*% m2 q <- qr.Q(qr(y)) b<- t(q) %*% m1 #iterations b1<-foreach( i=i1:iter ) %dopar% { y1 <- m1 %*% t(b) q1 <- qr.Q(qr(y1)) b1 <- t(q1) %*% m1 } b1<-b1[[iter]] b2 <- b1 %*% t(b1) eigens <- eigen(b2, symmetric=T) result <- list() result$svalues <- sqrt(eigens$values)[1:k] u1=eigens$vectors[1:k,1:k] result$u <- (q %*% eigens$vectors)[,1:k] result$v <- (t(b) %*% eigens$vectors %*% diag(1/eigens$values))[,1:k] return(result) } svd<- std_svd(x=data,k=5)) # singular vectors svd$svalues [1] 35.37645 33.76244 32.93265 32.72369 31.46702 We obtain the following values after running SVD using the IRLBA algorithm: d: approximate singular values. u: nu approximate left singular vectors v: nv approximate right singular vectors iter: # of IRLBA algorithm iterations mprod: # of matrix vector products performed These values can be used for obtaining results of SVD and understanding the overall statistics about how the algorithm performed. Latent factors # svd$u, svd$v dim(svd$u) #u value after running IRLBA [1] 1000 5 dim(svd$v) #v value after running IRLBA [1] 10 5 A modified version of the previous function can be achieved by altering the power iterations for a robust implementation: foreach( i = 1:iter )%dopar% { y1 <- m1 %*% t(b) y2 <- t(y1) %*% y1 r2 <- chol(y2, pivot = T) q1 <- y2 %*% solve(r2) b1 <- t(q1) %*% m1 } b2 <- b1 %*% t(b1) Some other functions available in R packages are as follows: Functions Package svd() svd Irlba() irlba svdImpute bcv ISOMAP – moving towards non-linearity ISOMAP is a nonlinear dimension reduction method and is representative of isometric mapping methods. ISOMAP is one of the approaches for manifold learning. ISOMAP finds the map that preserves the global, nonlinear geometry of the data by preserving the geodesic manifold inter-point distances. Like multi-dimensional scaling, ISOMAP creates a visual presentation of distance of a number of objects. Geodesic is the shortest curve along the manifold connecting two points induced by a neighborhood graph. Multi-dimensional scaling uses the Euclidian distance measure; since the data is in a nonlinear format, ISOMPA uses geodesic distance. ISOMAP can be viewed as an extension of metric multi-dimensional scaling. At a very high level, ISOMAP can be describes in four steps: Determine the neighbor of each point Construct a neighborhood graph Compute the shortest distance path between all pairs Construct k-dimensional coordinate vectors by applying MDS Geodesic distance approximation is basically calculated in three ways: Neighboring points: input-space distance Faraway points: a sequence of short hops between neighboring points Method: Finding shortest paths in a graph with edges connecting neighboring data points source("http://bioconductor.org/biocLite.R") biocLite("RDRToolbox") library('RDRToolbox') swiss_Data=SwissRoll(N = 1000, Plot=TRUE) x=SwissRoll() open3d() plot3d(x, col=rainbow(1050)[-c(1:50)],box=FALSE,type="s",size=1) simData_Iso = Isomap(data=swiss_Data, dims=1:10, k=10,plotResiduals=TRUE) library(vegan)data(BCI) distance <- vegdist(BCI) tree <- spantree(dis) pl1 <- ordiplot(cmdscale(dis), main="cmdscale") lines(tree, pl1, col="red") z <- isomap(distance, k=3) rgl.isomap(z, size=4, color="red") pl2 <- plot(isomap(distance, epsilon=0.5), main="isomap epsilon=0.5") pl3 <- plot(isomap(distance, k=5), main="isomap k=5") pl4 <- plot(z, main="isomap k=3") Summary The idea of this article was to get you familiar with some of the generic dimensionality reduction methods and their implementation using R language. We discussed a few packages that provide functions to perform these tasks. We also covered a few custom functions that can be utilized to perform these tasks. Kudos, you have completed the basics of text mining with R. You must be feeling confident about various data mining methods, text mining algorithms (related to natural language processing of the texts) and after reading this article, dimensionality reduction. If you feel a little low on confidence, do not be upset. Turn a few pages back and try implementing those tiny code snippets on your own dataset and figure out how they help you understand your data. Remember this - to mine something, you have to get into it by yourself. This holds true for text as well. Resources for Article: Further resources on this subject: Data Science with R [Article] Machine Learning with R [Article] Data mining [Article]
Read more
  • 0
  • 0
  • 4433

article-image-oracle-business-intelligence-drilling-data-and-down
Packt
21 Oct 2010
5 min read
Save for later

Oracle Business Intelligence: Drilling Data Up and Down

Packt
21 Oct 2010
5 min read
What is data drilling? In terms of Oracle Discoverer, drilling is a technique that enables you to quickly navigate through worksheet data, finding the answers to the questions facing your business. As mentioned, depending on your needs, you can use drilling to view the data you're working with in deeper detail or, in contrast, drill it up to a higher level. The drilling to detail technique enables you to look at the values making up a particular summary value. Also, you can drill to related items, adding related information that is not currently included in the worksheet. So, Discoverer supports a set of drilling tools, including the following: Drilling up and down Drilling to a related item Drilling to detail Drilling out The following sections cover the above tools in detail, providing examples on how you might use them. Drilling to a related item Let's begin with a discussion on how to drill to a related item, adding the detailed information for a certain item. As usual, this is best understood by example. Suppose you want to drill from the Maya Silver item, which can be found on the left axis of the worksheet, to the Orddate:Day item. Here are the steps to follow: Let's first create a copy of the worksheet to work with in this example. To do this, move to the worksheet discussed in the preceding example and select the Edit | Duplicate Worksheet | As Crosstab menu of Discoverer. In the Duplicate as Crosstab dialog, just click OK. As a result a copied worksheet should appear in the workbook. On the worksheet, right-click the Maya Silver item and select Drill… in the pop-up menu: As a result, the Drill dialog should appear. In the Drill dialog, select Drill to a Related Item in the Where do you want to drill to? select box and then choose the Orddate:Day item, as shown in the following screenshot: Then, click OK to close the dialog and rearrange the data on the worksheet. The reorganized worksheet should now look like the following one: As you can see, this shows the Maya Silver item broken down into day sales per product. Now suppose you want to see a more detailed view of the Maya Silver item and break it out further into product category. Right-click the Maya Silver item and select Drill… in the pop-up menu. In the Drill dialog, select Drill to a Related Item in the Where do you want to drill to? select box and then choose the Category item. Next, click OK.The resulting worksheet should look now like this: As you can see, the result of the drilling operations you just performed is that you can see the dollar amount for Maya Silver detailed by category, by day, by product. You may be asking yourself if it's possible to change the order in which the Maya Silver record is detailed. Say, you want to see it detailed in the following order: by day, by category, and finally by product. The answer is sure. On the left axis of the worksheet, drag the Orddate:Day item (the third from the left) to the second position within the same left axis, just before the Category item, as shown in the following screenshot: As a result, you should see that the data on the worksheet has been rearranged as shown in the following screenshot: Having just a few rows in the underlying tables, as we have here, is OK for demonstration purposes, since it results in compact screenshots. To see more meaningful figures on the worksheet though, you might insert more rows into the orderitems, orders, and products underlying tables. Once you're done with it, you can click the Refresh button on the Discoverer toolbar to see an updated worksheet. Select the File | Save menu option of Discoverer to save the worksheet discussed here. Drilling up and down As the name implies, drilling down is a technique you can use to float down a drill hierarchy to see data in more detail. And drilling up is the reverse operation, which you can use to slide up a drill hierarchy to see consolidated data. But what is a drill hierarchy? Working with drill hierarchies A drill hierarchy represents a set of items related to each other according to the foreign key relationships in the underlying tables. If a worksheet item is associated with a drill hierarchy, you can look at that hierarchy by clicking the drill icon located at the left of the heading of the worksheet item. Suppose you want to look at the hierarchy associated with the Orddate item located on our worksheet at the top axis. To do this, click the Orddate drill icon. As a result, you should see the menu shown in the following screeenshot: As you can see, you can drill up here from Orddate to Year, Quarter, or Month. The next screenshot illustrates what you would have if you chose Month. It's important to note that you may have more than one hierarchy associated with a worksheet item. In this case, you can move on to the hierarchy you want to use through the All Hierarchies option on the drill menu.
Read more
  • 0
  • 0
  • 4424
article-image-logistic-regression-using-tensorflow
Packt
06 Mar 2018
9 min read
Save for later

Logistic Regression Using TensorFlow

Packt
06 Mar 2018
9 min read
In this article, by PKS Prakash and Achyutuni Sri Krishna Rao, authors of R Deep Learning Cookbook we will learn how to Perform logistic regression using TensorFlow. In this recipe, we will cover the application of TensorFlow in setting up a logistic regression model. The example will use a similar dataset to that used in the H2O model setup. (For more resources related to this topic, see here.) What is TensorFlow TensorFlow is another open source library developed by the Google Brain Team to build numerical computation models using data flow graphs. The core of TensorFlow was developed in C++ with the wrapper in Python. The tensorflow package in R gives you access to the TensorFlow API composed of Python modules to execute computation models. TensorFlow supports both CPU- and GPU-based computations. The tensorflow package in R calls the Python tensorflow API for execution, which is essential to install the tensorflow package in both R and Python to make R work. The following are the dependencies for tensorflow: Python 2.7 / 3.x  R (>3.2) devtools package in R for installing TensorFlow from GitHub  TensorFlow in Python pip Getting ready The code for this section is created on Linux but can be run on any operating system. To start modeling, load the tensorflow package in the environment. R loads the default TensorFlow environment variable and also the NumPy library from Python in the np variable:  library("tensorflow") # Load TensorFlow np <- import("numpy") # Load numpy library How to do it... The data is imported using a standard function from R, as shown in the following code. The data is imported using the read.csv file and transformed into the matrix format followed by selecting the features used to model as defined in xFeatures and yFeatures. The next step in TensorFlow is to set up a graph to run optimization: # Loading input and test data xFeatures = c("Temperature", "Humidity", "Light", "CO2", "HumidityRatio") yFeatures = "Occupancy" occupancy_train <- as.matrix(read.csv("datatraining.txt",stringsAsFactors = T)) occupancy_test <- as.matrix(read.csv("datatest.txt",stringsAsFactors = T)) # subset features for modeling and transform to numeric values occupancy_train<-apply(occupancy_train[, c(xFeatures, yFeatures)], 2, FUN=as.numeric) occupancy_test<-apply(occupancy_test[, c(xFeatures, yFeatures)], 2, FUN=as.numeric) # Data dimensions nFeatures<-length(xFeatures) nRow<-nrow(occupancy_train) Before setting up the graph, let's reset the graph using the following command: # Reset the graph tf$reset_default_graph() Additionally, let's start an interactive session as it will allow us to execute variables without referring to the session-to-session object: # Starting session as interactive session sess<-tf$InteractiveSession() Define the logistic regression model in TensorFlow: # Setting-up Logistic regression graph x <- tf$constant(unlist(occupancy_train[, xFeatures]), shape=c(nRow, nFeatures), dtype=np$float32) # W <- tf$Variable(tf$random_uniform(shape(nFeatures, 1L))) b <- tf$Variable(tf$zeros(shape(1L))) y <- tf$matmul(x, W) + b The input feature x is defined as a constant as it will be an input to the system. The weight W and bias b are defined as variables that will be optimized during the optimization process. The y is set up as a symbolic representation between x, W, and b. The weight W is set up to initialize random uniform distribution and b is assigned the value zero.  The next step is to set up the cost function for logistic regression: # Setting-up cost function and optimizer y_ <- tf$constant(unlist(occupancy_train[, yFeatures]), dtype="float32", shape=c(nRow, 1L)) cross_entropy<- tf$reduce_mean(tf$nn$sigmoid_cross_entropy_with_logits(labe ls=y_, logits=y, name="cross_entropy")) optimizer <- tf$train$GradientDescentOptimizer(0.15)$minimize(cross_entr opy) # Start a session init <- tf$global_variables_initializer() sess$run(init) Execute the gradient descent algorithm for the optimization of weights using cross entropy as the loss function: # Running optimization for (step in 1:5000) { sess$run(optimizer) if (step %% 20== 0) cat(step, "-", sess$run(W), sess$run(b), "==>", sess$run(cross_entropy), "n") } How it works... The performance of the model can be evaluated using AUC: # Performance on Train library(pROC) ypred <- sess$run(tf$nn$sigmoid(tf$matmul(x, W) + b)) roc_obj <- roc(occupancy_train[, yFeatures], as.numeric(ypred)) # Performance on test nRowt<-nrow(occupancy_test) xt <- tf$constant(unlist(occupancy_test[, xFeatures]), shape=c(nRowt, nFeatures), dtype=np$float32) ypredt <- sess$run(tf$nn$sigmoid(tf$matmul(xt, W) + b)) roc_objt <- roc(occupancy_test[, yFeatures], as.numeric(ypredt)). AUC can be visualized using the plot.auc function from the pROC package, as shown in the screenshot following this command. The performance for training and testing (holdout) is very similar. plot.roc(roc_obj, col = "green", lty=2, lwd=2) plot.roc(roc_objt, add=T, col="red", lty=4, lwd=2) Performance of logistic regression using TensorFlow Visualizing TensorFlow graphs TensorFlow graphs can be visualized using TensorBoard. It is a service that utilizes TensorFlow event files to visualize TensorFlow models as graphs. Graph model visualization in TensorBoard is also used to debug TensorFlow models. Getting ready TensorBoard can be started using the following command in the terminal: $ tensorboard --logdir home/log --port 6006 The following are the major parameters for TensorBoard: --logdir : To map to the directory to load TensorFlow events --debug: To increase log verbosity  --host: To define the host to listen to its localhost (127.0.0.1) by default  --port: To define the port to which TensorBoard will serve The preceding command will launch the TensorFlow service on localhost at port 6006, as shown in the following screenshot:                                                                                                                                                         TensorBoard The tabs on the TensorBoard capture relevant data generated during graph execution. How to do it... The section covers how to visualize TensorFlow models and output in TernsorBoard.  To visualize summaries and graphs, data from TensorFlow can be exported using the FileWriter command from the summary module. A default session graph can be added using the following command:  # Create Writer Obj for log log_writer = tf$summary$FileWriter('c:/log', sess$graph) The graph for logistic regression developed using the preceding code is shown in the following screenshot:                                                                                 Visualization of the logistic regression graph in TensorBoard Similarly, other variable summaries can be added to the TensorBoard using correct summaries, as shown in the following code: # Adding histogram summary to weight and bias variable w_hist = tf$histogram_summary("weights", W) b_hist = tf$histogram_summary("biases", b) Create a cross entropy evaluation for test. An example script to generate the cross entropy cost function for test and train is shown in the following command: # Set-up cross entropy for test nRowt<-nrow(occupancy_test) xt <- tf$constant(unlist(occupancy_test[, xFeatures]), shape=c(nRowt, nFeatures), dtype=np$float32) ypredt <- tf$nn$sigmoid(tf$matmul(xt, W) + b) yt_ <- tf$constant(unlist(occupancy_test[, yFeatures]), dtype="float32", shape=c(nRowt, 1L)) cross_entropy_tst<- tf$reduce_mean(tf$nn$sigmoid_cross_entropy_with_logits(labe ls=yt_, logits=ypredt, name="cross_entropy_tst")) Add summary variables to be collected: # Add summary ops to collect data w_hist = tf$summary$histogram("weights", W) b_hist = tf$summary$histogram("biases", b) crossEntropySummary<-tf$summary$scalar("costFunction", cross_entropy) crossEntropyTstSummary<- tf$summary$scalar("costFunction_test", cross_entropy_tst) Open the writing object, log_writer. It writes the default graph to the location, c:/log: # Create Writer Obj for log log_writer = tf$summary$FileWriter('c:/log', sess$graph) Run the optimization and collect the summaries: for (step in 1:2500) { sess$run(optimizer) # Evaluate performance on training and test data after 50 Iteration if (step %% 50== 0){ ### Performance on Train ypred <- sess$run(tf$nn$sigmoid(tf$matmul(x, W) + b)) roc_obj <- roc(occupancy_train[, yFeatures], as.numeric(ypred)) ### Performance on Test ypredt <- sess$run(tf$nn$sigmoid(tf$matmul(xt, W) + b)) roc_objt <- roc(occupancy_test[, yFeatures], as.numeric(ypredt)) cat("train AUC: ", auc(roc_obj), " Test AUC: ", auc(roc_objt), "n") # Save summary of Bias and weights log_writer$add_summary(sess$run(b_hist), global_step=step) log_writer$add_summary(sess$run(w_hist), global_step=step) log_writer$add_summary(sess$run(crossEntropySummary), global_step=step) log_writer$add_summary(sess$run(crossEntropyTstSummary), global_step=step) } } Collect all the summaries to a single tensor using the merge_all command from the summary module: summary = tf$summary$merge_all() Write the summaries to the log file using the log_writer object: log_writer = tf$summary$FileWriter('c:/log', sess$graph) summary_str = sess$run(summary) log_writer$add_summary(summary_str, step) log_writer$close() Summary In this article, we have learned how to perform logistic regression using TensorFlow also we have covered the application of TensorFlow in setting up a logistic regression model. Resources for Article:   Further resources on this subject: [article] [article] [article]
Read more
  • 0
  • 0
  • 4403

article-image-top-5-machine-learning-movies
Chris Key
17 Oct 2017
3 min read
Save for later

Top 5 Machine Learning Movies

Chris Key
17 Oct 2017
3 min read
Sitting in Mumbai airport at 2am can lead to some truly random conversations. Discussing the plot of Short Circuit 2 led us to thinking about this article. Here's my list of the top 5 movies featuring advanced machine learning. Short Circuit 2 [imdb] "Hey laser-lips, your momma was a snow blower!" A plucky robot who has named himself Johnny 5 returns to the screens to help build toy robots in a big city. By this point he is considered to have actual intelligence rather than artificial intelligence, however the plot of the film centres around his naivety and lack of ability to see the dark motives behind his new buddy, Oscar. We learn that intelligence can be applied anywhere, but sometimes it is the wrong place. Or right if you like stealing car stereos for "Los Locos". The Matrix Revolutions [imdb] The robots learn to balance an equation. Bet you wish you had them in your math high-school class. Also kudos to the Wachowski brothers who learnt from the machines the ability to balance the equation and released this monstrosity to even out the universe in light of the amazing first film in the trilogy. Blade Runner [imdb] “I've seen things you people wouldn't believe.” In the ultimate example of machines (see footnote) learning to emulate humanity, we struggled for 30 years to understand if Deckard was really human or a Nexus (spoilers: he is almost certainly a replicant!). It is interesting to note that when Pris and Roy are teamed up with JF Sebastian, their behaviours, aside from the occasional murder, show them to be more socially aware than their genius inventor friend. Wall-E [imdb] Disney and Pixar made a movie with no dialog for the entire first half, yet it was enthralling to watch. Without saying a single word, we see a small utility robot display a full range of emotions that we can relate to. He also demonstrates other signs of life – his need for energy and rest, and his sense of purpose is divided between his prime directive of cleaning the planet, and his passion for collecting interesting objects. Terminator 2 [imdb] “I know now why you cry, but it is something I can never do” Sarah Connor tells us that “Watching John with the machine, it was suddenly so clear. The terminator, would never stop. It would never leave him, and it would never hurt him, never shout at him, or get drunk and hit him, or say it was too busy to spend time with him. It would always be there. And it would die, to protect him.” Yet John Connor teaches the deadly robot, played by the invincible ex-Governator Arnold Schwarzenegger, how to be normal in society. No Problemo. Gimme five. Hasta La Vista, baby. Footnote - replicants aren't really machines. The replicants are genetic engineered and created by the Tyrell corporation with limited lifespans and specific abilities. For all intents and purposes, they are really organic robots.
Read more
  • 0
  • 0
  • 4394
Modal Close icon
Modal Close icon