Data | Tech News, Tutorials & Expert Insights


23 Oct 2009

5 min read

Creating a Simple Report using BIRT


Setting up a Simple Project

The first thing we want to do when setting up our simple report project is to define what the project is going to be, and what our first simple report will be. Our first report will be a simple dump of the employees who work for Classic Cars. So, the first thing we need to do is set up a project. To do this, we will use the Navigator. Make sure you have the BIRT report perspective open. Use the following steps to create our project:

- Open up the Navigator by single-clicking on the Navigator tab.
- Right-click anywhere in the white-space in the Navigator.
- Select New from the menu, and under New select Project.
- From the Dialog screen, select Business Intelligence and Reporting Tools from the list of folders; expand that view, and select Report Project. Then click on the Next button.
- For the Project name, enter Class_Cars_BIRT_Reports. You can either leave the Use Default Location checkbox checked, or uncheck it and enter a location on your local drive to store this report project.

Now, we have a very simple report project in which to store our BIRT reports, starting with the first that we are about to create.

Creating a Simple Report

Now that we have our first project open, we will look at creating our first report. As mentioned earlier, we will create a basic listing report that will display all the information in the employees table. In order to do this, we will use the following steps:

- Right-click on the Class_Cars_BIRT_Reports project under the Navigator, and choose New and Report.
- Make sure the Class_Cars_BIRT_Reports project is highlighted in the new report Dialog, and enter the name as EmployeeList.rptdesign. I chose this name as it is somewhat descriptive of the purpose of the report, which is to display a list of employees. As a rule of thumb, always try to name your reports after the expected output, such as QuarterlyEarningReport.rptdesign, weeklyPayStub.rptdesign, or accountsPayable.rptdesign.
- On the next screen is a list of different report templates that we can use. We will select Simple Listing and then click on the Finish button.
- Go to the Data Explorer, right-click on Data Sources, and choose New Data Source.
- From the New Data Source Dialog box, select Classic Models Inc. Sample Database and click on the Next button.
- On the next screen, it will inform you of the driver information. You can ignore this for now and click Finish.
- Under the Data Explorer, right-click on Data Sets and choose New Data Set.
- On the next screen, enter the Data Set Name as dsetEmployees, and make sure that our created Data Source is selected in the list of Data sources. You can click Next when this is finished.
- On the Query Dialog, enter the following query and click Finish:
- On the next screen, just click OK. This screen is used to edit information about Data Sets, and we will ignore it for now.
- Now, from the Outline select Data Sets and expand it to show all of the fields.
- Drag the EMPLOYEENUMBER element over to the Report Designer, and drop it on the cell with the label of Detail Row. This will be the second row and the first column.
- You will notice that when you do this, the header row also gets an element placed in it called EMPLOYEENUMBER. This is the Header label. Double-click on this cell and it will become highlighted. We can now edit it. Type in "Employee ID".
- Drag and drop the LASTNAME, FIRSTNAME, and JOBTITLE to the detail cells to the right of the EMPLOYEENUMBER cell.
- Now, we want to put the header row in bold. Under the Outline, select the Row element located under Body/Table/Header. This will change the Property Editor. Click on Font, and then click on the Bold button.

That's it! We have created our first basic report. To see what this report looks like, under the Report Designer pane, click on the Preview tab. This will allow you to get a good idea of what this report will look like.
Alternatively, you can run the report and see what it will look like in the BIRT Report Viewer application by going to File/View Report/View Report in Web Viewer. This option is also available by right-clicking on the report design file under the Navigator, and choosing Report followed by Run. Although it may be a simple report, this exercise demonstrated how a report developer can get through the BIRT environment, and how the different elements of the BIRT perspective work together.

Summary

For a very simple report design, we utilized all of the major areas of the BIRT perspective. We used the Navigator to create a new report project and a new report design, the Data Explorer to create our data connection and Data Set, dragged elements from the Outline to the Report Designer to get the data elements into the right place, and used the Property Editor and Outline cooperatively to bold the text in the table header.


Packt

04 Apr 2016

20 min read

Morphology – Getting Our Feet Wet


In this article by Deepti Chopra, Nisheeth Joshi, and Iti Mathur, authors of the book Mastering Natural Language Processing with Python, we look at morphology, which may be defined as the study of the composition of words using morphemes. A morpheme is the smallest unit of the language that has a meaning. In this article, we will discuss stemming and lemmatizing, creating a stemmer and lemmatizer for non-English languages, developing a morphological analyzer and morphological generator using machine learning tools, creating a search engine, and many other concepts. In brief, this article will include the following topics:

- Introducing morphology
- Creating a stemmer and lemmatizer
- Developing a stemmer for non-English languages
- Creating a morphological analyzer
- Creating a morphological generator
- Creating a search engine

Introducing morphology

Morphology may be defined as the study of the production of tokens with the help of morphemes. A morpheme is the basic unit of language, which carries a meaning. There are two types of morphemes: stems and affixes (suffixes, prefixes, infixes, and circumfixes). Stems are also referred to as free morphemes since they can exist without any affixes. Affixes are referred to as bound morphemes since they cannot exist in a free form, and they always exist along with free morphemes. Consider the word "unbelievable". Here, "believe" is a stem or free morpheme. It can exist on its own. The morphemes "un" and "able" are affixes or bound morphemes. They cannot exist in a free form but exist together with a stem.

There are three kinds of languages, namely isolating languages, agglutinative languages, and inflecting languages. Morphology has different meanings in all these languages. Isolating languages are those languages in which words are merely free morphemes, and they do not carry any tense (past, present, and future) or number (singular or plural) information.
Mandarin Chinese is an example of an isolating language. Agglutinative languages are those languages in which small words combine together to convey compound information. Turkish is an example of an agglutinative language. Inflecting languages are languages in which words are broken down into simpler units, but all these simpler units exhibit different meanings. Latin is an example of an inflecting language.

There are morphological processes such as inflections, derivations, semi-affixes, combining forms, and cliticization. An inflection refers to transforming a word into a form that represents a person, number, tense, gender, case, aspect, or mood. Here, the syntactic category of the token remains the same. In derivation, the syntactic category of the word is also changed. Semi-affixes are bound morphemes that exhibit a word-like quality, for example, noteworthy, antisocial, anticlockwise, and so on.

Understanding stemmers

Stemming may be defined as the process of obtaining a stem from a word by eliminating the affixes from it. For example, for the word "raining", a stemmer would return the root word or stem "rain" by removing the affix "ing". In order to increase the accuracy of information retrieval, search engines mostly use stemming to get a stem and store it as an index word. Search engines treat words with the same stem as synonyms, which is a kind of query expansion known as conflation. Martin Porter designed a well-known stemming algorithm known as the Porter Stemming Algorithm. This algorithm is designed to replace and eliminate some well-known suffixes present in English words. To perform stemming in NLTK, we can simply instantiate the PorterStemmer class and then call its stem() method.
Let's take a look at the code for stemming using the PorterStemmer class in NLTK:

>>> import nltk
>>> from nltk.stem import PorterStemmer
>>> stemmerporter = PorterStemmer()
>>> stemmerporter.stem('working')
'work'
>>> stemmerporter.stem('happiness')
'happi'

The PorterStemmer class applies a fixed set of rules that cover many stems and word forms in the English language. The process of stemming takes place in a series of steps and transforms a word into a shorter word, or into a word that has a similar meaning to the root word. The StemmerI interface defines the stem() method, and all stemmers inherit from this interface. The inheritance diagram is depicted here:

[Figure: inheritance diagram of the NLTK stemmer classes]

Another stemming algorithm, known as the Lancaster Stemming algorithm, was introduced at Lancaster University. Similar to the PorterStemmer class, the LancasterStemmer class is used in NLTK to implement Lancaster stemming. Let's consider the following code, which depicts Lancaster stemming in NLTK:

>>> import nltk
>>> from nltk.stem import LancasterStemmer
>>> stemmerlan=LancasterStemmer()
>>> stemmerlan.stem('working')
'work'
>>> stemmerlan.stem('happiness')
'happy'

We can also build our own stemmer in NLTK using RegexpStemmer. It accepts a regular expression and removes the part of a word (typically a prefix or suffix) that matches it. Let's consider an example of stemming using RegexpStemmer in NLTK:

>>> import nltk
>>> from nltk.stem import RegexpStemmer
>>> stemmerregexp=RegexpStemmer('ing')
>>> stemmerregexp.stem('working')
'work'
>>> stemmerregexp.stem('happiness')
'happiness'
>>> stemmerregexp.stem('pairing')
'pair'

We can use RegexpStemmer in cases where stemming cannot be performed using PorterStemmer and LancasterStemmer.

The SnowballStemmer class is used to perform stemming in 13 languages other than English. In order to perform stemming using SnowballStemmer, an instance is first created for the language in which stemming needs to be performed, and then stemming is performed using the stem() method.
Consider the following example to perform stemming in Spanish and French in NLTK using SnowballStemmer:

>>> import nltk
>>> from nltk.stem import SnowballStemmer
>>> SnowballStemmer.languages
('danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')
>>> spanishstemmer=SnowballStemmer('spanish')
>>> spanishstemmer.stem('comiendo')
'com'
>>> frenchstemmer=SnowballStemmer('french')
>>> frenchstemmer.stem('manger')
'mang'

nltk.stem.api contains the StemmerI class, which declares the stem() method. Consider the following code present in NLTK, which enables stemming to be performed:

class StemmerI(object):
    """
    It is an interface that helps to eliminate morphological affixes
    from the tokens and the process is known as stemming.
    """
    def stem(self, token):
        """
        Eliminate affixes from token and stem is returned.
        """
        raise NotImplementedError()

Here's the code used to perform stemming using multiple stemmers:

>>> import nltk
>>> from nltk.stem.porter import PorterStemmer
>>> from nltk.stem.lancaster import LancasterStemmer
>>> from nltk.stem import SnowballStemmer
>>> def obtain_tokens():
        # Read the sample file and split it into tokens
        with open('/home/p/NLTK/sample1.txt') as stem:
            tok = nltk.word_tokenize(stem.read())
        return tok
>>> def stemming(filtered):
        stem=[]
        for x in filtered:
            stem.append(PorterStemmer().stem(x))
        return stem
>>> if __name__=="__main__":
        tok= obtain_tokens()
        print("tokens is %s" % (tok))
        stem_tokens= stemming(tok)
        print("After stemming is %s" % stem_tokens)
        # Map each original token to its stem
        res=dict(zip(tok,stem_tokens))
        print("{tok:stemmed}=%s" % (res))

Understanding lemmatization

Lemmatization is the process in which we transform a word into its dictionary form, known as the lemma. Unlike a stem, a lemma is always a valid word, and the lemma that is found can depend on the word category (part of speech) of the word.
Consider an example of lemmatization in NLTK:

>>> import nltk
>>> from nltk.stem import WordNetLemmatizer
>>> lemmatizer_output=WordNetLemmatizer()
>>> lemmatizer_output.lemmatize('working')
'working'
>>> lemmatizer_output.lemmatize('working',pos='v')
'work'
>>> lemmatizer_output.lemmatize('works')
'work'

WordNetLemmatizer may be defined as a wrapper around the WordNet corpus, and it makes use of the morphy() function present in WordNetCorpusReader to extract a lemma. If no lemma is extracted, then the word is returned in its original form. For example, for 'works', the lemma that is returned is the singular form 'work'.

This code snippet illustrates the difference between stemming and lemmatization:

>>> import nltk
>>> from nltk.stem import PorterStemmer
>>> stemmer_output=PorterStemmer()
>>> stemmer_output.stem('happiness')
'happi'
>>> from nltk.stem import WordNetLemmatizer
>>> lemmatizer_output=WordNetLemmatizer()
>>> lemmatizer_output.lemmatize('happiness')
'happiness'

In the preceding code, 'happiness' is converted to 'happi' by stemming it. Lemmatization can't find a different root word for 'happiness', so it returns the word 'happiness'.

Developing a stemmer for non-English languages

Polyglot is a software package that provides models called morfessor models, which are used to obtain morphemes from tokens. The Morpho project's goal is to create unsupervised data-driven processes. It focuses on the creation of morphemes, which are the smallest units of syntax. Morphemes play an important role in natural language processing. They are useful in the automatic recognition and creation of language. With the help of the vocabulary dictionaries of polyglot, morfessor models were trained on 50,000 tokens of different languages.

Here's the code to obtain a language table using polyglot:

from polyglot.downloader import downloader
print(downloader.supported_languages_table("morph2"))

The output obtained from the preceding code is in the form of these languages listed as follows: 1.
Piedmontese language 2. Lombard language 3. Gan Chinese 4. Sicilian 5. Scots 6. Kirghiz, Kyrgyz 7. Pashto, Pushto 8. Kurdish 9. Portuguese 10. Kannada 11. Korean 12. Khmer 13. Kazakh 14. Ilokano 15. Polish 16. Panjabi, Punjabi 17. Georgian 18. Chuvash 19. Alemannic 20. Czech 21. Welsh 22. Chechen 23. Catalan; Valencian 24. Northern Sami 25. Sanskrit (Saṁskṛta) 26. Slovene 27. Javanese 28. Slovak 29. Bosnian-Croatian-Serbian 30. Bavarian 31. Swedish 32. Swahili 33. Sundanese 34. Serbian 35. Albanian 36. Japanese 37. Western Frisian 38. French 39. Finnish 40. Upper Sorbian 41. Faroese 42. Persian 43. Sinhala, Sinhalese 44. Italian 45. Amharic 46. Aragonese 47. Volapük 48. Icelandic 49. Sakha 50. Afrikaans 51. Indonesian 52. Interlingua 53. Azerbaijani 54. Ido 55. Arabic 56. Assamese 57. Yoruba 58. Yiddish 59. Waray-Waray 60. Croatian 61. Hungarian 62. Haitian; Haitian Creole 63. Quechua 64. Armenian 65. Hebrew (modern) 66. Silesian 67. Hindi 68. Divehi; Dhivehi; Mald... 69. German 70. Danish 71. Occitan 72. Tagalog 73. Turkmen 74. Thai 75. Tajik 76. Greek, Modern 77. Telugu 78. Tamil 79. Oriya 80. Ossetian, Ossetic 81. Tatar 82. Turkish 83. Kapampangan 84. Venetian 85. Manx 86. Gujarati 87. Galician 88. Irish 89. Scottish Gaelic; Gaelic 90. Nepali 91. Cebuano 92. Zazaki 93. Walloon 94. Dutch 95. Norwegian 96. Norwegian Nynorsk 97. West Flemish 98. Chinese 99. Bosnian 100. Breton 101. Belarusian 102. Bulgarian 103. Bashkir 104. Egyptian Arabic 105. Tibetan Standard, Tib... 106. Bengali 107. Burmese 108. Romansh 109. Marathi (Marāthī) 110. Malay 111. Maltese 112. Russian 113. Macedonian 114. Malayalam 115. Mongolian 116. Malagasy 117. Vietnamese 118. Spanish; Castilian 119. Estonian 120. Basque 121. Bishnupriya Manipuri 122. Asturian 123. English 124. Esperanto 125. Luxembourgish, Letzeb... 126. Latin 127. Uighur, Uyghur 128. Ukrainian 129. Limburgish, Limburgan... 130. Latvian 131. Urdu 132. Lithuanian 133. Fiji Hindi 134. Uzbek 135. Romanian, Moldavian, ... 
The necessary models can be downloaded using the following code:

%%bash
polyglot download morph2.en morph2.ar
[polyglot_data] Downloading package morph2.en to
[polyglot_data]     /home/rmyeid/polyglot_data...
[polyglot_data]   Package morph2.en is already up-to-date!
[polyglot_data] Downloading package morph2.ar to
[polyglot_data]     /home/rmyeid/polyglot_data...
[polyglot_data]   Package morph2.ar is already up-to-date!

Consider this example that obtains output from polyglot:

from polyglot.text import Text, Word
tokens =["unconditional" ,"precooked", "impossible", "painful", "entered"]
for s in tokens:
    s=Word(s, language="en")
    print("{:<20}{}".format(s,s.morphemes))

unconditional       ['un','conditional']
precooked           ['pre','cook','ed']
impossible          ['im','possible']
painful             ['pain','ful']
entered             ['enter','ed']

If tokenization is not performed properly, then we can use morphological analysis to split the text into its original constituents:

sent="Ihopeyoufindthebookinteresting"
para=Text(sent)
para.language="en"
para.morphemes
WordList(['I','hope','you','find','the','book','interesting'])

Morphological analyzers

Morphological analysis may be defined as the process of obtaining grammatical information about a token given its suffix information. Morphological analysis can be performed in three ways: Morpheme-based morphology (or the item and arrangement approach), Lexeme-based morphology (or the item and process approach), and Word-based morphology (or the word and paradigm approach). A morphological analyzer may be defined as a program that is responsible for the analysis of the morphology of a given input token. It analyzes a given token and generates morphological information, such as gender, number, class, and so on, as an output.

In order to perform morphological analysis on a given non-whitespace token, the pyEnchant dictionary is used.
Consider the following code that performs morphological analysis:

>>> import enchant
>>> s = enchant.Dict("en_US")
>>> tok=[]
>>> def tokenize(st1):
        if not st1:
            return
        # Try the longest prefix first and stop at length 1,
        # because checking an empty string is not meaningful
        for j in range(len(st1),0,-1):
            if s.check(st1[0:j]):
                tok.append(st1[0:j])
                st1=st1[j:]
                tokenize(st1)
                break
>>> tokenize("itismyfavouritebook")
>>> tok
['it', 'is', 'my','favourite','book']
>>> tok=[ ]
>>> tokenize("ihopeyoufindthebookinteresting")
>>> tok
['i','hope','you','find','the','book','interesting']

We can determine the category of a word as follows:

- Morphological hints: Suffix information helps us to detect the category of a word. For example, the -ness and -ment suffixes occur with nouns.
- Syntactic hints: Contextual information is conducive in determining the category of a word. For example, if we have found a word that has a noun category, then syntactic hints will be useful in determining whether an adjective will appear before the noun or after the noun category.
- Semantic hints: A semantic hint is also useful in determining the category of a word. For example, if we already know that a word represents the name of a location, then it will fall under the noun category.
- Open class: This refers to the class of words that is not fixed; its size keeps increasing whenever a new word is added to the list. Words in an open class are usually nouns. Prepositions are mostly a closed class.
- Morphology captured by the part of speech tagset: The part of speech tagset captures information that helps us to perform morphology. For example, the word 'plays' would appear with the third person and singular noun.

Omorfi (the open morphology of Finnish) is a package that is licensed under version 3 of the GNU GPL. It is used for performing numerous tasks such as language modeling, morphological analysis, rule-based machine translation, information retrieval, statistical machine translation, morphological segmentation, ontologies, and spell checking and correction.
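The morphological hints listed above can be illustrated with a few lines of plain Python. This is only a minimal sketch, not a real tagger: the function name guess_category and the SUFFIX_HINTS table are hypothetical, and the table holds only the two noun suffixes mentioned in the text (-ness and -ment).

```python
# Hypothetical suffix table: only the two noun suffixes named in the text.
SUFFIX_HINTS = {
    "ness": "noun",
    "ment": "noun",
}

def guess_category(word):
    """Return a category guess based on the word's suffix, or None if no hint applies."""
    lowered = word.lower()
    for suffix, category in SUFFIX_HINTS.items():
        # Require at least one character before the suffix so that the
        # suffix on its own (for example "ment") is not treated as a match.
        if lowered.endswith(suffix) and len(lowered) > len(suffix):
            return category
    return None

for w in ["happiness", "movement", "run"]:
    print(w, guess_category(w))
```

A suffix table like this only gives a first guess; the syntactic and semantic hints described above would be needed to confirm or correct it.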
A morphological generator

A morphological generator is a program that performs the task of morphological generation. Morphological generation may be considered the opposite of morphological analysis. Here, given the description of a word in terms of its number, category, stem, and so on, the original word is retrieved. For example, if root = go, Part of Speech = verb, tense = present, and it occurs along with a third person and singular subject, then the morphological generator would generate its surface form, that is, goes.

There are many Python-based software packages that perform morphological analysis and generation. Some of them are as follows:

- ParaMorfo: This is used to perform the morphological generation and analysis of Spanish and Guarani nouns, adjectives, and verbs
- HornMorpho: This is used for the morphological generation and analysis of Oromo and Amharic nouns and verbs as well as Tigrinya verbs
- AntiMorfo: This is used for the morphological generation and analysis of Quechua adjectives, verbs, and nouns as well as Spanish verbs
- MorfoMelayu: This is used for the morphological analysis of Malay words

Other examples of software used to perform morphological analysis and generation are as follows:

- Morph is a morphological generator and analyzer for the English language and the RASP system
- Morphy is a morphological generator, analyzer, and POS tagger for German
- Morphisto is a morphological generator and analyzer for German
- Morfette performs supervised learning (inflectional morphology) for Spanish and French

Search engines

PyStemmer 1.0.1 consists of Snowball stemming algorithms that are conducive for performing information retrieval tasks and the construction of a search engine. It consists of the Porter stemming algorithm and many other stemming algorithms that are useful for performing stemming and information retrieval tasks in many languages, including many European languages.
We can construct a vector space search engine by converting the texts into vectors. Here are the steps needed to construct a vector space search engine:

Stemming and elimination of stop words

A stemmer is a program that accepts words and converts them into stems. Tokens that have the same stem have almost the same meaning. Stop words are also eliminated from the text. Consider the following code for the removal of stop words and tokenization:

def eliminatestopwords(self,list):
    """
    Eliminate words which occur often and have not much significance from context point of view.
    """
    return [ word for word in list if word not in self.stopwords ]

def tokenize(self,string):
    """
    Perform the task of splitting text into stop words and tokens
    """
    string=self.clean(string)
    words=string.split(" ")
    return [self.stemmer.stem(word,0,len(word)-1) for word in words]

Mapping keywords into vector dimensions

Here's the code required to perform the mapping of keywords into vector dimensions:

def obtainvectorkeywordindex(self, documentList):
    """
    In the document vectors, generate the keyword for the given position of element
    """
    #Perform mapping of text into strings
    vocabstring = " ".join(documentList)
    vocablist = self.parser.tokenize(vocabstring)
    #Eliminate common words that have no search significance
    vocablist = self.parser.eliminatestopwords(vocablist)
    uniqueVocablist = util.removeDuplicates(vocablist)
    vectorIndex={}
    offset=0
    #Attach a position to keywords that performs mapping with dimension that is used to depict this token
    for word in uniqueVocablist:
        vectorIndex[word]=offset
        offset+=1
    return vectorIndex #(keyword:position)

Mapping of text strings to vectors

Here, a simple term count model is used.
The code to convert text strings into vectors is as follows:

def constructVector(self, wordString):
    # Initialise the vector with 0's
    vector = [0] * len(self.vectorKeywordIndex)
    tokList = self.parser.tokenize(wordString)
    tokList = self.parser.eliminatestopwords(tokList)
    for word in tokList:
        vector[self.vectorKeywordIndex[word]] += 1  # simple Term Count Model is used
    return vector

Searching similar documents

By finding the cosine of the angle between the vectors of two documents, we can determine whether the documents are similar or not. If the cosine value is 1, then the angle is 0 degrees and the vectors are said to be parallel (this means that the documents are related). If the cosine value is 0, then the angle is 90 degrees and the vectors are said to be perpendicular (this means that the documents are not related). This is the code to compute the cosine between the text vectors using NumPy and SciPy:

from numpy import dot
from scipy.linalg import norm

def cosine(vec1, vec2):
    """
    cosine = ( X * Y ) / ||X|| x ||Y||
    """
    return float(dot(vec1,vec2) / (norm(vec1) * norm(vec2)))

Search keywords

We perform the mapping of keywords to a vector space. We construct a temporary text that represents the items to be searched and then compare it with the document vectors with the help of a cosine measurement.
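The term-count, cosine, and keyword-search steps described above can be tried end to end without the search-engine class or SciPy. The following is a minimal, dependency-free sketch under stated assumptions: the helper term_count_vector, the three-word keyword_index, and the sample documents are hypothetical, and stemming and stop-word removal are left out to keep it short.

```python
import math

def term_count_vector(text, keyword_index):
    """Count how often each indexed keyword occurs in text (simple term count model)."""
    vector = [0] * len(keyword_index)
    for word in text.lower().split():
        if word in keyword_index:
            vector[keyword_index[word]] += 1
    return vector

def cosine(vec1, vec2):
    """cosine = (X . Y) / (||X|| * ||Y||); returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

# Hypothetical keyword index (keyword -> vector position) and documents.
keyword_index = {"cat": 0, "dog": 1, "fish": 2}
documents = ["cat cat dog", "cat dog dog", "fish fish"]
doc_vectors = [term_count_vector(d, keyword_index) for d in documents]

# The query is treated as a temporary text and compared with every document.
query_vector = term_count_vector("cat", keyword_index)
ratings = [cosine(query_vector, v) for v in doc_vectors]

# Print the documents from most to least similar to the query.
for score, doc in sorted(zip(ratings, documents), reverse=True):
    print(round(score, 3), doc)
```

Pairing each score with its document before sorting keeps track of which document a rating belongs to, which a bare sorted list of ratings would lose.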
The following code searches the vector space:

def searching(self,searchinglist):
    """
    search for text that are matched on the basis of list of items
    """
    askVector = self.buildQueryVector(searchinglist)
    ratings = [util.cosine(askVector, textVector) for textVector in self.documentVectors]
    ratings.sort(reverse=True)
    return ratings

The following code can be used to detect languages from a source text:

import nltk
import sys
try:
    from nltk import wordpunct_tokenize
    from nltk.corpus import stopwords
except ImportError:
    print('Error has occurred')

#----------------------------------------------------------------------
def _calculate_languages_ratios(text):
    """
    Compute probability of given document that can be written in different languages
    and give a dictionary that appears like {'german': 2, 'french': 4, 'english': 1}
    """
    languages_ratios = {}
    # nltk.wordpunct_tokenize() splits all punctuations into separate tokens
    # wordpunct_tokenize("I hope you like the book interesting .")
    # ['I', 'hope', 'you', 'like', 'the', 'book', 'interesting', '.']
    tok = wordpunct_tokenize(text)
    words = [word.lower() for word in tok]
    # Compute occurrence of unique stopwords in a text
    for language in stopwords.fileids():
        stopwords_set = set(stopwords.words(language))
        words_set = set(words)
        common_elements = words_set.intersection(stopwords_set)
        languages_ratios[language] = len(common_elements)  # language "score"
    return languages_ratios

#----------------------------------------------------------------
def detect_language(text):
    """
    Compute the probability of given text that is written in different languages
    and obtain the one that is highest scored. It makes use of stopwords calculation
    approach, finds out unique stopwords present in a analyzed text.
    """
    ratios = _calculate_languages_ratios(text)
    most_rated_language = max(ratios, key=ratios.get)
    return most_rated_language

if __name__=='__main__':
    text = ''' All over this cosmos, most of the people believe that there is an invisible supreme power that is the creator and the runner of this world. Human being is supposed to be the most intelligent and loved creation by that power and that is being searched by human beings in different ways into different things. As a result people reveal His assumed form as per their own perceptions and beliefs. It has given birth to different religions and people are divided on the name of religion viz. Hindu, Muslim, Sikhs, Christian etc. People do not stop at this. They debate the superiority of one over the other and fight to establish their views. Shrewd people like politicians oppose and support them at their own convenience to divide them and control them. It has intensified to the extent that even parents of a new born baby teach it about religious differences and recommend their own religion superior to that of others and let the child learn to hate other people just because of religion. Jonathan Swift, an eighteenth century novelist, observes that we have just enough religion to make us hate, but not enough to make us love one another. The word 'religion' does not have a derogatory meaning - A literal meaning of religion is 'A personal or institutionalized system grounded in belief in a God or Gods and the activities connected with this'. At its basic level, 'religion is just a set of teachings that tells people how to lead a good life'. It has never been the purpose of religion to divide people into groups of isolated followers that cannot live in harmony together. No religion claims to teach intolerance or even instructs its believers to segregate a certain religious group or even take the fundamental rights of an individual solely based on their religious choices.
It is also said that 'Majhab nhi sikhata aaps mai bair krna'. But this very majhab or religion takes a very heinous form when it is misused by the shrewd politicians and the fanatics e.g. in Ayodhya on 6th December, 1992 some right wing political parties and communal organizations incited the Hindus to demolish the 16th century Babri Masjid in the name of religion to polarize Hindus votes. Muslim fanatics in Bangladesh retaliated and destroyed a number of temples, assassinated innocent Hindus and raped Hindu girls who had nothing to do with the demolition of Babri Masjid. This very inhuman act has been presented by Taslima Nasrin, a Banglsdeshi Doctor-cum-Writer in her controversial novel 'Lajja' (1993) in which, she seems to utilizes fiction's mass emotional appeal, rather than its potential for nuance and universality. '''
    language = detect_language(text)
    print(language)

The preceding code searches for stop words and detects the language of the text, which is English.

Summary

In this article, we discussed stemming, lemmatization, and morphological analysis and generation.



Packt

21 Aug 2012

8 min read

HBase Administration, Performance Tuning


Setting up Hadoop to spread disk I/O

Modern servers usually have multiple disk devices to provide large storage capacities. These disks are usually configured as RAID arrays in their factory settings. This is good for many cases but not for Hadoop. The Hadoop slave node stores HDFS data blocks and MapReduce temporary files on its local disks. These local disk operations benefit from using multiple independent disks to spread disk I/O. In this recipe, we will describe how to set up Hadoop to use multiple disks to spread its disk I/O.

Getting ready

We assume you have multiple disks for each DataNode node. These disks are in a JBOD (Just a Bunch Of Disks) or RAID0 configuration. Assume that the disks are mounted at /mnt/d0, /mnt/d1, …, /mnt/dn, and the user who starts HDFS has write permission on each mount point.

How to do it...

In order to set up Hadoop to spread disk I/O, follow these instructions:

1. On each DataNode node, create directories on each disk for HDFS to store its data blocks:

hadoop$ mkdir -p /mnt/d0/dfs/data
hadoop$ mkdir -p /mnt/d1/dfs/data
…
hadoop$ mkdir -p /mnt/dn/dfs/data

2. Add the following code to the HDFS configuration file (hdfs-site.xml):

hadoop@master1$ vi $HADOOP_HOME/conf/hdfs-site.xml
<property>
  <name>dfs.data.dir</name>
  <value>/mnt/d0/dfs/data,/mnt/d1/dfs/data,...,/mnt/dn/dfs/data</value>
</property>

3. Sync the modified hdfs-site.xml file across the cluster:

hadoop@master1$ for slave in `cat $HADOOP_HOME/conf/slaves`
do
  rsync -avz $HADOOP_HOME/conf/ $slave:$HADOOP_HOME/conf/
done

4. Restart HDFS:

hadoop@master1$ $HADOOP_HOME/bin/stop-dfs.sh
hadoop@master1$ $HADOOP_HOME/bin/start-dfs.sh

How it works...

We recommend JBOD or RAID0 for the DataNode disks, because you don't need the redundancy of RAID, as HDFS ensures its data redundancy using replication between nodes. So, there is no data loss when a single disk fails.

Which one to choose, JBOD or RAID0?
You will theoretically get better performance from a JBOD configuration than from a RAID configuration. This is because, in a RAID configuration, every write has to wait for the slowest disk in the array to complete, which makes the average I/O time equivalent to the slowest disk's I/O time. In a JBOD configuration, operations on a faster disk complete independently of the slower ones, which makes the average I/O time faster than that of the slowest disk. However, enterprise-class RAID cards might make a big difference. You might want to benchmark your JBOD and RAID0 configurations before deciding which one to go with.

For both JBOD and RAID0 configurations, you will have the disks mounted at different paths. The key point here is to set the dfs.data.dir property to all the directories created on each disk. The dfs.data.dir property specifies where the DataNode should store its local blocks. When it is set to multiple comma-separated directories, the DataNode stores its blocks across all the disks in round-robin fashion. This causes Hadoop to spread disk I/O efficiently across all the disks.

Warning: Do not leave blanks between the directory paths in the dfs.data.dir property value, or it won't work as expected.

You will need to sync the changes across the cluster and restart HDFS to apply them.

There's more...
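As a follow-up to the warning above, a quick check before syncing can catch a stray blank in the directory list. The following is a minimal sketch and not part of the original recipe: check_dir_list is a hypothetical helper name, and the sample paths simply mirror the /mnt/dN/dfs/data layout assumed earlier. It rejects any value that contains a space and otherwise prints one directory per line.

```shell
#!/bin/sh
# Hypothetical helper (not from the recipe): validate a comma-separated
# directory list such as the value intended for dfs.data.dir.
check_dir_list() (
  value=$1
  # Reject any value that contains a space character.
  case $value in
    *" "*) echo "invalid: value contains a blank" >&2; exit 1 ;;
  esac
  IFS=','   # split on commas only
  set -f    # turn off globbing while the value is expanded unquoted
  for dir in $value; do
    printf 'dir: %s\n' "$dir"
  done
)

# A well-formed value prints one directory per line.
check_dir_list "/mnt/d0/dfs/data,/mnt/d1/dfs/data"

# A value with a blank after the comma is rejected with a non-zero status.
check_dir_list "/mnt/d0/dfs/data, /mnt/d1/dfs/data" || echo "rejected"
```

The function body runs in a subshell, so changing IFS and disabling globbing does not leak into the calling shell.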
If you run MapReduce, you might also like to set it up to spread its disk I/O, because MapReduce stores its temporary files on the TaskTracker's local file system:

1. On each TaskTracker node, create directories on each disk for MapReduce to store its intermediate data files:

    hadoop$ mkdir -p /mnt/d0/mapred/local
    hadoop$ mkdir -p /mnt/d1/mapred/local
    …
    hadoop$ mkdir -p /mnt/dn/mapred/local

2. Add the following to MapReduce's configuration file (mapred-site.xml):

    hadoop@master1$ vi $HADOOP_HOME/conf/mapred-site.xml

    <property>
      <name>mapred.local.dir</name>
      <value>/mnt/d0/mapred/local,/mnt/d1/mapred/local,...,/mnt/dn/mapred/local</value>
    </property>

3. Sync the modified mapred-site.xml file across the cluster and restart MapReduce.

MapReduce generates a lot of temporary files on TaskTrackers' local disks during its execution. As with HDFS, setting up multiple directories on different disks helps spread MapReduce disk I/O significantly.

Using network topology script to make Hadoop rack-aware

Hadoop has the concept of "rack awareness". Administrators are able to define the rack of each DataNode in the cluster. Making Hadoop rack-aware is extremely important because:

- Rack awareness prevents data loss
- Rack awareness improves network performance

In this recipe, we will describe how to make Hadoop rack-aware and why it is important.

Getting ready

You will need to know the rack to which each of your slave nodes belongs. Log in to the master node as the user who started Hadoop.

How to do it...

The following steps describe how to make Hadoop rack-aware:

1. Create a topology.sh script and store it under the Hadoop configuration directory.
Change the path for topology.data, on the line that starts with exec<, to fit your environment:

    hadoop@master1$ vi $HADOOP_HOME/conf/topology.sh

    #!/bin/bash
    # Map each node given as an argument to its rack name.
    while [ $# -gt 0 ] ; do
      nodeArg=$1
      exec< /usr/local/hadoop/current/conf/topology.data
      result=""
      while read line ; do
        ar=( $line )
        if [ "${ar[0]}" = "$nodeArg" ] ; then
          result="${ar[1]}"
        fi
      done
      shift
      if [ -z "$result" ] ; then
        echo -n "/default/rack "
      else
        echo -n "$result "
      fi
    done

Don't forget to set the execute permission on the script file:

    hadoop@master1$ chmod +x $HADOOP_HOME/conf/topology.sh

2. Create a topology.data file, as shown in the following snippet; change the IP addresses and racks to fit your environment:

    hadoop@master1$ vi $HADOOP_HOME/conf/topology.data

    10.161.30.108 /dc1/rack1
    10.166.221.198 /dc1/rack2
    10.160.19.149 /dc1/rack3

3. Add the following to the Hadoop core configuration file (core-site.xml):

    hadoop@master1$ vi $HADOOP_HOME/conf/core-site.xml

    <property>
      <name>topology.script.file.name</name>
      <value>/usr/local/hadoop/current/conf/topology.sh</value>
    </property>

4. Sync the modified files across the cluster and restart HDFS and MapReduce.

5. Make sure HDFS is now rack-aware. If everything works well, you should be able to find something like the following snippet in your NameNode log file:

    2012-03-10 13:43:17,284 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /dc1/rack3/10.160.19.149:50010
    2012-03-10 13:43:17,297 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /dc1/rack1/10.161.30.108:50010
    2012-03-10 13:43:17,429 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /dc1/rack2/10.166.221.198:50010

6. Make sure MapReduce is now rack-aware.
If everything works well, you should be able to find something like the following snippet in your JobTracker log file:

    2012-03-10 13:50:38,341 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /dc1/rack3/ip-10-160-19-149.us-west-1.compute.internal
    2012-03-10 13:50:38,485 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /dc1/rack1/ip-10-161-30-108.us-west-1.compute.internal
    2012-03-10 13:50:38,569 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /dc1/rack2/ip-10-166-221-198.us-west-1.compute.internal

How it works...

The following diagram shows the concept of Hadoop rack awareness:

Each block of the HDFS files will be replicated to multiple DataNodes, to prevent loss of all the data copies due to failure of one machine. However, if all copies of data happen to be replicated on DataNodes in the same rack, and that rack fails, all the data copies will be lost. To avoid this, the NameNode needs to know the network topology so that it can place replicas intelligently. As shown in the previous diagram, with the default replication factor of three, two data copies will be placed on machines in the same rack, and another one will be put on a machine in a different rack. This ensures that a single rack failure won't result in the loss of all data copies.

Normally, two machines in the same rack have more bandwidth and lower latency between them than two machines in different racks. With the network topology information, Hadoop is able to maximize network performance by reading data from the proper DataNodes. If data is available on the local machine, Hadoop will read data from it. If not, Hadoop will try reading data from a machine in the same rack, and if it is available on neither, data will be read from machines in different racks.

In step 1, we create a topology.sh script. The script takes DNS names as arguments and returns network topology (rack) names as the output.
The mapping of DNS names to network topology is provided by the topology.data file, which was created in step 2. If an entry is not found in the topology.data file, the script returns /default/rack as a default rack name.

Note that we use IP addresses, and not hostnames, in the topology.data file. There is a known bug where Hadoop does not correctly process hostnames that start with the letters "a" to "f". Check HADOOP-6682 for more details.

In step 3, we set the topology.script.file.name property in core-site.xml, telling Hadoop to invoke topology.sh to resolve DNS names to network topology names.

After restarting Hadoop, as shown in the logs of steps 5 and 6, HDFS and MapReduce add the correct rack name as a prefix to the DNS name of slave nodes. This indicates that HDFS and MapReduce rack awareness works well with the aforementioned settings.
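The lookup that topology.sh performs can be tried out without touching a cluster. The sketch below is an illustrative variant, not the recipe's script: it uses awk in place of the read loop, takes the data file as its first argument, and the function name resolve_rack and the temporary file are assumptions made for the demo. Like the recipe's script, it falls back to /default/rack for nodes that are not listed.

```shell
#!/bin/sh
# Hypothetical stand-in for topology.sh: resolve_rack DATA_FILE NODE...
# prints one rack name per node, separated by spaces.
resolve_rack() {
  data_file=$1
  shift
  for node in "$@"; do
    # Look up the rack in the second column; the last matching row wins.
    rack=$(awk -v n="$node" '$1 == n { r = $2 } END { print r }' "$data_file")
    # Fall back to the default rack when the node is not listed.
    printf '%s ' "${rack:-/default/rack}"
  done
  printf '\n'
}

# Demo with a throwaway topology.data copy (addresses taken from the recipe).
demo_data=$(mktemp)
cat > "$demo_data" <<'EOF'
10.161.30.108 /dc1/rack1
10.166.221.198 /dc1/rack2
10.160.19.149 /dc1/rack3
EOF

resolve_rack "$demo_data" 10.161.30.108 10.0.0.1
# prints: /dc1/rack1 /default/rack

rm -f "$demo_data"
```

Running a lookup like this against your real topology.data before restarting Hadoop is a cheap way to spot nodes that would silently land in the default rack.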


Packt

22 Feb 2016

11 min read

Customizing heat maps (Intermediate)


This article will help you explore more advanced functions to customize the layout of heat maps. The main focus lies on the usage of different color palettes, but we will also cover other useful features, such as cell notes.

To ensure that our heat maps look good in any situation, we will make use of different color palettes in this recipe, and we will even learn how to create our own. Further, we will add some more extras to our heat maps, including visual aids such as cell note labels, which will make them even more useful and accessible as a tool for visual data analysis.

The following image shows a heat map with cell notes and an alternative color palette created from the arabidopsis_genes.csv data set:

Getting ready

Download the 5644OS_03_01.r script and the arabidopsis_genes.csv data set from your account at http://www.packtpub.com and save them to your hard drive.

I recommend that you save the script and data file to the same folder on your hard drive. If you execute the script from a different location to the data file, you will have to change the current R working directory accordingly.

The script will check automatically if any additional packages need to be installed in R.

How to do it...

Execute the following code in R via the 5644OS_03_01.r script and take a look at the PDF file custom_heatmaps.pdf that will be created in the current working directory:

    ### loading packages
    if (!require("gplots")) {
      install.packages("gplots", dependencies = TRUE)
      library(gplots)
    }
    if (!require("RColorBrewer")) {
      install.packages("RColorBrewer", dependencies = TRUE)
      library(RColorBrewer)
    }

    ### reading in data
    gene_data <- read.csv("arabidopsis_genes.csv")
    row_names <- gene_data[,1]
    gene_data <- data.matrix(gene_data[,2:ncol(gene_data)])
    rownames(gene_data) <- row_names

    ### setting heatmap.2() default parameters
    heat2 <- function(...)
      heatmap.2(gene_data, tracecol = "black", dendrogram = "column",
                Rowv = NA, trace = "none", margins = c(8,10),
                density.info = "density", ...)

    pdf("custom_heatmaps.pdf")

    ### 1) customizing colors
    # 1.1) in-built color palettes
    heat2(col = terrain.colors(n = 1000), main = "1.1) Terrain Colors")

    # 1.2) RColorBrewer palettes
    heat2(col = brewer.pal(n = 9, "YlOrRd"), main = "1.2) Brewer Palette")

    # 1.3) creating own color palettes
    my_colors <- c(y1 = "#F7F7D0", y2 = "#FCFC3A", y3 = "#D4D40D",
                   b1 = "#40EDEA", b2 = "#18B3F0", b3 = "#186BF0",
                   r1 = "#FA8E8E", r2 = "#F26666", r3 = "#C70404")
    heat2(col = my_colors, main = "1.3) Own Color Palette")
    my_palette <- colorRampPalette(c("blue", "yellow", "red"))(n = 1000)
    heat2(col = my_palette, main = "1.3) ColorRampPalette")

    # 1.4) gray scale
    heat2(col = gray(level = (0:100)/100), main = "1.4) Gray Scale")

    ### 2) adding cell notes
    fold_change <- 2^gene_data
    rounded_fold_changes <- round(fold_change, 2)
    heat2(cellnote = rounded_fold_changes, notecex = 0.5, notecol = "black",
          col = my_palette, main = "2) Cell Notes")

    ### 3) adding column side colors
    heat2(ColSideColors = c("red", "gray", "red", rep("green", 13)),
          main = "3) ColSideColors")

    dev.off()

How it works...

Primarily, we will be using read.csv() and heatmap.2() to read data into R and construct our heat maps. In this recipe, however, we will focus on advanced features to enhance our heat maps, such as customizing color and other visual elements:

Inspecting the arabidopsis_genes.csv data set: The arabidopsis_genes.csv file contains a compilation of gene expression data from the model plant Arabidopsis thaliana. I obtained the freely available data of 16 different genes as log2 ratios of target and reference gene from the Arabidopsis eFP Browser (http://bar.utoronto.ca/efp_arabidopsis/). For each gene, expression data of 47 different areas of the plant is available in this data file.
Reading the data and converting it into a numeric matrix: We have to convert the data table into a numeric matrix first before we can construct our heat maps:

    gene_data <- read.csv("arabidopsis_genes.csv")
    row_names <- gene_data[,1]
    gene_data <- data.matrix(gene_data[,2:ncol(gene_data)])
    rownames(gene_data) <- row_names

Creating a customized heatmap.2() function: To reduce typing effort, we define our own version of the heatmap.2() function, in which we include the arguments that we plan to keep using throughout this recipe:

    heat2 <- function(...)
      heatmap.2(gene_data, tracecol = "black", dendrogram = "column",
                Rowv = NA, trace = "none", margins = c(8,10),
                density.info = "density", ...)

So, each time we call our newly defined heat2() function, it will behave like the heatmap.2() function with these arguments preset, plus any additional arguments that we pass along. We also set the tracecol parameter to black, to better distinguish the density plot in the color key from the background.

The built-in color palettes: There are four more color palettes available in base R that we could use instead of the heat.colors palette: rainbow, terrain.colors, topo.colors, and cm.colors. So let us make use of the terrain.colors color palette now, which will give us a nice color transition from green through yellow to rose:

    heat2(col = terrain.colors(n = 1000), main = "1.1) Terrain Colors")

Any value of the parameter n that is larger than the default value of 12 adds more colors, which makes the transition smoother. A value of 1000 for the n parameter should be more than sufficient to make the transition between the individual colors indistinguishable to the human eye.

The following image shows a side-by-side comparison of the heat.colors and terrain.colors color palettes using a different number of color shades:

Further, it is also possible to reverse the direction of the color transition.
For example, if we want to have a heat.colors transition from yellow to red instead of red to yellow in our heat map, we could simply define a reverse function:

    rev_heat.colors <- function(x) rev(heat.colors(x))
    heat2(col = rev_heat.colors(500))

RColorBrewer palettes: A lot of color palettes are available from the RColorBrewer package. To see what they look like, you can type display.brewer.all() into the R command line after loading the RColorBrewer package. However, in contrast to the dynamic range color palettes that we have seen previously, the RColorBrewer palettes have a fixed number of distinct colors. So to select all nine colors from the YlOrRd palette, a gradient from yellow to red, we use the following command:

    heat2(col = brewer.pal(n = 9, "YlOrRd"), main = "1.2) Brewer Palette")

The following image gives you a good overview of all the different color palettes that are available from the RColorBrewer package:

Creating our own color palettes: Next, we will see how we can create our own color palettes. A whole bunch of different colors are already defined in R. An overview of those colors can be seen by typing colors() into the command line of R.

The most convenient way to assign new colors to a color palette is using hex colors (hexadecimal colors). Many different online tools are freely available that allow us to obtain the necessary hex codes. A great example is color picker (http://www.colorpicker.com), which allows us to choose from a rich color table and provides us with the corresponding hex codes.
Once we gather all the hexadecimal codes for the colors that we want to use for our color palette, we can assign them to a variable as we have done before with the explicit color names:

    my_colors <- c(y1 = "#F7F7D0", y2 = "#FCFC3A", y3 = "#D4D40D",
                   b1 = "#40EDEA", b2 = "#18B3F0", b3 = "#186BF0",
                   r1 = "#FA8E8E", r2 = "#F26666", r3 = "#C70404")
    heat2(col = my_colors, main = "1.3) Own Color Palette")

This is a very handy approach for creating a color key with very distinct colors. However, the downside of this method is that we have to provide a lot of different colors if we want to create a smooth color gradient; we used 1000 different colors for the terrain.colors() palette to get a smooth transition in the color key!

Using colorRampPalette for smoother color gradients: A convenient approach to create a smoother color gradient is to use the colorRampPalette() function, so we don't have to insert all the different colors manually. The function takes a vector of different colors as an argument. Here, we provide three colors: blue for the lower end of the color key, yellow for the middle range, and red for the higher end. As we did for the in-built color palettes, such as heat.colors, we assign the value 1000 to the n parameter:

    my_palette <- colorRampPalette(c("blue", "yellow", "red"))(n = 1000)
    heat2(col = my_palette, main = "1.3) ColorRampPalette")

In this case, it is more convenient to use discrete color names over hex colors, since we are using the colorRampPalette() function to create a gradient and do not need all the different shades of a particular color.

Grayscales: It might happen that the medium or device that we use to display our heat maps does not support colors. Under these circumstances, we can use the gray palette to create a heat map that is optimized for those conditions. The level parameter of the gray() function takes a vector with values between 0 and 1 as an argument, where 0 represents black and 1 represents white.
For a smooth gradient, we use a vector with 101 equally spaced gray levels ranging from 0 to 1:

    heat2(col = gray(level = (0:100)/100), main = "1.4) Gray Scale")

We can make use of the same color palettes for the levelplot() function too. It works in a similar way as it did for the heatmap.2() function that we are using in this recipe. However, inside the levelplot() function call, we must use col.regions instead of the simple col to pass a color palette.

Adding cell notes to our heat map: Sometimes, we want to show a data set along with our heat map. A neat way is to use so-called cell notes to display data values inside the individual heat map cells. The underlying data matrix for the cell notes does not necessarily have to be the same numeric matrix we used to construct our heat map, as long as it has the same number of rows and columns.

As we recall, the data we read from arabidopsis_genes.csv represents log2 ratios of sample and reference gene expression levels. Let us calculate the fold changes of the gene expression levels now and display them, rounded to two digits after the decimal point, as cell notes on our heat map:

    fold_change <- 2^gene_data
    rounded_fold_changes <- round(fold_change, 2)
    heat2(cellnote = rounded_fold_changes, notecex = 0.5, notecol = "black",
          col = rev_heat.colors, main = "Cell Notes")

The notecex parameter controls the size of the cell notes. Its default size is 1, and every argument between 0 and 1 will make the font smaller, whereas values larger than 1 will make the font larger. Here, we decreased the font size of the cell notes by 50 percent to fit them into the cell boundaries. Also, we want to display the cell notes in black to have a nice contrast to the colored background; this is controlled by the notecol parameter.

Row and column side colors: Another approach to highlight certain regions, that is, rows or columns of the heat map, is to make use of row and column side colors.
The ColSideColors argument will place a colored box between the dendrogram and heat map that can be used to annotate certain columns. We pass a vector of colors to ColSideColors; its length must be equal to the number of columns of the heat map. Here, we want to color the first and third column red, the second one gray, and all the remaining 13 columns green:

    heat2(ColSideColors = c("red", "gray", "red", rep("green", 13)),
          main = "ColSideColors")

You can see in the following image what the column side colors look like when we include the ColSideColors argument as shown previously:

Attentive readers may have noticed that the order of colors in the column color box slightly differs from the order of colors we passed as a vector to ColSideColors. We see red two times next to each other, followed by a green and a gray box. This is due to the fact that the columns of our heat map have been reordered by the hierarchical clustering algorithm.

Summary

To learn more about similar technology, the following books published by Packt Publishing (https://www.packtpub.com/) are recommended:

- Instant R Starter (https://www.packtpub.com/big-data-and-business-intelligence/instant-r-starter-instant)
- Machine Learning with R - Second Edition (https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-r-second-edition)
- Mastering RStudio – Develop, Communicate, and Collaborate with R (https://www.packtpub.com/application-development/mastering-rstudio-%E2%80%93-develop-communicate-and-collaborate-r)


Packt

20 Jun 2017

15 min read

Analyzing Social Networks with Facebook


This article is by Raghav Bali, Dipanjan Sarkar and Tushar Sharma, the authors of the book Learning Social Media Analytics with R. Having got a good flavor of the various aspects related to the most popular social micro-blogging platform, Twitter, we will now look more closely at the most popular social networking platform, Facebook. More than 1.8 billion monthly active users, over 18 billion dollars in annual revenue, and record-breaking acquisitions of popular products including Oculus, WhatsApp and Instagram have truly made Facebook the core of the social media network today.

Before we put Facebook data under the microscope, let us briefly look at Facebook’s interesting origins! Like many popular products, businesses and organizations, Facebook too had a humble beginning. Starting off as Mark Zuckerberg’s brainchild in 2004, it was initially known as “Thefacebook”, located at thefacebook.com, and was branded as an online social network connecting university and college students. While this social network was only open to Harvard students in the beginning, it expanded within a month to include students from other popular universities. In 2005, the domain facebook.com was purchased and “Facebook” extended its membership to employees of companies and organizations for the first time. Finally, in 2006, Facebook was opened to everyone above 13 years of age with a valid email address. The following snapshot shows us how the look and feel of the Facebook platform has evolved over the years!

Facebook’s evolving look over time

While Facebook has a primary website, also known as a web application, it has also launched mobile applications for the major operating systems on handheld devices.
In short, Facebook is not just a social network website but an entire platform, including a huge social network of people and organizations connected through friends, followers and pages. We will leverage Facebook’s social “Graph API” to access actual Facebook data and perform various analyses. Users, brands, businesses, news channels, media houses, retail stores and many more use Facebook actively on a daily basis to produce and consume content. This generates a vast amount of data, and a substantial amount of it is available to users through its APIs.

From a social media analytics perspective, this is really exciting, because this treasure trove of data, combined with easy-to-access APIs and powerful open source libraries from R, gives us enormous potential and opportunity to get valuable information by analyzing this data in various ways. We will follow a structured path in this article and cover the following major topics sequentially to ensure that you do not get overwhelmed with too much content at once:

- Accessing Facebook data
- Analyzing your personal social network
- Analyzing an English football social network
- Analyzing English football clubs’ brand page engagements

We will use libraries like Rfacebook, igraph and ggplot2 to retrieve, analyze and visualize data from Facebook. All the following sections of the book assume that you have a Facebook account, which is necessary to access data from the APIs and analyze it. In case you do not have an account, do not despair. You can use the data and code files for this article to follow along with the hands-on examples and gain a better understanding of the concepts of social network and engagement analysis.

Accessing Facebook data

You will find a lot of content in several books and on the web about various techniques to access and retrieve data from Facebook.
There are several official ways of doing this, which include using the Facebook Graph API either directly through low-level HTTP-based calls or indirectly through higher-level abstract interfaces belonging to libraries like Rfacebook. Some alternate ways of retrieving Facebook data would be to use registered applications on Facebook like Netvizz or the GetNet application built by Lada Adamic, used in her very popular “Social Network Analysis” course (unfortunately, http://snacourse.com/getnet is not working since Facebook completely changed its API access permissions and privacy settings). Unofficial ways include techniques like web scraping and crawling to extract data. Do note though that Facebook considers this to be a violation of its terms and conditions of accessing data, and you should avoid crawling Facebook for data, especially if you plan to use it for commercial purposes. In this section, we will take a closer look at the Graph API and the Rfacebook package in R. The main focus will be on how you can extract data from Facebook using both of them.

Understanding the Graph API

To start using the Graph API, you need to have an account on Facebook. You can access the API in various ways. You can create an application on Facebook by going to https://developers.facebook.com/apps/ and then create a long-lived OAuth access token using the fbOAuth(…) function from the Rfacebook package. This enables R to make calls to the Graph API, and you can also store this token on disk and load it for future use. An easier way is to create a short-lived token, which lets you access the API data for about two hours, by going to the Facebook Graph API Explorer page at https://developers.facebook.com/tools/explorer and getting a temporary access token from there. The following snapshot depicts how to get an access token for the Graph API from Facebook.
Facebook’s Graph API explorer

On clicking “Get User Access Token” in the above snapshot, you are presented with a list of checkboxes for the various permissions you might need for accessing data, including user data permissions, events, groups and pages, and other miscellaneous permissions. You can select the ones you need and click on the “Get Access Token” button in the prompt. This will generate a new access token in the field depicted in the above snapshot, and you can directly copy and use it to retrieve data in R. Before going into that, we will take a closer look at the Graph API explorer, which allows you to access the API directly from your web browser and helps if you want to do some quick exploratory analysis. A part of it is depicted in the above snapshot. The current version of the API at the time of writing this book is v2.8, which you can see in the snapshot beside the GET resource call.

Interestingly, the Graph API is so named because Facebook by itself can be considered a huge social graph where all the information can be classified into the following three categories:

- Nodes: These are basically users, pages, photos and so on. Nodes indicate a focal point of interest which is connected to other points.
- Edges: These connect various nodes together, forming the core social graph, and these connections are based on various relations like friends, followers and so on.
- Fields: These are specific attributes or properties of nodes; examples would be a user’s address, birthday, name and so on.

As we mentioned before, the API is HTTP based: you make HTTP GET requests to nodes or edges, and all requests are passed to graph.facebook.com to get data. Each node usually has a specific identifier, and you can use it for querying information about a node as depicted in the following snippet.

    GET graph.facebook.com/{node-id}

You can also use edge names in addition to the identifier to get information about the edges of the node.
The following snippet depicts how you can do the same.

    GET graph.facebook.com/{node-id}/{edge-name}

The following snapshot shows us how we can get information about our own profile.

Querying your details in the Graph API explorer

Now suppose I wanted to retrieve information about a Facebook page, “Premier League”, which represents the top tier competition in English football, using its identifier, and also take a look at its liked pages. I can do the same using the following request.

Querying information about a Facebook Page using the Graph API explorer

Thus, from the above figure, you can clearly see the node identifier, page name and likes for the page “Premier League”. It must be clear by now that all API responses are returned in the very popular JSON format, which is easy to parse and format as needed for analysis. Besides this, there also used to be another way of querying the social graph in Facebook, known as FQL or Facebook Query Language, an SQL-like interface for querying and retrieving data. Unfortunately, Facebook seems to have deprecated its use, and hence covering it would be out of our present scope. Now that you have a firm grasp on the syntax of the Graph API and have also seen a few examples of how to retrieve data from Facebook, we will take a closer look at the Rfacebook package.

Understanding Rfacebook

Since we will be accessing and analyzing data from Facebook using R, it makes sense to have some robust mechanism to directly query Facebook and retrieve data instead of going to the browser every time like we did in the earlier section. Fortunately, there is an excellent package in R called Rfacebook, which has been developed by Pablo Barberá. You can either install it from CRAN or get its most updated version from GitHub. The following snippet depicts how you can do the same.
Remember that you might need to install the devtools package, if you don’t have it already, to download and install the latest version of the Rfacebook package from GitHub.

    install.packages("Rfacebook")  # install from CRAN

    # install from GitHub
    library(devtools)
    install_github("pablobarbera/Rfacebook/Rfacebook")

Once you install the package, you can load it using library(Rfacebook) and start using it to retrieve data from Facebook with the access token you generated earlier. The following snippet shows how you can access your own details, as mentioned in the previous section, but this time by using R.

    > token = 'XXXXXX'
    > me <- getUsers("me", token=token)
    > me$name
    [1] "Dipanjan Sarkar"
    > me$id
    [1] "1026544"

The beauty of this package is that you directly get the results in curated and neatly formatted data frames, and you do not need to spend extra time trying to parse the raw JSON response objects from the Graph API. The package is well documented and has high-level functions for accessing personal profile data on Facebook as well as page and group level data points. We will now take a quick look at Netvizz, a Facebook application which can also be used to extract data easily from Facebook.

Understanding Netvizz

The Netvizz application was developed by Bernhard Rieder. It is a tool which can be used to extract data from Facebook pages and groups, get statistics about links, and also extract social networks from Facebook pages based on the liked pages of each connected page in the network. You can access Netvizz at https://apps.facebook.com/netvizz/ and, on registering the application on your profile, you will be able to see the following screen.

The Netvizz application interface

From the above app snapshot, you can see that there are various links based on the type of operation you want to execute to extract data.
Feel free to play around with this tool; we will be using its “page like network” capability in one of our analyses in a later section.

Data Access Challenges

There are several challenges with regard to accessing data from Facebook. Some of the major issues and caveats are as follows:

- Facebook will keep evolving and updating its data access APIs, and this can and will lead to changes in, and deprecation of, older APIs and access patterns, just as FQL was deprecated.
- The scope of available data keeps changing over time as Facebook’s API and privacy settings evolve. For instance, we can no longer get details of all our friends from the API.
- Libraries and tools built on top of the API tend to break when Facebook’s APIs change, and this has happened before with Rfacebook as well as Netvizz. Besides this, Lada Adamic’s GetNet application has stopped working permanently ever since Facebook changed the way apps are created and the permissions they require. You can get more information about it at http://thepoliticsofsystems.net/2015/01/the-end-of-netvizz/

Thus, what is used in the book today for data retrieval might not work completely tomorrow if the APIs change, though it is expected to keep working for at least the next couple of years. However, to prevent any hindrance in analyzing Facebook data, we have provided the datasets we used in most of our analyses, except personal networks, so that you can still follow along with each example and use case. Personal names have been anonymized wherever possible to protect privacy.

Now that we have a good idea about Facebook’s Graph API and how to access data, let’s analyze some social networks!

Analyzing your personal social network

As mentioned before, Facebook by itself is a massive social graph, connecting billions of users, brands, and organizations. Consider your own Facebook account if you have one.
You will have several friends who are your immediate connections; they in turn will have their own sets of friends, including you, and you might be friends with some of them, and so on. Thus, you and your friends form the nodes of the network, and the edges represent the connections. In this section, we will analyze a small network of you and your immediate friends and also look at how we can extract and analyze some properties of the network.

Before we jump into our analysis, we will start by loading the necessary packages, which are listed in the following snippet, and storing the Facebook Graph API access token in a variable.

    library(Rfacebook)
    library(gridExtra)
    library(dplyr)
    # get the Graph API access token
    token = 'XXXXXXXXXXX'

You can refer to the file fb_personal_network_analysis.R for the code snippets used in the examples in this section.

Basic descriptive statistics

In this section, we will get some basic information and descriptive statistics from our personal social network on Facebook. To start with, let us look at some details of our own profile on Facebook using the following code.

    # get my personal information
    me <- getUsers("me", token=token, private_info = TRUE)
    View(me[c('name', 'id', 'gender', 'birthday')])

This shows us a few fields from the data frame containing our personal details retrieved from Facebook. We use the View function, which invokes a spreadsheet-style data viewer on R objects such as data frames.

Now, let us get information about the friends in our personal network. Do note that Facebook currently only lets you access information about those friends who have allowed access to the Graph API, and hence you may not be able to get information pertaining to all the friends in your friend list. We have anonymized their names below for privacy reasons.
    anonymous_names <- c('Johnny Juel', 'Houston Tancredi',..., 'Julius Henrichs', 'Yong Sprayberry')
    # getting friends information
    friends <- getFriends(token, simplify=TRUE)
    friends$name <- anonymous_names
    # view top few rows
    View(head(friends))

This gives us a peek at some people from the list of friends we just retrieved from Facebook. Let’s now analyze some descriptive statistics based on personal information about our friends, such as where they are from, their gender, and so on.

    # get personal information
    friends_info <- getUsers(friends$id, token, private_info = TRUE)
    # get the gender of your friends
    View(table(friends_info$gender))

This gives us the gender of my friends. It looks like more male friends have authorized access to the Graph API in my network!

    # get the location of your friends
    View(table(friends_info$location))

This depicts the location of my friends (wherever available) in the following data frame.

    # get relationship status of your friends
    View(table(friends_info$relationship_status))

From the statistics in the following table, I can see that a lot of my friends have gotten married over the past couple of years. Boy, that does make me feel old! Suppose I want to look at the relationship status of my friends grouped by gender. We can do so using the following snippet.

    # get relationship status of friends grouped by gender
    View(table(friends_info$relationship_status, friends_info$gender))

The following table gives us the desired results, and you can see the distribution of friends by their gender and relationship status.

Summary

This article covered some basic analytics of social networks with the help of R, and introduced the R packages and tools used to retrieve the data.

Resources for Article:

Further resources on this subject:
How to integrate social media with your WordPress website [article]
Social Media Insight Using Naive Bayes [article]
Social Media in Magento [article]



Prasad Ramesh

21 Nov 2018

4 min read

The US Department of Commerce wants to regulate export of AI and related products


This Monday, the Department of Commerce's Bureau of Industry and Security (BIS) published a proposal to control the export of AI from the USA. This move seems to lean towards restricting AI tech going out of the country in order to protect the national security of the USA.

The areas that come under the licensing proposal

Artificial intelligence, as we’ve seen in recent years, has great potential for both good and harm. The DoC in the United States of America is not taking any chances with it. The proposal lists many areas of AI that could potentially require a license to be exported to certain countries. Besides computer vision and natural language processing, military-specific products like adaptive camouflage and faceprint technology for surveillance are also listed in the proposal to restrict the export of AI. The major areas listed in the proposal are:

- Biotechnology, including genomic and genetic engineering
- Artificial intelligence (AI) and machine learning, including neural networks, computer vision, and natural language processing
- Position, Navigation, and Timing (PNT) technology
- Microprocessor technology like stacked memory on chip
- Advanced computing technology like memory-centric logic
- Data analytics technology like data analytics by visualization and analysis algorithms
- Quantum information and sensing technology like quantum computing, encryption, and sensing
- Logistics technology like mobile electric power
- Additive manufacturing like 3D printing
- Robotics like micro drones and molecular robotics
- Brain-computer interfaces like mind-machine interfaces
- Hypersonics like flight control algorithms
- Advanced materials like adaptive camouflage
- Advanced surveillance technologies like faceprint and voiceprint technologies

David Edelman, a former adviser to ex-US president Barack Obama, said: “This is intended to be a shot across the bow, directed specifically at Beijing, in an attempt to flex their muscles on just how broad these restrictions could be”.
Countries that could be affected by regulation on the export of AI

To determine the level of export controls, the department will consider the potential end-uses and end-users of the technology. The list of countries is not clear, but those to which exports are already restricted, such as embargoed countries, will be considered. China could also be one of them.

What does this mean for companies?

If your organization creates products in ‘emerging technologies’, then there will be restrictions on the countries you can export to and also on the disclosure of technology to foreign nationals in the United States. Depending on the criteria, non-US citizens might even need licenses to participate in research and development of such technology. This will restrict non-US citizens from participating in, and taking anything back from, say, an advanced AI research project. If the new regulations go into effect, they will affect the security review of foreign investments across these areas. When the list of technologies is finalized, many types of foreign investments will be subject to a review, and deals could be halted or undone.

Public views on academic research

In addition to commercial applications and products, this regulation could also be bad news for academic research.

https://twitter.com/jordanbharrod/status/1065047269282627584
https://twitter.com/BryanAlexander/status/1064941028795400193

Even Google Home, Amazon Alexa, and iRobot Roomba could be affected.

https://twitter.com/R_D/status/1064511113956655105

But it does not look like research papers will really be affected. The document states that Commerce does not intend to expand jurisdiction over ‘fundamental research’ for ‘emerging technologies’ that is intended to be published and is not currently subject to the EAR as per § 734.8. But will this affect open-source technologies? We really hope not.
Deadline for comments is less than 30 days away

BIS has invited comments on the proposal covering how to define and categorize emerging technologies and the impact of the controls on US technology leadership, among other topics. However, the short deadline of December 19, 2018 indicates their haste to implement licensing for the export of AI. For more details, and to know where you can submit your comments, read the proposal.

The US Air Force lays groundwork towards artificial general intelligence based on hierarchical model of intelligence
Google open sources BERT, an NLP pre-training technique
Teaching AI ethics – Trick or Treat?



Packt

23 Jun 2015

23 min read

Moving Further with NumPy Modules




Sugandha Lahoti

01 Dec 2017

7 min read

Getting started with Machine Learning in H2O


Note: We present to you an excerpt from the book Mastering Java Machine Learning by Dr. Uday Kamath and Krishna Choppella. The book aims to give you an array of advanced techniques in Machine Learning.

This article talks about using H2O as a Machine Learning platform for Big Data applications. H2O is a leading open source platform for Machine Learning at Big Data scale, with a focus on bringing AI to the enterprise. The company counts several leading lights in statistical learning theory and optimization among its scientific advisors. It supports programming environments in multiple languages.

H2O architecture

The following figure gives a high-level architecture of H2O with its important components. H2O can access data from various data stores such as HDFS, SQL, NoSQL, and Amazon S3, to name a few. The most popular deployment of H2O is to use one of the deployment stacks with Spark or to run it in an H2O cluster itself.

The core of H2O is an optimized way of handling Big Data in memory, so that iterative algorithms that go through the same data can be handled efficiently and achieve good performance. Important Machine Learning algorithms in supervised and unsupervised learning are implemented specially to handle horizontal scalability across multiple nodes and JVMs. H2O not only provides its own user interface, known as Flow, to manage and run modeling tasks, but also has language bindings and connector APIs for Java, R, Python, and Scala.

Most Machine Learning algorithms, optimization algorithms, and utilities use the concept of fork-join or MapReduce. As shown in the figure below, the entire dataset is considered a Data Frame in H2O and comprises vectors, which are the features or columns in the dataset. The rows or instances are made up of one element from each vector, arranged side by side. The rows are grouped together to form a processing unit known as a Chunk.
Several chunks are combined in one JVM. Any algorithmic or optimization work begins by sending the information from the topmost JVM to fork on to the next JVM, then on to the next, and so on, similar to the map operation in MapReduce. Each JVM works on the rows in its chunks to carry out the task, and finally the results flow back in the reduce operation.

Machine learning in H2O

The following figure shows all the Machine Learning algorithms supported in H2O v3 for supervised and unsupervised learning.

Tools and usage

H2O Flow is an interactive web application that helps data scientists perform various tasks, from importing data to running complex models, using point-and-click and wizard-based concepts. H2O is run in local mode as:

    java -Xmx6g -jar h2o.jar

The default way to start Flow is to point your browser to the address H2O reports at startup, for example http://192.168.1.7:54321/ (54321 is the default port; on the same machine, http://localhost:54321/ also works). The right side of Flow captures every user action performed under the OUTLINE tab. The actions taken can be edited and saved as named flows for reuse and collaboration, as shown in the figure below.

The figure below shows the interface for importing files from the local filesystem or HDFS; it displays detailed summary statistics as well as the next actions that can be performed on the dataset. Once the data is imported, it gets a data frame reference in the H2O framework with the extension .hex. The summary statistics are useful in understanding the characteristics of the data, such as missing values, mean, max, min, and so on. Flow also has an easy way to transform features from one type to another, for example, numeric features with a few unique values to the categorical/nominal type known as enum in H2O.

The actions that can be performed on the datasets are:

- Visualize the data.
- Split the data into different sets such as training, validation, and testing.
- Build supervised and unsupervised models.
- Use the models to predict.
- Download and export the files in various formats.
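The fork-join flow over chunks described in the architecture section can be illustrated outside H2O with a small, single-process Python sketch. This is not H2O code: the function names (split_into_chunks, map_chunk, combine, column_mean) and the chunk size are hypothetical, and the "workers" here simply run one after another. It only shows the shape of the idea, in which each chunk produces a partial result and the partial results are merged in a reduce step.

```python
from functools import reduce


def split_into_chunks(rows, chunk_size):
    """Group consecutive rows into chunks; the last chunk may be shorter."""
    return [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]


def map_chunk(chunk, column):
    """Map step: compute a partial (sum, count) over one chunk only."""
    values = [row[column] for row in chunk]
    return (sum(values), len(values))


def combine(left, right):
    """Reduce step: merge two partial (sum, count) results."""
    return (left[0] + right[0], left[1] + right[1])


def column_mean(rows, column, chunk_size=2):
    """Mean of one column, computed chunk by chunk and then reduced."""
    chunks = split_into_chunks(rows, chunk_size)
    # In a real cluster each chunk would be processed by a different worker.
    partials = [map_chunk(chunk, column) for chunk in chunks]
    total, count = reduce(combine, partials, (0, 0))
    return total / count


rows = [{"x": 1.0}, {"x": 2.0}, {"x": 3.0}, {"x": 4.0}, {"x": 5.0}]
mean_x = column_mean(rows, "x")
print(mean_x)
```

Because each partial result is just a (sum, count) pair, the reduce step does not need to see the original rows, which is what makes this pattern suitable for data spread across several machines.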
Building supervised or unsupervised models in H2O is done through an interactive screen. Every modeling algorithm has its parameters classified into three sections: basic, advanced, and expert. Any parameter that supports hyper-parameter searches for tuning the model has a grid checkbox next to it, and more than one parameter value can be used.

Some basic parameters, such as training_frame, validation_frame, and response_column, are common to every supervised algorithm; others are specific to model types, such as the choice of solver for GLM, the activation function for deep learning, and so on. All such common parameters are available in the basic section. Advanced parameters are settings that afford greater flexibility and control to the modeler if the default behavior must be overridden. Several of these parameters are also common across some algorithms; two examples are the choice of method for assigning the fold index (if cross-validation was selected in the basic section) and the selection of the column containing weights (if each example is weighted separately). Expert parameters define more complex elements, such as how to handle missing values, model-specific parameters that need more than a basic understanding of the algorithms, and other esoteric variables.

In the figure below, GLM, a supervised learning algorithm, is being configured with 10-fold cross-validation, binomial (two-class) classification, the efficient LBFGS optimization algorithm, and stratified sampling for the cross-validation split.

The model results screen contains a detailed analysis of the results using important evaluation charts, depending on the validation method that was used. At the top of the screen are the possible actions that can be taken, such as running the model on unseen data for prediction, downloading the model in POJO format, exporting the results, and so on.
Some of the charts are algorithm-specific, like the scoring history, which shows how the training loss or the objective function changes over the iterations in GLM. This gives the user insight into the speed of convergence as well as into the tuning of the iterations parameter. We see the ROC curves and the Area Under Curve metric on the validation data, in addition to the gains and lift charts, which give the cumulative capture rate and cumulative lift over the validation sample, respectively. The figure below shows the SCORING HISTORY, ROC CURVE, and GAINS/LIFT charts for GLM with 10-fold cross-validation on the CoverType dataset.

The output of validation gives detailed evaluation measures such as accuracy, AUC, err, errors, F1 measure, MCC (Matthews Correlation Coefficient), precision, and recall for each validation fold in the case of cross-validation, as well as the mean and standard deviation computed across all folds.

The prediction action runs the model on unseen held-out data to estimate the out-of-sample performance. Important measures such as errors, accuracy, area under curve, ROC plots, and so on are given as the output of predictions and can be saved or exported.

H2O is a rich visualization and analysis framework that can read data from multiple data stores (HDFS, SQL, NoSQL, S3, and others) and is accessible from multiple programming environments. It also supports a number of Machine Learning algorithms that can be run in a cluster. All these factors make it one of the major Machine Learning frameworks for Big Data.

If you found this post useful, do check out our book Mastering Java Machine Learning to learn more about predictive models for batch- and stream-based big data learning using the latest tools and methodologies.


Packt

24 Sep 2015

11 min read

Integration with Spark SQL


In this article by Sumit Gupta, the author of the book Learning Real-time Processing with Spark Streaming, we will discuss the integration of Spark Streaming with other advanced Spark libraries such as Spark SQL.

No single piece of software in today's world can fulfill the varied, versatile, and complex needs of enterprises, and to be honest, neither should it! Software is built to fulfill specific needs arising within an enterprise at a particular point in time, and those needs may change in the future due to many other factors. These factors, such as government policies and business/market dynamics, may or may not be under the enterprise's control. Considering all these factors, the integration and interoperability of any software system with internal/external systems is pivotal in fulfilling enterprise needs.

Integration and interoperability are categorized as nonfunctional requirements, which are always implicit and may or may not be explicitly stated by end users. Over time, architects have realized the importance of these implicit requirements in modern enterprises, and now all enterprise architectures give due diligence to, and make provisions for, fulfilling these requirements. Even enterprise architecture frameworks such as The Open Group Architecture Framework (TOGAF) define a specific set of procedures and guidelines for defining and establishing the interoperability and integration requirements of modern enterprises.

The Spark community realized the importance of both these factors and provided a versatile and scalable framework with certain hooks for integration and interoperability with different systems/libraries. For example, data consumed and processed via Spark streams can also be loaded into a structured (table: rows/columns) format and can be further queried using SQL.
The data can even be stored in the form of Hive tables in HDFS as persistent tables, which will exist even after our Spark program has restarted. In this article, we will discuss querying streaming data in real time using Spark SQL.

Querying streaming data in real time

Spark Streaming is developed on the principle of integration and interoperability: it not only provides a framework for consuming data in near real time from varied data sources, but also provides integration with Spark SQL, where existing DStreams can be converted into a structured data format for querying using standard SQL constructs.

There are many use cases where SQL on streaming data is a much-needed feature. For example, in our distributed log analysis use case, we may need to combine precomputed datasets with the streaming data to perform exploratory analysis using interactive SQL queries. This is difficult to implement with streaming operators alone, as they are not designed for introducing new datasets and performing ad hoc queries.

Moreover, SQL's success at expressing complex data transformations derives from the fact that it is based on a set of very powerful data processing primitives for filtering, merging, correlation, and aggregation. These primitives are not available in lower-level programming languages such as Java/C++, where their absence may result in long development cycles and high maintenance costs.

Let's move forward and first understand a few things about Spark SQL; then we will see the process of converting existing DStreams into structured formats.

Understanding Spark SQL

Spark SQL is one of the modules developed over the Spark framework for processing structured data, which is stored in the form of rows and columns. At a very high level, it is similar to data residing in an RDBMS in the form of rows and columns, over which SQL queries are executed to perform analysis, but Spark SQL is much more versatile and flexible than an RDBMS.
Spark SQL provides distributed processing of SQL queries and can be compared to frameworks such as Hive, Impala, or Drill. Here are a few notable features of Spark SQL:

- Spark SQL is capable of loading data from a variety of data sources such as text files, JSON, Hive, HDFS, Parquet format, and of course RDBMS too, so that we can consume, join, and process datasets from different and varied data sources.
- It supports static and dynamic schema definition for the data loaded from various sources, which helps in defining the schema for known data structures/types, and also for those datasets where the columns and their types are not known until runtime.
- It can work as a distributed query engine using the Thrift JDBC/ODBC server or the command-line interface, where end users or applications can interact with Spark SQL directly to run SQL queries.
- Spark SQL provides integration with Spark Streaming, where DStreams can be transformed into a structured format and SQL queries can then be executed.
- It is capable of caching tables using an in-memory columnar format for faster reads and in-memory data processing.
- It supports schema evolution, so that columns can be added to or deleted from the existing schema while Spark SQL still maintains compatibility between all versions of the schema.

Spark SQL defines a higher level of programming abstraction called DataFrames, which is also an extension to the existing RDD API. Data frames are distributed collections of objects in the form of rows and named columns, similar to tables in an RDBMS, but with much richer functionality containing all the previously listed features. The DataFrame API is inspired by the concepts of data frames in R (http://www.r-tutor.com/r-introduction/data-frame) and Python (http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe).
Let's move ahead and understand how Spark SQL works with the help of an example.

As a first step, let's create sample JSON data containing basic information about a company's departments, such as Name, Employees, and so on, and save this data into the file company.json. The JSON file would look like this:

    [
      { "Name":"DEPT_A", "No_Of_Emp":10, "No_Of_Supervisors":2 },
      { "Name":"DEPT_B", "No_Of_Emp":12, "No_Of_Supervisors":2 },
      { "Name":"DEPT_C", "No_Of_Emp":14, "No_Of_Supervisors":3 },
      { "Name":"DEPT_D", "No_Of_Emp":10, "No_Of_Supervisors":1 },
      { "Name":"DEPT_E", "No_Of_Emp":20, "No_Of_Supervisors":5 }
    ]

You can use any online JSON editor such as http://codebeautify.org/online-json-editor to see and edit the data defined in the preceding JSON code.

Next, let's extend our Spark-Examples project and create a new package by the name chapter.six, and within this new package, create a new Scala object and name it ScalaFirstSparkSQL.scala.

Next, add the following import statements just below the package declaration:

    import org.apache.spark.SparkConf
    import org.apache.spark.SparkContext
    import org.apache.spark.sql._
    import org.apache.spark.sql.functions._

Further, in your main method, add the following set of statements to create a SQLContext from the SparkContext:

    //Creating Spark Configuration
    val conf = new SparkConf()
    //Setting Application/ Job Name
    conf.setAppName("My First Spark SQL")
    // Define Spark Context which we will use to initialize our SQL Context
    val sparkCtx = new SparkContext(conf)
    //Creating SQL Context
    val sqlCtx = new SQLContext(sparkCtx)

SQLContext, or any of its descendants such as HiveContext (for working with Hive tables) or CassandraSQLContext (for working with Cassandra tables), is the main entry point for accessing all functionalities of Spark SQL. It allows the creation of data frames, and also provides functionality to fire SQL queries over data frames.
Next, we will define the following code to load the JSON file (company.json) using the SQLContext, and we will also create a data frame:

    //Define path of your JSON File (company.json) which needs to be processed
    val path = "/home/softwares/spark/data/company.json";
    //Use SQLContext and Load the JSON file.
    //This will return the DataFrame which can be further Queried using SQL queries.
    val dataFrame = sqlCtx.jsonFile(path)

In the preceding piece of code, we used the jsonFile(…) method for loading the JSON data. There are other utility methods defined by SQLContext for reading raw data from the filesystem, creating data frames from an existing RDD, and many more.

Spark SQL supports two different methods for converting existing RDDs into data frames. The first method uses reflection to infer the schema of an RDD from the given data. This approach leads to more concise code and helps in instances where we already know the schema while writing the Spark application. We have used the same approach in our example. The second method is through a programmatic interface that allows us to construct a schema, apply it to an existing RDD, and finally generate a data frame. This method is more verbose, but provides flexibility and helps in those instances where the columns and data types are not known until the data is received at runtime. Refer to https://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.sql.SQLContext for a complete list of methods exposed by SQLContext.

Once the DataFrame is created, we need to register it as a temporary table within the SQL context so that we can execute SQL queries over the registered table. Let's add the following piece of code for registering our DataFrame with our SQL context and name it company:

    //Register the data as a temporary table within SQL Context
    //Temporary table is destroyed as soon as SQL Context is destroyed.
    dataFrame.registerTempTable("company");

And we are done. Our JSON data is automatically organized into a table (rows/columns) and is ready to accept SQL queries. Even the data types are inferred from the type of data entered within the JSON file itself.

Now, we will start executing SQL queries on our table, but before that let's see the schema created by SQLContext:

    //Printing the Schema of the Data loaded in the Data Frame
    dataFrame.printSchema();

The execution of the preceding statement will provide results similar to the following illustration.

The preceding illustration shows the schema of the JSON data loaded by Spark SQL. Pretty simple and straightforward, isn't it? Spark SQL has automatically created our schema based on the data defined in our company.json file. It has also defined the data type of each of the columns. We can also define the schema using reflection (https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#inferring-the-schema-using-reflection) or define it programmatically (https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#programmatically-specifying-the-schema).
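The two ideas above, inferring column types from the JSON values and registering the records as a table that accepts SQL, can be tried without a Spark installation. The following is a minimal stand-in that uses Python's standard json and sqlite3 modules instead of Spark SQL. The helper names (infer_schema, register_table) and the type mapping are assumptions made for illustration only, and SQLite is a single-process database, so this shows the workflow rather than Spark's distributed execution.

```python
import json
import sqlite3

# Same sample records as company.json in the article.
COMPANY_JSON = """
[
  {"Name": "DEPT_A", "No_Of_Emp": 10, "No_Of_Supervisors": 2},
  {"Name": "DEPT_B", "No_Of_Emp": 12, "No_Of_Supervisors": 2},
  {"Name": "DEPT_C", "No_Of_Emp": 14, "No_Of_Supervisors": 3},
  {"Name": "DEPT_D", "No_Of_Emp": 10, "No_Of_Supervisors": 1},
  {"Name": "DEPT_E", "No_Of_Emp": 20, "No_Of_Supervisors": 5}
]
"""

# Illustrative mapping from Python value types to SQLite column types.
SQL_TYPES = {str: "TEXT", int: "INTEGER", float: "REAL"}


def infer_schema(records):
    """Infer {column: sql_type} from the values of the first record."""
    return {name: SQL_TYPES[type(value)] for name, value in records[0].items()}


def register_table(conn, table, records):
    """Create a table from the inferred schema and load all records into it."""
    schema = infer_schema(records)
    columns = ", ".join(f"{name} {sql_type}" for name, sql_type in schema.items())
    conn.execute(f"CREATE TABLE {table} ({columns})")
    placeholders = ", ".join("?" for _ in schema)
    conn.executemany(
        f"INSERT INTO {table} VALUES ({placeholders})",
        [tuple(record[name] for name in schema) for record in records],
    )
    return schema


records = json.loads(COMPANY_JSON)
conn = sqlite3.connect(":memory:")  # the table disappears when the connection closes
schema = register_table(conn, "company", records)
total_employees = conn.execute("SELECT SUM(No_Of_Emp) FROM company").fetchone()[0]

print(schema)
print(total_employees)
```

Like a temporary table in a SQL context, the in-memory SQLite table here lives only as long as its connection, which is a loose analogy rather than an equivalence.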
Next, let's execute some SQL queries to see the data stored in the DataFrame. The first query prints all records:

    //Executing SQL Queries to Print all records in the DataFrame
    println("Printing All records")
    sqlCtx.sql("Select * from company").collect().foreach(println)

The execution of the preceding statement will produce the following results on the console where the driver is executed.

Next, let's select only a few columns instead of all of them and print them on the console:

    //Executing SQL Queries to Print Name and Employees
    //in each Department
    println("\n Printing Number of Employees in All Departments")
    sqlCtx.sql("Select Name, No_Of_Emp from company").collect().foreach(println)

The execution of the preceding statement will produce the following results on the console where the driver is executed.

Now, finally, let's do some aggregation and count the total number of employees across the departments:

    //Using the aggregate function (agg) to print the
    //total number of employees in the Company
    println("\n Printing Total Number of Employees in Company_X")
    val allRec = sqlCtx.sql("Select * from company").agg(Map("No_Of_Emp"->"sum"))
    allRec.collect.foreach ( println )

In the preceding piece of code, we used the agg(…) function to compute the sum of all employees across the departments; sum can be replaced by avg, max, min, or count. The execution of the preceding statement will produce the following results on the console where the driver is executed.

The preceding image shows the results of executing the aggregation on our company.json data. Refer to the DataFrame API at https://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.sql.DataFrame for further information on the available functions for performing aggregation.

As a last step, we will stop our Spark SQL context by invoking the stop() function on the SparkContext: sparkCtx.stop().
This is required so that your application can notify the master or resource manager to release all resources allocated to the Spark job. It also ensures the graceful shutdown of the job and avoids any resource leakage that might otherwise happen. Also, as of now, there can be only one Spark context active per JVM, and we need to stop() the active SparkContext before creating a new one.

Summary

In this article, we have seen the step-by-step process of using Spark SQL as a standalone program. Though we have used JSON files as an example, we can also leverage Spark SQL with Cassandra (https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md), MongoDB (https://github.com/Stratio/spark-mongodb), or Elasticsearch (http://chapeau.freevariable.com/2015/04/elasticsearch-and-spark-1-dot-3.html).

Resources for Article:

Further resources on this subject:
Getting Started with Apache Spark DataFrames [article]
Sabermetrics with Apache Spark [article]
Getting Started with Apache Spark [article]


Packt

07 Oct 2015

14 min read

Integrating Elasticsearch with the Hadoop ecosystem


Packt

29 Aug 2013

7 min read

What are SSAS 2012 dimensions and cube?


What is SSAS?

SQL Server Analysis Services (SSAS) is an online analytical processing (OLAP) tool that greatly speeds up the kinds of queries and calculations common in business intelligence environments. It looks like a relational database, but it has important differences. SSAS does not replace relational databases; rather, combining the two helps in developing business intelligence solutions.

Why do we need SSAS?

SSAS provides a very clear graphical interface for end users to build queries. It also acts as a kind of cache that we can use to speed up reporting. In most real scenarios where SSAS is used, it holds a full copy of the data in the data warehouse, and all reporting and analytic queries are run against SSAS rather than against the relational database. Modern relational databases include many features aimed specifically at BI reporting, but SSAS is a database service designed specifically for this type of workload, and in most cases it achieves much better query performance.

SSAS 2012 architecture

The first and most important point to make about the architecture of SSAS 2012 is that it is really two products in one package. It has had a few advancements relating to performance, scalability, and manageability, and the new part of SSAS, which closely resembles PowerPivot, uses the tabular model. When installing SSAS, we must select either the tabular model or the multidimensional model for the instance that runs on the server. Both data models share the same code base, but they are sometimes treated as separate products. The concepts involved in designing the two data models are different, and we can't turn a tabular database into a multidimensional database, or vice versa, without rebuilding everything from the start. From the end user's point of view, both data models do almost the same things and look almost identical when used through a client tool such as Excel.

The tabular model

Building a database with the tabular model is conceptually very similar to building one in a relational database. An instance of Analysis Services can hold many databases, and each database can be looked upon as a self-contained collection of objects and data relating to a single business solution. If we are writing reports or analyzing data and find that we need to run queries on multiple databases, we have probably made a design mistake somewhere, because everything we need should be contained within an individual database. Tabular models are designed using SQL Server Data Tools (SSDT), and a project in SSDT maps onto a database in Analysis Services.

The multidimensional model

At the database level, this data model is very similar to the tabular model: data is organized in databases, databases are designed in SSDT, and they are in turn managed using SQL Server Management Studio. The differences become apparent below the database level, where multidimensional rather than relational concepts apply. In the multidimensional model, data is modeled as a series of cubes and dimensions, not tables.

The future of Analysis Services

We have two data models inside SSAS, along with two query and calculation languages; this is clearly not an ideal state of affairs. It means we have to select a data model at the start of our project, when we might not yet know enough about our needs to gauge which one is appropriate. It also means that anyone who decides to specialize in SSAS has to learn two technologies. Microsoft has said very clearly that the multidimensional model is not being scrapped and that the tabular model is not its replacement, which implies that new features for the multidimensional data model will be released in future versions of SSAS. The fact that the tabular and multidimensional data models share some of the same code suggests that some new features could easily be developed for both models simultaneously.

What's new in SSAS 2012?

There is no easy way of converting a multidimensional data model into a tabular data model. There may be tools on the market that claim to make this transition with a few mouse clicks, but such tools could only ever work for very simple multidimensional data models and would not save much development time. Therefore, if we already have a mature multidimensional implementation and the in-house skills to develop and maintain it, we should weigh the following characteristics of SSAS 2012 carefully.

Ease of use

If we are starting an SSAS 2012 project with no previous multidimensional or OLAP experience, it is very likely that we will find the tabular model much easier to learn than the multidimensional one. Not only are the concepts much easier to understand, especially if we are used to working with relational databases, but the development process is also much more straightforward and there are far fewer features to learn.

Compatibility with PowerPivot

The tabular data model and PowerPivot are the same in the way their models are designed. The user interfaces are practically the same, and both use DAX. PowerPivot models can be imported into SQL Server Data Tools to generate a tabular model, although the process does not work the other way around: a tabular model cannot be converted to a PowerPivot model.

Processing performance characteristics

Comparing the processing performance of the multidimensional and tabular data models is difficult. It may be slower to process a large table in the tabular data model than the equivalent measure group in a multidimensional one, because a tabular data model can't process partitions in the same table at the same time, whereas a multidimensional model can process partitions in the same measure group at the same time.

What is an SSAS dimension?

A database dimension is a collection of related objects, called attributes, that provide information about the fact data in one or more cubes. Typical attributes in a product dimension are product name, product category, line, size, and price. Attributes can be organized into user-defined hierarchies that provide paths to assist users when they browse through the data in a cube. By default, these attributes are visible as attribute hierarchies, and they can be used to understand the fact data in a cube.

What is an SSAS cube?

A cube is a multidimensional structure that contains information for analytical purposes; the main constituents of a cube are dimensions and measures. Dimensions define the structure of the cube that you use to slice and dice the data, and measures provide the aggregated numerical values of interest to the end user. As a logical structure, a cube allows a client application to retrieve the values of measures as if they were contained in cells in the cube. The cells are defined for every possible summarized value. A cell in the cube is defined by the intersection of dimension members and contains the aggregated values of the measures at that specific intersection.

Summary

We talked about the special new features and services present, what you can do with them, and why they're so great.

Resources for Article:

Further resources on this subject:
Creating an Analysis Services Cube with Visual Studio 2008 - Part 1 [Article]
Performing Common MDX-related Tasks [Article]
How to Perform Iteration on Sets in MDX [Article]
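The idea that a cube cell sits at the intersection of dimension members and holds an aggregated measure can be sketched in a few lines of Python. This is only a conceptual illustration, not how SSAS stores or processes data: the fact rows, the two dimensions (product and year), the "All" member, and the build_cube helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical fact rows: (product, year, sales_amount).
facts = [
    ("Bike", 2011, 100.0),
    ("Bike", 2012, 150.0),
    ("Helmet", 2012, 20.0),
    ("Bike", 2012, 50.0),
]


def build_cube(fact_rows):
    """Sum the measure into every cell, including the 'All' roll-up members."""
    cube = defaultdict(float)
    for product, year, amount in fact_rows:
        # Each fact contributes to its own cell and to the summarized cells.
        for product_member in (product, "All"):
            for year_member in (year, "All"):
                cube[(product_member, year_member)] += amount
    return dict(cube)


cube = build_cube(facts)

# A cell is addressed by one member from each dimension.
print(cube[("Bike", 2012)])
print(cube[("Bike", "All")])
print(cube[("All", "All")])
```

Looking up a cell by its tuple of members is the "slice and dice" step; the aggregation has already happened when the cube was built.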


Packt

03 Jan 2017

15 min read

Dimensionality Reduction


Packt

21 Oct 2010

5 min read

Oracle Business Intelligence: Drilling Data Up and Down


What is data drilling?

In terms of Oracle Discoverer, drilling is a technique that enables you to quickly navigate through worksheet data, finding the answers to the questions facing your business. Depending on your needs, you can use drilling to view the data you're working with in deeper detail or, in contrast, drill up to a higher level. The drilling to detail technique enables you to look at the values making up a particular summary value. You can also drill to related items, adding related information that is not currently included in the worksheet. Discoverer supports a set of drilling tools, including the following:

Drilling up and down
Drilling to a related item
Drilling to detail
Drilling out

The following sections cover these tools in detail, providing examples of how you might use them.

Drilling to a related item

Let's begin with a discussion of how to drill to a related item, adding the detailed information for a certain item. As usual, this is best understood by example. Suppose you want to drill from the Maya Silver item, which can be found on the left axis of the worksheet, to the Orddate:Day item. Here are the steps to follow:

Let's first create a copy of the worksheet to work with in this example. To do this, move to the worksheet discussed in the preceding example and select the Edit | Duplicate Worksheet | As Crosstab menu of Discoverer. In the Duplicate as Crosstab dialog, just click OK. As a result, a copied worksheet should appear in the workbook.

On the worksheet, right-click the Maya Silver item and select Drill… in the pop-up menu. As a result, the Drill dialog should appear.

In the Drill dialog, select Drill to a Related Item in the Where do you want to drill to? select box and then choose the Orddate:Day item. Then, click OK to close the dialog and rearrange the data on the worksheet. The reorganized worksheet shows the Maya Silver item broken down into day sales per product.

Now suppose you want to see a more detailed view of the Maya Silver item and break it out further into product category. Right-click the Maya Silver item and select Drill… in the pop-up menu. In the Drill dialog, select Drill to a Related Item in the Where do you want to drill to? select box and then choose the Category item. Next, click OK. As a result of the drilling operations you just performed, you can see the dollar amount for Maya Silver detailed by category, by day, by product.

You may be asking yourself if it's possible to change the order in which the Maya Silver record is detailed. Say you want to see it detailed in the following order: by day, by category, and finally by product. The answer is yes. On the left axis of the worksheet, drag the Orddate:Day item (the third from the left) to the second position within the same left axis, just before the Category item. As a result, you should see that the data on the worksheet has been rearranged accordingly.

Having just a few rows in the underlying tables, as we have here, is OK for demonstration purposes, since it results in compact screenshots. To see more meaningful figures on the worksheet, though, you might insert more rows into the underlying orderitems, orders, and products tables. Once you're done, you can click the Refresh button on the Discoverer toolbar to see an updated worksheet.

Select the File | Save menu option of Discoverer to save the worksheet discussed here.

Drilling up and down

As the name implies, drilling down is a technique you can use to move down a drill hierarchy to see data in more detail. Drilling up is the reverse operation, which you can use to move up a drill hierarchy to see consolidated data. But what is a drill hierarchy?

Working with drill hierarchies

A drill hierarchy represents a set of items related to each other according to the foreign key relationships in the underlying tables. If a worksheet item is associated with a drill hierarchy, you can look at that hierarchy by clicking the drill icon located to the left of the heading of the worksheet item. Suppose you want to look at the hierarchy associated with the Orddate item located at the top axis of our worksheet. To do this, click the Orddate drill icon. As a result, you should see a drill menu from which you can drill up from Orddate to Year, Quarter, or Month.

It's important to note that you may have more than one hierarchy associated with a worksheet item. In this case, you can move to the hierarchy you want to use through the All Hierarchies option on the drill menu.
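To make the idea of moving up and down a date hierarchy concrete, here is a minimal Python sketch that summarizes the same order lines at the Year, Quarter, Month, and Day levels mentioned above. It does not use Discoverer or its data model: the order dates and amounts are hypothetical, and LEVELS and summarize are illustrative names introduced only for this example.

```python
from collections import defaultdict
from datetime import date

# Hypothetical order lines: (order date, amount).
orders = [
    (date(2010, 1, 15), 100),
    (date(2010, 1, 20), 50),
    (date(2010, 4, 2), 70),
    (date(2011, 7, 9), 30),
]

# Each hierarchy level maps an order date to the key it is grouped under.
LEVELS = {
    "Year": lambda d: (d.year,),
    "Quarter": lambda d: (d.year, (d.month - 1) // 3 + 1),
    "Month": lambda d: (d.year, d.month),
    "Day": lambda d: (d.year, d.month, d.day),
}


def summarize(rows, level):
    """Total the amounts at the requested level of the date hierarchy."""
    key_of = LEVELS[level]
    totals = defaultdict(int)
    for order_date, amount in rows:
        totals[key_of(order_date)] += amount
    return dict(totals)


by_month = summarize(orders, "Month")      # a more detailed view
by_quarter = summarize(orders, "Quarter")  # one level up
by_year = summarize(orders, "Year")        # the most consolidated view

print(by_month)
print(by_quarter)
print(by_year)
```

Drilling down corresponds to re-running the summary at a finer level, and drilling up to re-running it at a coarser one; the underlying rows never change.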


Packt

06 Mar 2018

9 min read

Logistic Regression Using TensorFlow


In this article by PKS Prakash and Achyutuni Sri Krishna Rao, authors of R Deep Learning Cookbook, we will learn how to perform logistic regression using TensorFlow. In this recipe, we will cover the application of TensorFlow in setting up a logistic regression model. The example will use a similar dataset to that used in the H2O model setup.

What is TensorFlow

TensorFlow is another open source library developed by the Google Brain Team to build numerical computation models using data flow graphs. The core of TensorFlow was developed in C++ with a wrapper in Python. The tensorflow package in R gives you access to the TensorFlow API, composed of Python modules, to execute computation models. TensorFlow supports both CPU- and GPU-based computations. Because the tensorflow package in R calls the Python tensorflow API for execution, it is essential to install tensorflow in both R and Python to make R work. The following are the dependencies for tensorflow:

Python 2.7 / 3.x
R (>3.2)
devtools package in R for installing TensorFlow from GitHub
TensorFlow in Python
pip

Getting ready

The code for this section was created on Linux but can be run on any operating system. To start modeling, load the tensorflow package in the environment. R loads the default TensorFlow environment variable and also the NumPy library from Python in the np variable:

library("tensorflow") # Load TensorFlow
np <- import("numpy") # Load numpy library

How to do it...

The data is imported using a standard function from R, as shown in the following code. The data is read with the read.csv function and transformed into matrix format, followed by selecting the features used in the model, as defined in xFeatures and yFeatures. The next step in TensorFlow is to set up a graph to run optimization:

# Loading input and test data
xFeatures = c("Temperature", "Humidity", "Light", "CO2", "HumidityRatio")
yFeatures = "Occupancy"
occupancy_train <- as.matrix(read.csv("datatraining.txt",stringsAsFactors = T))
occupancy_test <- as.matrix(read.csv("datatest.txt",stringsAsFactors = T))
# subset features for modeling and transform to numeric values
occupancy_train<-apply(occupancy_train[, c(xFeatures, yFeatures)], 2, FUN=as.numeric)
occupancy_test<-apply(occupancy_test[, c(xFeatures, yFeatures)], 2, FUN=as.numeric)
# Data dimensions
nFeatures<-length(xFeatures)
nRow<-nrow(occupancy_train)

Before setting up the graph, let's reset the graph using the following command:

# Reset the graph
tf$reset_default_graph()

Additionally, let's start an interactive session, as it will allow us to evaluate variables without referring to the session object each time:

# Starting session as interactive session
sess<-tf$InteractiveSession()

Define the logistic regression model in TensorFlow:

# Setting-up Logistic regression graph
x <- tf$constant(unlist(occupancy_train[, xFeatures]), shape=c(nRow, nFeatures), dtype=np$float32)
W <- tf$Variable(tf$random_uniform(shape(nFeatures, 1L)))
b <- tf$Variable(tf$zeros(shape(1L)))
y <- tf$matmul(x, W) + b

The input feature x is defined as a constant, as it will be an input to the system. The weight W and bias b are defined as variables that will be optimized during the optimization process. The output y is set up as a symbolic expression of x, W, and b. The weight W is initialized from a random uniform distribution, and b is assigned the value zero.
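The graph above computes the logits y = xW + b, and the recipe goes on to minimize the sigmoid cross-entropy with gradient descent. As a rough, library-free sketch of what that optimization does, the following pure-Python code trains the same kind of model on a tiny hypothetical dataset. It is an illustration under stated assumptions, not the recipe's TensorFlow code: the toy data, the zero initialization of the weights, the iteration count, and the helper names (sigmoid, predict, cross_entropy, gradient_step) are all made up for this example; only the learning rate of 0.15 is borrowed from the recipe.

```python
import math


def sigmoid(z):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(row, weights, bias):
    """Probability of the positive class for one feature row."""
    logit = sum(w * x for w, x in zip(weights, row)) + bias
    return sigmoid(logit)


def cross_entropy(xs, ys, weights, bias):
    """Mean sigmoid cross-entropy over the dataset."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for row, y in zip(xs, ys):
        p = predict(row, weights, bias)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(xs)


def gradient_step(xs, ys, weights, bias, lr):
    """One full-batch gradient descent update of the weights and bias."""
    n = len(xs)
    grad_w = [0.0] * len(weights)
    grad_b = 0.0
    for row, y in zip(xs, ys):
        err = predict(row, weights, bias) - y  # d(loss)/d(logit)
        for j, x in enumerate(row):
            grad_w[j] += err * x / n
        grad_b += err / n
    new_weights = [w - lr * g for w, g in zip(weights, grad_w)]
    return new_weights, bias - lr * grad_b


# Hypothetical toy data: two features, binary label.
xs = [[0.0, 0.1], [0.2, 0.0], [0.9, 1.0], [1.0, 0.8]]
ys = [0, 0, 1, 1]

weights, bias = [0.0, 0.0], 0.0
initial_loss = cross_entropy(xs, ys, weights, bias)
for _ in range(1000):
    weights, bias = gradient_step(xs, ys, weights, bias, lr=0.15)
final_loss = cross_entropy(xs, ys, weights, bias)

print(initial_loss, final_loss)
```

With all-zero parameters every prediction is 0.5, so the starting loss is ln 2; each update then moves W and b in the direction that lowers the cross-entropy, which is the role GradientDescentOptimizer plays in the recipe.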
The next step is to set up the cost function for logistic regression:

# Setting-up cost function and optimizer
y_ <- tf$constant(unlist(occupancy_train[, yFeatures]), dtype="float32", shape=c(nRow, 1L))
cross_entropy<-tf$reduce_mean(tf$nn$sigmoid_cross_entropy_with_logits(labels=y_, logits=y, name="cross_entropy"))
optimizer <- tf$train$GradientDescentOptimizer(0.15)$minimize(cross_entropy)
# Start a session
init <- tf$global_variables_initializer()
sess$run(init)

Execute the gradient descent algorithm for the optimization of weights using cross entropy as the loss function:

# Running optimization
for (step in 1:5000) {
  sess$run(optimizer)
  if (step %% 20== 0)
    cat(step, "-", sess$run(W), sess$run(b), "==>", sess$run(cross_entropy), "\n")
}

How it works...

The performance of the model can be evaluated using AUC:

# Performance on Train
library(pROC)
ypred <- sess$run(tf$nn$sigmoid(tf$matmul(x, W) + b))
roc_obj <- roc(occupancy_train[, yFeatures], as.numeric(ypred))
# Performance on test
nRowt<-nrow(occupancy_test)
xt <- tf$constant(unlist(occupancy_test[, xFeatures]), shape=c(nRowt, nFeatures), dtype=np$float32)
ypredt <- sess$run(tf$nn$sigmoid(tf$matmul(xt, W) + b))
roc_objt <- roc(occupancy_test[, yFeatures], as.numeric(ypredt))

The ROC curves can be visualized using the plot.roc function from the pROC package, as shown in the following commands. The performance for training and testing (holdout) is very similar.

plot.roc(roc_obj, col = "green", lty=2, lwd=2)
plot.roc(roc_objt, add=T, col="red", lty=4, lwd=2)

Performance of logistic regression using TensorFlow

Visualizing TensorFlow graphs

TensorFlow graphs can be visualized using TensorBoard. It is a service that utilizes TensorFlow event files to visualize TensorFlow models as graphs. Graph model visualization in TensorBoard is also used to debug TensorFlow models.

Getting ready

TensorBoard can be started using the following command in the terminal:

$ tensorboard --logdir home/log --port 6006

The following are the major parameters for TensorBoard:

--logdir: The directory from which to load TensorFlow events
--debug: Increases log verbosity
--host: The host to listen on; localhost (127.0.0.1) by default
--port: The port on which TensorBoard will serve

The preceding command will launch the TensorBoard service on localhost at port 6006. The tabs in TensorBoard capture relevant data generated during graph execution.

How to do it...

This section covers how to visualize TensorFlow models and output in TensorBoard.

To visualize summaries and graphs, data from TensorFlow can be exported using the FileWriter command from the summary module. A default session graph can be added using the following command:

# Create Writer Obj for log
log_writer = tf$summary$FileWriter('c:/log', sess$graph)

Visualization of the logistic regression graph in TensorBoard

Similarly, other variable summaries can be added to TensorBoard using the appropriate summary functions, as shown in the following code:

# Adding histogram summary to weight and bias variable
w_hist = tf$summary$histogram("weights", W)
b_hist = tf$summary$histogram("biases", b)

Create a cross entropy evaluation for test. An example script to generate the cross entropy cost function for test and train is shown in the following commands:

# Set-up cross entropy for test
nRowt<-nrow(occupancy_test)
xt <- tf$constant(unlist(occupancy_test[, xFeatures]), shape=c(nRowt, nFeatures), dtype=np$float32)
# Raw logits for the test data; the loss below expects logits, not probabilities
yt <- tf$matmul(xt, W) + b
ypredt <- tf$nn$sigmoid(yt)
yt_ <- tf$constant(unlist(occupancy_test[, yFeatures]), dtype="float32", shape=c(nRowt, 1L))
cross_entropy_tst<-tf$reduce_mean(tf$nn$sigmoid_cross_entropy_with_logits(labels=yt_, logits=yt, name="cross_entropy_tst"))

Add summary variables to be collected:

# Add summary ops to collect data
w_hist = tf$summary$histogram("weights", W)
b_hist = tf$summary$histogram("biases", b)
crossEntropySummary<-tf$summary$scalar("costFunction", cross_entropy)
crossEntropyTstSummary<-tf$summary$scalar("costFunction_test", cross_entropy_tst)

Open the writing object, log_writer. It writes the default graph to the location c:/log:

# Create Writer Obj for log
log_writer = tf$summary$FileWriter('c:/log', sess$graph)

Run the optimization and collect the summaries:

for (step in 1:2500) {
  sess$run(optimizer)
  # Evaluate performance on training and test data after every 50 iterations
  if (step %% 50== 0){
    ### Performance on Train
    ypred <- sess$run(tf$nn$sigmoid(tf$matmul(x, W) + b))
    roc_obj <- roc(occupancy_train[, yFeatures], as.numeric(ypred))
    ### Performance on Test
    ypredt <- sess$run(tf$nn$sigmoid(tf$matmul(xt, W) + b))
    roc_objt <- roc(occupancy_test[, yFeatures], as.numeric(ypredt))
    cat("train AUC: ", auc(roc_obj), " Test AUC: ", auc(roc_objt), "\n")
    # Save summary of Bias and weights
    log_writer$add_summary(sess$run(b_hist), global_step=step)
    log_writer$add_summary(sess$run(w_hist), global_step=step)
    log_writer$add_summary(sess$run(crossEntropySummary), global_step=step)
    log_writer$add_summary(sess$run(crossEntropyTstSummary), global_step=step)
  }
}

Collect all the summaries into a single tensor using the merge_all command from the summary module:

summary = tf$summary$merge_all()

Write the summaries to the log file using the log_writer object:

log_writer = tf$summary$FileWriter('c:/log', sess$graph)
summary_str = sess$run(summary)
log_writer$add_summary(summary_str, step)
log_writer$close()

Summary

In this article, we have learned how to perform logistic regression using TensorFlow, and we have covered the application of TensorFlow in setting up a logistic regression model.
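The AUC values reported for the training and test sets in this recipe can be understood as a simple pairwise statistic: the fraction of (positive, negative) pairs in which the positive example receives the higher predicted score, with ties counted as half. The sketch below computes it by brute force in plain Python. It is an illustration only, not pROC's implementation: the auc helper and the sample labels and scores are hypothetical, and it assumes higher scores indicate the positive class, whereas pROC can choose the comparison direction automatically.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0   # correctly ranked pair
            elif pos_score == neg_score:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and predicted probabilities.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

example_auc = auc(labels, scores)
print(example_auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why similar AUC values on the training and holdout sets suggest the model is not overfitting.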


Chris Key

17 Oct 2017

3 min read

Top 5 Machine Learning Movies


Sitting in Mumbai airport at 2am can lead to some truly random conversations. Discussing the plot of Short Circuit 2 led us to thinking about this article. Here's my list of the top 5 movies featuring advanced machine learning.

Short Circuit 2 [imdb]

"Hey laser-lips, your momma was a snow blower!"

A plucky robot who has named himself Johnny 5 returns to the screens to help build toy robots in a big city. By this point he is considered to have actual intelligence rather than artificial intelligence; however, the plot of the film centres around his naivety and his inability to see the dark motives behind his new buddy, Oscar. We learn that intelligence can be applied anywhere, but sometimes it is the wrong place. Or the right one, if you like stealing car stereos for "Los Locos".

The Matrix Revolutions [imdb]

The robots learn to balance an equation. Bet you wish you had them in your high-school math class. Also kudos to the Wachowski brothers, who learnt from the machines the ability to balance the equation and released this monstrosity to even out the universe in light of the amazing first film in the trilogy.

Blade Runner [imdb]

“I've seen things you people wouldn't believe.”

In the ultimate example of machines (see footnote) learning to emulate humanity, we struggled for 30 years to understand whether Deckard was really human or a Nexus (spoilers: he is almost certainly a replicant!). It is interesting to note that when Pris and Roy are teamed up with JF Sebastian, their behaviours, aside from the occasional murder, show them to be more socially aware than their genius inventor friend.

Wall-E [imdb]

Disney and Pixar made a movie with no dialog for the entire first half, yet it was enthralling to watch. Without saying a single word, a small utility robot displays a full range of emotions that we can relate to. He also demonstrates other signs of life: he needs energy and rest, and his sense of purpose is divided between his prime directive of cleaning the planet and his passion for collecting interesting objects.

Terminator 2 [imdb]

“I know now why you cry, but it is something I can never do”

Sarah Connor tells us that “Watching John with the machine, it was suddenly so clear. The terminator, would never stop. It would never leave him, and it would never hurt him, never shout at him, or get drunk and hit him, or say it was too busy to spend time with him. It would always be there. And it would die, to protect him.” Yet John Connor teaches the deadly robot, played by the invincible ex-Governator Arnold Schwarzenegger, how to be normal in society. No Problemo. Gimme five. Hasta La Vista, baby.

Footnote: replicants aren't really machines. The replicants are genetically engineered and created by the Tyrell Corporation with limited lifespans and specific abilities. For all intents and purposes, they are really organic robots.


