28 Jun 2018

21 min read

Build and train an RNN chatbot using TensorFlow [Tutorial]

Chatbots are increasingly used as a way to provide assistance to users. Many companies, including banks, mobile/landline companies and large e-sellers, now use chatbots for customer assistance and for helping users with pre-sales and post-sales queries. They are a great tool for companies, which no longer need to provide additional customer service capacity for trivial questions: they really look like a win-win situation!

In today’s tutorial, we will learn how to train an automatic chatbot that is able to answer simple and generic questions, and how to create an endpoint over HTTP for providing the answers via an API.

This article is an excerpt from a book written by Luca Massaron, Alberto Boschetti, Alexey Grigorev, Abhishek Thakur, and Rajalingappaa Shanmugamani titled TensorFlow Deep Learning Projects.

There are mainly two types of chatbot. The first is a simple one, which tries to understand the topic and always provides the same answer for all questions about the same topic. For example, on a train website, the questions "Where can I find the timetable of the City_A to City_B service?" and "What's the next train departing from City_A?" will likely get the same answer, which could read "Hi! The timetable on our network is available on this page: <link>." These types of chatbots use classification algorithms to understand the topic (in the example, both questions are about the timetable topic). Given the topic, they always provide the same answer. Usually, they have a list of N topics and N answers; also, if the probability of the classified topic is low (the question is too vague, or it's on a topic not included in the list), they usually ask the user to be more specific and repeat the question, possibly pointing out other ways to ask it (sending an email or calling the customer service number, for example).

The second type of chatbot is more advanced and smarter, but also more complex.
For those, the answers are built using an RNN, in the same way that machine translation is performed. Those chatbots are able to provide more personalized and more specific replies. They don't just guess the topic: with an RNN engine, they're able to understand more about the user's questions and provide the best possible answer. In fact, it's very unlikely you'll get the same answer to two different questions using these types of chatbots.

The input corpus

Unfortunately, we haven't found any consumer-oriented dataset that is open source and freely available on the Internet. Therefore, we will train the chatbot with a more generic dataset, not really focused on customer service. Specifically, we will use the Cornell Movie Dialogs Corpus, from Cornell University. The corpus contains a collection of conversations extracted from raw movie scripts, so the chatbot will be better at answering fictional questions than real ones. The Cornell corpus contains more than 200,000 conversational exchanges between more than 10,000 pairs of movie characters, extracted from 617 movies.

The dataset is available here: https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html. We would like to thank the authors for having released the corpus: that makes experimentation, reproducibility and knowledge sharing easier.

The dataset comes as a .zip archive file. After decompressing it, you'll find several files in it:

README.txt contains the description of the dataset, the format of the corpora files, the details on the collection procedure and the author's contact.

Chameleons.pdf is the original paper for which the corpus has been released. Although the goal of the paper is not strictly about chatbots, it studies the language used in dialogues, and it's a good source of information for understanding more.

movie_conversations.txt contains all the dialogue structure. For each conversation, it includes the IDs of the two characters involved in the discussion, the ID of the movie and the list of sentence IDs (or utterances, to be more precise) in chronological order. For example, the first line of the file is:

    u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']

That means that user u0 had a conversation with user u2 in the movie m0 and the conversation had 4 utterances: 'L194', 'L195', 'L196' and 'L197'.

movie_lines.txt contains the actual text of each utterance ID and the person who produced it. For example, the utterance L195 is listed here as:

    L195 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ Well, I thought we'd start with pronunciation, if that's okay with you.

So, the text of the utterance L195 is "Well, I thought we'd start with pronunciation, if that's okay with you." and it was spoken by the character u2, whose name is CAMERON, in the movie m0.

movie_titles_metadata.txt contains information about the movies, including the title, year, IMDB rating, the number of votes in IMDB and the genres. For example, the movie m0 here is described as:

    m0 +++$+++ 10 things i hate about you +++$+++ 1999 +++$+++ 6.90 +++$+++ 62847 +++$+++ ['comedy', 'romance']

So, the title of the movie whose ID is m0 is 10 things i hate about you, it's from 1999, it's a comedy with romance and it received almost 63 thousand votes on IMDB with an average score of 6.9 (out of 10.0).

movie_characters_metadata.txt contains information about the movie characters, including the name, the title of the movie where he/she appears, the gender (if known) and the position in the credits (if known). For example, the character u2 appears in this file with this description:

    u2 +++$+++ CAMERON +++$+++ m0 +++$+++ 10 things i hate about you +++$+++ m +++$+++ 3

The character u2 is named CAMERON, he appears in the movie m0 whose title is 10 things i hate about you, his gender is male and he's the third person appearing in the credits.

raw_script_urls.txt contains the source URL where the dialogues of each movie can be retrieved. For example, for the movie m0, the entry is:

    m0 +++$+++ 10 things i hate about you +++$+++ http://www.dailyscript.com/scripts/10Things.html

As you will have noticed, most files use the token +++$+++ to separate the fields. Beyond that, the format looks pretty straightforward to parse. Please take particular care while parsing the files: their encoding is not UTF-8 but ISO-8859-1.

Creating the training dataset

Let's now create the training set for the chatbot. We need all the conversations between the characters in the correct order: fortunately, the corpus contains more than what we actually need. To create the dataset, we will start by downloading the zip archive, if it's not already on disk. We'll then decompress the archive in a temporary folder (if you're using Windows, that should be C:\Temp), and we will read just the movie_lines.txt and the movie_conversations.txt files, the ones we really need to create a dataset of consecutive utterances.

Let's now go step by step, creating multiple functions, one for each step, in the file corpora_downloader.py. The first function we need retrieves the file from the Internet, if it's not available on disk.

    import os.path
    import urllib.request
    import zipfile

    def download_and_decompress(url, storage_path, storage_dir):
        directory = storage_path + "/" + storage_dir
        zip_file = directory + ".zip"
        a_file = directory + "/cornell movie-dialogs corpus/README.txt"
        # Download and extract only if the README is not already on disk
        if not os.path.isfile(a_file):
            urllib.request.urlretrieve(url, zip_file)
            with zipfile.ZipFile(zip_file, "r") as zfh:
                zfh.extractall(directory)

This function does exactly that: it checks whether the README.txt file is available locally; if not, it downloads the archive (thanks to the urlretrieve function in the urllib.request module) and decompresses it (using the zipfile module).

The next step is to read the conversation file and extract the list of utterance IDs.
As a reminder, its format is:

    u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']

Therefore, what we're looking for is the fourth element of the list after we split the line on the token +++$+++. We also need to strip the square brackets and the apostrophes to get a clean list of IDs. For that, we import the re module, and the function looks like this:

    import re

    def read_conversations(storage_path, storage_dir):
        filename = storage_path + "/" + storage_dir + "/cornell movie-dialogs corpus/movie_conversations.txt"
        with open(filename, "r", encoding="ISO-8859-1") as fh:
            conversations_chunks = [line.split(" +++$+++ ") for line in fh]
        # Remove the square brackets and apostrophes, then split into IDs
        return [re.sub(r"[\[\]']", "", el[3].strip()).split(", ") for el in conversations_chunks]

As previously said, remember to read the file with the right encoding, otherwise you'll get an error. The output of this function is a list of lists, each of them containing the sequence of utterance IDs in a conversation between characters.

The next step is to read and parse the movie_lines.txt file, to extract the actual utterance texts. As a reminder, a line of the file looks like this:

    L195 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ Well, I thought we'd start with pronunciation, if that's okay with you.

Here, what we're looking for are the first and the last chunks.

    def read_lines(storage_path, storage_dir):
        filename = storage_path + "/" + storage_dir + "/cornell movie-dialogs corpus/movie_lines.txt"
        with open(filename, "r", encoding="ISO-8859-1") as fh:
            lines_chunks = [line.split(" +++$+++ ") for line in fh]
        # Map each utterance ID to its text
        return {line[0]: line[-1].strip() for line in lines_chunks}

The very last bit is about tokenization and alignment. We'd like to have a set whose observations are pairs of sequential utterances. In this way, we will train the chatbot, given the first utterance, to provide the next one. Hopefully, this will lead to a smart chatbot, able to reply to multiple questions.
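Before moving on to that last step, the two parsing steps above can be sanity-checked without downloading anything. The sketch below applies the same splitting and cleanup logic to the two sample lines quoted earlier. The helper names parse_conversation_line and parse_utterance_line are hypothetical and are not part of corpora_downloader.py.

```python
import re

SEPARATOR = " +++$+++ "

def parse_conversation_line(line):
    """Return the utterance IDs found in one movie_conversations.txt line."""
    chunks = line.split(SEPARATOR)
    # The fourth field looks like "['L194', 'L195']": drop brackets and quotes
    return re.sub(r"[\[\]']", "", chunks[3].strip()).split(", ")

def parse_utterance_line(line):
    """Return (utterance ID, text) for one movie_lines.txt line."""
    chunks = line.split(SEPARATOR)
    return chunks[0], chunks[-1].strip()

conversation_line = "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']\n"
utterance_line = ("L195 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ "
                  "Well, I thought we'd start with pronunciation, if that's okay with you.\n")

utterance_ids = parse_conversation_line(conversation_line)
utterance_id, utterance_text = parse_utterance_line(utterance_line)
print(utterance_ids)
print(utterance_id, "->", utterance_text)
```

Working on single lines like this makes it easy to see what the list comprehensions in read_conversations and read_lines produce for each row of the files.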
Here's the tokenization and alignment function:

    def get_tokenized_sequencial_sentences(list_of_lines, line_text):
        for line in list_of_lines:
            for i in range(len(line) - 1):
                # Pair each utterance with the one that follows it
                yield (line_text[line[i]].split(" "), line_text[line[i+1]].split(" "))

Its output is a generator yielding tuples of two utterances (the one on the right temporally follows the one on the left). Also, utterances are tokenized on the space character.

Finally, we can wrap everything up into a function, which downloads the file and unzips it (if not cached), parses the conversations and the lines, and formats the dataset as a pair of aligned tuples. As a default, we will store the files in the /tmp directory:

    def retrieve_cornell_corpora(storage_path="/tmp", storage_dir="cornell_movie_dialogs_corpus"):
        download_and_decompress("http://www.cs.cornell.edu/~cristian/data/cornell_movie_dialogs_corpus.zip",
                                storage_path, storage_dir)
        conversations = read_conversations(storage_path, storage_dir)
        lines = read_lines(storage_path, storage_dir)
        return tuple(zip(*list(get_tokenized_sequencial_sentences(conversations, lines))))

At this point, our training set looks very similar to the training set used in the translation project. We can, therefore, use some pieces of code we've developed in the machine learning translation article. For example, the corpora_tools.py file can be used here without any change (it also requires data_utils.py).

Given that file, we can dig further into the corpora, with a script to check the chatbot input. To inspect the corpora, we can use corpora_tools.py and the file we've previously created. Let's retrieve the Cornell Movie Dialogs Corpus, format the corpora and print an example and its length:

    from corpora_tools import *
    from corpora_downloader import retrieve_cornell_corpora

    sen_l1, sen_l2 = retrieve_cornell_corpora()
    print("# Two consecutive sentences in a conversation")
    print("Q:", sen_l1[0])
    print("A:", sen_l2[0])
    print("# Corpora length (i.e. number of sentences)")
    print(len(sen_l1))
    assert len(sen_l1) == len(sen_l2)

This code prints an example of two tokenized consecutive utterances, and the number of examples in the dataset, which is more than 220,000:

    # Two consecutive sentences in a conversation
    Q: ['Can', 'we', 'make', 'this', 'quick?', '', 'Roxanne', 'Korrine', 'and', 'Andrew', 'Barrett', 'are', 'having', 'an', 'incredibly', 'horrendous', 'public', 'break-', 'up', 'on', 'the', 'quad.', '', 'Again.']
    A: ['Well,', 'I', 'thought', "we'd", 'start', 'with', 'pronunciation,', 'if', "that's", 'okay', 'with', 'you.']
    # Corpora length (i.e. number of sentences)
    221616

Let's now clean the punctuation in the sentences, lowercase them and limit their size to 20 words maximum (that is, examples where at least one of the sentences is longer than 20 words are discarded). This is needed to standardize the tokens:

    clean_sen_l1 = [clean_sentence(s) for s in sen_l1]
    clean_sen_l2 = [clean_sentence(s) for s in sen_l2]
    filt_clean_sen_l1, filt_clean_sen_l2 = filter_sentence_length(clean_sen_l1, clean_sen_l2)
    print("# Filtered Corpora length (i.e. number of sentences)")
    print(len(filt_clean_sen_l1))
    assert len(filt_clean_sen_l1) == len(filt_clean_sen_l2)

This leaves us with roughly 140,000 examples:

    # Filtered Corpora length (i.e. number of sentences)
    140261

Then, let's create the dictionaries for the two sets of sentences. In practice, they should look the same (since the same sentence appears once on the left side and once on the right side), except for some differences introduced by the first and last sentences of a conversation (they appear only once).
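For intuition, the pairing and length-filtering steps described above can be reproduced on a toy conversation. This sketch does not use corpora_tools.py: pair_consecutive and filter_pairs_by_length are hypothetical stand-ins, the three utterances are made up, and the real clean_sentence and filter_sentence_length helpers may behave differently.

```python
def pair_consecutive(conversations, line_text):
    """Yield (question, answer) token lists for consecutive utterances."""
    for conversation in conversations:
        for first_id, second_id in zip(conversation, conversation[1:]):
            yield line_text[first_id].split(" "), line_text[second_id].split(" ")

def filter_pairs_by_length(pairs, max_len=20):
    """Keep only the pairs where both sides have at most max_len tokens."""
    return [(q, a) for q, a in pairs if len(q) <= max_len and len(a) <= max_len]

# Made-up data: one conversation with three utterances
line_text = {
    "L1": "Hi there",
    "L2": "Hello",
    "L3": "How are you doing today my friend",
}
conversations = [["L1", "L2", "L3"]]

pairs = list(pair_consecutive(conversations, line_text))
short_pairs = filter_pairs_by_length(pairs, max_len=3)
print(len(pairs), "pairs before filtering")
print(len(short_pairs), "pairs after filtering")
```

A conversation with three utterances yields two pairs: the middle utterance shows up once as an answer and once as a question, while the first and last appear only once, which is why the two sides of the dataset are nearly, but not exactly, the same.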
To make the best out of our corpora, let's build two dictionaries of words and then encode all the words in the corpora with their dictionary indexes:

    dict_l1 = create_indexed_dictionary(filt_clean_sen_l1, dict_size=15000, storage_path="/tmp/l1_dict.p")
    dict_l2 = create_indexed_dictionary(filt_clean_sen_l2, dict_size=15000, storage_path="/tmp/l2_dict.p")
    idx_sentences_l1 = sentences_to_indexes(filt_clean_sen_l1, dict_l1)
    idx_sentences_l2 = sentences_to_indexes(filt_clean_sen_l2, dict_l2)
    print("# Same sentences as before, with their dictionary ID")
    print("Q:", list(zip(filt_clean_sen_l1[0], idx_sentences_l1[0])))
    print("A:", list(zip(filt_clean_sen_l2[0], idx_sentences_l2[0])))

That prints the following output. We also notice that a dictionary of 15 thousand entries doesn't contain all the words: more than 16 thousand (less popular) words don't fit into it:

    [sentences_to_indexes] Did not find 16823 words
    [sentences_to_indexes] Did not find 16649 words
    # Same sentences as before, with their dictionary ID
    Q: [('well', 68), (',', 8), ('i', 9), ('thought', 141), ('we', 23), ("'", 5), ('d', 83), ('start', 370), ('with', 46), ('pronunciation', 3), (',', 8), ('if', 78), ('that', 18), ("'", 5), ('s', 12), ('okay', 92), ('with', 46), ('you', 7), ('.', 4)]
    A: [('not', 31), ('the', 10), ('hacking', 7309), ('and', 23), ('gagging', 8761), ('and', 23), ('spitting', 6354), ('part', 437), ('.', 4), ('please', 145), ('.', 4)]

As the final step, let's add paddings and markings to the sentences:

    # Maximum lengths, computed as in the build_dataset function of the training script
    max_length_l1 = extract_max_length(idx_sentences_l1)
    max_length_l2 = extract_max_length(idx_sentences_l2)
    data_set = prepare_sentences(idx_sentences_l1, idx_sentences_l2, max_length_l1, max_length_l2)
    print("# Prepared minibatch with paddings and extra stuff")
    print("Q:", data_set[0][0])
    print("A:", data_set[0][1])
    print("# The sentence pass from X to Y tokens")
    print("Q:", len(idx_sentences_l1[0]), "->", len(data_set[0][0]))
    print("A:", len(idx_sentences_l2[0]), "->", len(data_set[0][1]))

And that, as expected, prints:

    # Prepared minibatch with paddings and extra stuff
    Q: [0, 68, 8, 9, 141, 23, 5, 83, 370, 46, 3, 8, 78, 18, 5, 12, 92, 46, 7, 4]
    A: [1, 31, 10, 7309, 23, 8761, 23, 6354, 437, 4, 145, 4, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    # The sentence pass from X to Y tokens
    Q: 19 -> 20
    A: 11 -> 22

Training the chatbot

Now that we're done with the corpora, it's time to work on the model. This project again requires a sequence-to-sequence model, therefore we can use an RNN. What's more, we can reuse part of the code from the previous project: we just need to change how the dataset is built, and the parameters of the model. We can then copy the training script and modify the build_dataset function to use the Cornell dataset.

Mind that the dataset used in this article is bigger than the one used in the machine learning translation article, therefore you may need to limit the corpora to a few tens of thousands of lines. On a four-year-old laptop with 8GB of RAM, we had to select only the first 30 thousand lines, otherwise the program ran out of memory and kept swapping. As a side effect of having fewer examples, the dictionaries are smaller too, with fewer than 10 thousand words each.
    def build_dataset(use_stored_dictionary=False):
        sen_l1, sen_l2 = retrieve_cornell_corpora()
        # Keep only the first 30,000 pairs, otherwise it does not run on our laptop
        clean_sen_l1 = [clean_sentence(s) for s in sen_l1][:30000]
        clean_sen_l2 = [clean_sentence(s) for s in sen_l2][:30000]
        filt_clean_sen_l1, filt_clean_sen_l2 = filter_sentence_length(clean_sen_l1, clean_sen_l2, max_len=10)
        if not use_stored_dictionary:
            dict_l1 = create_indexed_dictionary(filt_clean_sen_l1, dict_size=10000, storage_path=path_l1_dict)
            dict_l2 = create_indexed_dictionary(filt_clean_sen_l2, dict_size=10000, storage_path=path_l2_dict)
        else:
            with open(path_l1_dict, "rb") as fh:
                dict_l1 = pickle.load(fh)
            with open(path_l2_dict, "rb") as fh:
                dict_l2 = pickle.load(fh)
        dict_l1_length = len(dict_l1)
        dict_l2_length = len(dict_l2)
        idx_sentences_l1 = sentences_to_indexes(filt_clean_sen_l1, dict_l1)
        idx_sentences_l2 = sentences_to_indexes(filt_clean_sen_l2, dict_l2)
        max_length_l1 = extract_max_length(idx_sentences_l1)
        max_length_l2 = extract_max_length(idx_sentences_l2)
        data_set = prepare_sentences(idx_sentences_l1, idx_sentences_l2, max_length_l1, max_length_l2)
        return (filt_clean_sen_l1, filt_clean_sen_l2), data_set, (max_length_l1, max_length_l2), (dict_l1_length, dict_l2_length)

By inserting this function into the train_translator.py file and renaming the file to train_chatbot.py, we can run the training of the chatbot. After a few iterations, you can stop the program and you'll see something similar to this output:

    [sentences_to_indexes] Did not find 0 words
    [sentences_to_indexes] Did not find 0 words
    global step 100 learning rate 1.0 step-time 7.708967611789704 perplexity 444.90090078460474
    eval: perplexity 57.442316329639176
    global step 200 learning rate 0.990234375 step-time 7.700247814655302 perplexity 48.8545568311572
    eval: perplexity 42.190180314697045
    global step 300 learning rate 0.98046875 step-time 7.69800933599472 perplexity 41.620538109894945
    eval: perplexity 31.291903031786116
    ...
    ...
    ...
    global step 2400 learning rate 0.79833984375 step-time 7.686293318271639 perplexity 3.7086356605442767
    eval: perplexity 2.8348589631663046
    global step 2500 learning rate 0.79052734375 step-time 7.689657487869262 perplexity 3.211876894960698
    eval: perplexity 2.973809378544393
    global step 2600 learning rate 0.78271484375 step-time 7.690396382808681 perplexity 2.878854805600354
    eval: perplexity 2.563583924617356

Again, if you change the settings, you may end up with a different perplexity. To obtain these results, we set the RNN size to 256, the number of layers to 2, the batch size to 128 samples, and the learning rate to 1.0.

At this point, the chatbot is ready to be tested. Although you can test the chatbot with the same code as in test_translator.py, here we would like to build a more elaborate solution, which exposes the chatbot as a service with an API.

Chatbot API

First of all, we need a web framework to expose the API. In this project, we've chosen Bottle, a lightweight and simple framework that is very easy to use.

To install the package, run pip install bottle from the command line. To gather further information and dig into the code, take a look at the project webpage, https://bottlepy.org.

Let's now create a function to parse an arbitrary sentence provided by the user as an argument. All the following code should live in the test_chatbot_aas.py file.
Let's start with some imports and the function to clean, tokenize and prepare the sentence using the dictionary:

    import pickle
    import sys
    import numpy as np
    import tensorflow as tf
    import data_utils
    from corpora_tools import clean_sentence, sentences_to_indexes, prepare_sentences
    from train_chatbot import get_seq2seq_model, path_l1_dict, path_l2_dict

    model_dir = "/home/abc/chat/chatbot_model"

    def prepare_sentence(sentence, dict_l1, max_length):
        sents = [sentence.split(" ")]
        clean_sen_l1 = [clean_sentence(s) for s in sents]
        idx_sentences_l1 = sentences_to_indexes(clean_sen_l1, dict_l1)
        # No answer is available at prediction time, hence the empty second sentence
        data_set = prepare_sentences(idx_sentences_l1, [[]], max_length, max_length)
        sentences = (clean_sen_l1, [[]])
        return sentences, data_set

The function prepare_sentence does the following:

- Tokenizes the input sentence
- Cleans it (lowercase and punctuation cleanup)
- Converts tokens to dictionary IDs
- Adds markers and paddings to reach the default length

Next, we will need a function to convert the predicted sequence of numbers to an actual sentence composed of words. This is done by the function decode, which runs the prediction given the input sentence and with softmax predicts the most likely output.
Finally, it returns the sentence without paddings and markers:

    def decode(data_set):
        with tf.Session() as sess:
            model = get_seq2seq_model(sess, True, dict_lengths, max_sentence_lengths, model_dir)
            model.batch_size = 1
            bucket = 0
            encoder_inputs, decoder_inputs, target_weights = model.get_batch(
                {bucket: [(data_set[0][0], [])]}, bucket)
            _, _, output_logits = model.step(sess, encoder_inputs, decoder_inputs,
                                             target_weights, bucket, True)
            # Pick the most likely token at each output position
            outputs = [int(np.argmax(logit, axis=1)) for logit in output_logits]
            # Cut the sequence at the end-of-sentence marker
            if data_utils.EOS_ID in outputs:
                outputs = outputs[1:outputs.index(data_utils.EOS_ID)]
        tf.reset_default_graph()
        return " ".join([tf.compat.as_str(inv_dict_l2[output]) for output in outputs])

Finally, the main function, that is, the function to run in the script:

    if __name__ == "__main__":
        dict_l1 = pickle.load(open(path_l1_dict, "rb"))
        dict_l1_length = len(dict_l1)
        dict_l2 = pickle.load(open(path_l2_dict, "rb"))
        dict_l2_length = len(dict_l2)
        inv_dict_l2 = {v: k for k, v in dict_l2.items()}
        max_lengths = 10
        dict_lengths = (dict_l1_length, dict_l2_length)
        max_sentence_lengths = (max_lengths, max_lengths)

        from bottle import route, run, request

        @route('/api')
        def api():
            in_sentence = request.query.sentence
            _, data_set = prepare_sentence(in_sentence, dict_l1, max_lengths)
            resp = [{"in": in_sentence, "out": decode(data_set)}]
            return dict(data=resp)

        run(host='127.0.0.1', port=8080, reloader=True, debug=True)

Initially, it loads the dictionaries and prepares the inverse dictionary. Then, it uses the Bottle API to create an HTTP GET endpoint (under the /api URL). The route decorator registers the function to run when the endpoint is contacted via HTTP GET. In this case, the api() function is run, which first reads the sentence passed as an HTTP parameter, then calls the prepare_sentence function, described above, and finally runs the decoding step. What's returned is a dictionary containing both the input sentence provided by the user and the reply of the chatbot.
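Stepping back to the decoding step for a moment: the core of decode is a greedy choice of the highest-scoring token at each output position, followed by trimming at the end-of-sentence marker. The sketch below reproduces only that logic in plain Python, without TensorFlow. The function greedy_decode, the toy scores and the tiny vocabulary are hypothetical, and EOS_ID = 2 is an assumption consistent with the padded output shown earlier. Unlike decode, this sketch keeps the first predicted token.

```python
EOS_ID = 2  # assumed end-of-sentence marker ID

def greedy_decode(step_scores, inv_dict, eos_id=EOS_ID):
    """Pick the best-scoring token ID per step, cut at EOS, and join the words."""
    token_ids = [max(range(len(scores)), key=scores.__getitem__)
                 for scores in step_scores]
    if eos_id in token_ids:
        token_ids = token_ids[:token_ids.index(eos_id)]
    return " ".join(inv_dict[token_id] for token_id in token_ids)

# Toy scores for four output positions over a six-entry vocabulary
toy_scores = [
    [0.1, 0.0, 0.0, 0.0, 0.9, 0.0],  # best ID: 4
    [0.0, 0.0, 0.1, 0.0, 0.2, 0.7],  # best ID: 5
    [0.0, 0.0, 0.8, 0.0, 0.1, 0.1],  # best ID: 2 (EOS)
    [0.9, 0.0, 0.0, 0.0, 0.1, 0.0],  # best ID: 0, ignored after EOS
]
toy_inverse_dictionary = {4: "yes", 5: "."}

reply = greedy_decode(toy_scores, toy_inverse_dictionary)
print(reply)
```

Greedy selection is the simplest decoding strategy; it always takes the locally best token and never revisits earlier choices.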
Finally, the webserver is turned on, on the localhost at port 8080. Isn't it very easy to have a chatbot as a service with Bottle?

It's now time to run it and check the outputs. To run it, run from the command line:

    $> python3 -u test_chatbot_aas.py

Then, let's start querying the chatbot with some generic questions. To do so, we can use curl, a simple command-line tool; any browser works too, just remember that the URL should be encoded: for example, the space character should be replaced with its encoding, that is, %20. Curl makes things easier, having a simple way to encode the URL request. Here are a few examples:

    $> curl -X GET -G http://127.0.0.1:8080/api --data-urlencode "sentence=where are you?"
    {"data": [{"out": "i ' m here with you .", "in": "where are you?"}]}

    $> curl -X GET -G http://127.0.0.1:8080/api --data-urlencode "sentence=are you here?"
    {"data": [{"out": "yes .", "in": "are you here?"}]}

    $> curl -X GET -G http://127.0.0.1:8080/api --data-urlencode "sentence=are you a chatbot?"
    {"data": [{"out": "you ' for the stuff to be right .", "in": "are you a chatbot?"}]}

    $> curl -X GET -G http://127.0.0.1:8080/api --data-urlencode "sentence=what is your name ?"
    {"data": [{"out": "we don ' t know .", "in": "what is your name ?"}]}

    $> curl -X GET -G http://127.0.0.1:8080/api --data-urlencode "sentence=how are you?"
    {"data": [{"out": "that ' s okay .", "in": "how are you?"}]}

If the system doesn't work with your browser, try encoding the URL, for example:

    $> curl -X GET http://127.0.0.1:8080/api?sentence=how%20are%20you?
    {"data": [{"out": "that ' s okay .", "in": "how are you?"}]}

Replies are quite funny; always remember that we trained the chatbot on movies, therefore the replies follow that style. To turn off the webserver, use Ctrl + C.

To summarize, we've learned to implement a chatbot, which is able to respond to questions through an HTTP endpoint and a GET API.
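The endpoint can also be queried from Python. The sketch below builds the percent-encoded URL with the standard library and, optionally, sends the request. The helper names build_query_url and ask_chatbot are hypothetical, and ask_chatbot assumes the server from test_chatbot_aas.py is running locally on port 8080.

```python
import json
import urllib.request
from urllib.parse import quote, urlencode

def build_query_url(sentence, base_url="http://127.0.0.1:8080/api"):
    """Return the GET URL with the sentence percent-encoded (space -> %20)."""
    return base_url + "?" + urlencode({"sentence": sentence}, quote_via=quote)

def ask_chatbot(sentence):
    """Send the sentence to the running server and return the chatbot reply."""
    with urllib.request.urlopen(build_query_url(sentence)) as response:
        payload = json.loads(response.read().decode("utf-8"))
    return payload["data"][0]["out"]

query_url = build_query_url("how are you?")
print(query_url)
# With the server running: print(ask_chatbot("how are you?"))
```

Passing quote_via=quote makes urlencode emit %20 for spaces, matching the encoding mentioned above; the default would emit + instead.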
To learn more about how to design deep learning systems for a variety of real-world scenarios using TensorFlow, do check out the book TensorFlow Deep Learning Projects.

Facebook’s Wit.ai: Why we need yet another chatbot development framework?
How to build a chatbot with Microsoft Bot framework
Top 4 chatbot development frameworks for developers

Bhagyashree R

06 Sep 2019

6 min read

Google is circumventing GDPR, reveals Brave's investigation for the Authorized Buyers ad business case

Last year, Dr. Johnny Ryan, the Chief Policy & Industry Relations Officer at Brave, filed a complaint against Google’s DoubleClick/Authorized Buyers ad business with the Irish Data Protection Commission (DPC). New evidence produced by Brave reveals that Google is circumventing GDPR and also undermining its own data protection measures.

Brave calls Google’s Push Pages a GDPR workaround

Brave’s new evidence rebuts some of Google’s claims regarding its DoubleClick/Authorized Buyers system, the world’s largest real-time advertising auction house. Google says that it prohibits companies that use its real-time bidding (RTB) ad system “from joining data they receive from the Cookie Matching Service.” In September last year, Google announced that it has removed encrypted cookie IDs and list names from bid requests with buyers in its Authorized Buyers marketplace.

Brave’s research, however, found otherwise: “Brave’s new evidence reveals that Google allowed not only one additional party, but many, to match with Google identifiers. The evidence further reveals that Google allowed multiple parties to match their identifiers for the data subject with each other.”

When you visit a website that has Google ads embedded on its web pages, Google will run a real-time bidding ad auction to determine which advertiser will get to display its ads. For this, it uses Push Pages, which is the mechanism in question here.

Brave hired Zach Edwards, the co-founder of digital analytics startup Victory Medium, and MetaX, a company that audits data supply chains, to investigate and analyze a log of Dr. Ryan’s web browsing. The research revealed that Google's Push Pages can essentially be used as a workaround for user IDs. Google shares a ‘google_push’ identifier with the participating companies to identify a user. Brave says that the problem here is that the identifier that was shared was common to multiple companies. This means that these companies could have cross-referenced what they learned about the user from Google with each other.

Used by more than 8.4 million websites, Google's DoubleClick/Authorized Buyers broadcasts personal data of users to 2000+ companies. This data includes the category of what a user is reading, which can reveal their political views, sexual orientation, religious beliefs, as well as their locations. There are also unique ID codes that are specific to a user that can let companies uniquely identify a user. All this information can give these companies a way to keep tabs on what users are “reading, watching, and listening to online.”

Brave calls Google’s RTB data protection policies “weak” as they ask these companies to self-regulate. Google does not have much control over what these companies do with the data once broadcast. “Its policy requires only that the thousands of companies that Google shares peoples’ sensitive data with monitor their own compliance, and judge for themselves what they should do,” Brave wrote.

A Google spokesperson, as a response to this news, told Forbes, “We do not serve personalised ads or send bid requests to bidders without user consent. The Irish DPC - as Google's lead DPA - and the UK ICO are already looking into real-time bidding in order to assess its compliance with GDPR. We welcome that work and are co-operating in full.”

Users recommend starting an “information campaign” instead of a penalty that will hardly affect the big tech

This news triggered a discussion on Hacker News where users talked about the implications of RTB and what strict actions the EU can take to protect user privacy. A user explained, "So, let's say you're an online retailer, and you have Google IDs for your customers. You probably have some useful and sensitive customer information, like names, emails, addresses, and purchase histories. In order to better target your ads, you could participate in one of these exchanges, so that you can use the information you receive to suggest products that are as relevant as possible to each customer. To participate, you send all this sensitive information, along with a Google ID, and receive similar information from other retailers, online services, video games, banks, credit card providers, insurers, mortgage brokers, service providers, and more! And now you know what sort of vehicles your customers drive, how much they make, whether they're married, how many kids they have, which websites they browse, etc. So useful! And not only do you get all these juicy private details, but you've also shared your customers sensitive purchase history with anyone else who is connected to the exchange."

Others said that a penalty is not going to deter Google. "The whole penalty system is quite silly. The fines destroy small companies who are the ones struggling to comply, and do little more than offer extremely gentle pokes on the wrist for megacorps that have relatively unlimited resources available for complete compliance, if they actually wanted to comply."

Users suggested that the EU should instead start an information campaign. "EU should ignore the fines this time and start an "information campaign" regarding behavior of Google and others. I bet that hurts Google 10 times more."

Some also said that not just Google but the RTB participants should also be held responsible. "Because what Google is doing is not dissimilar to how any other RTB participant is acting, saying this is a Google workaround seems disingenuous."

With this case, Brave has launched a full-fledged campaign to reform the multi-billion dollar RTB industry, a campaign that spans sixteen EU countries. To achieve this goal, it has collaborated with several privacy NGOs and academics, including the Open Rights Group and Dr. Michael Veale of the Turing Institute, among others.

In other news, a Bloomberg report reveals that Google and other internet companies have recently asked for an amendment to the California Consumer Privacy Act, which will be enacted in 2020. The law currently limits how digital advertising companies collect and make money from user data. The amendments proposed include approval for collecting user data for targeted advertising, using the collected data from websites for their own analysis, and many others. Read the Bloomberg report to know more in detail.

Other news in Data

Facebook content moderators work in filthy, stressful conditions and experience emotional trauma daily, reports The Verge
GDPR complaint in EU claim billions of personal data leaked via online advertising bids
European Union fined Google 1.49 billion euros for antitrust violations in online advertising



Amey Varangaonkar

26 Dec 2017

10 min read

How to effectively clean social media data for analysis


This article is a book extract from Python Social Media Analytics, written by Siddhartha Chatterjee and Michal Krystyanczuk.
Data cleaning and preprocessing is an essential - and often crucial - part of any analytical process. In this excerpt, we explain the different techniques and mechanisms for preparing your social media data for effective analysis.
Social media contains different types of data: information about user profiles, statistics (number of likes or number of followers), verbatims, and other media content. Quantitative data is very convenient to analyze using statistical and numerical methods, but unstructured data such as user comments is much more challenging. To get meaningful information, one has to perform the whole process of information retrieval. It starts with the definition of the data type and data structure. On social media, unstructured data takes the form of text, images, videos, and sound; we will mostly deal with textual data. Then, the data has to be cleaned and normalized. Only after all these steps can we delve into the analysis.
Social media data type and encoding
Comments and conversations are textual data that we retrieve as strings. In brief, a string is a sequence of characters represented by code points. In Python 3, every string is a sequence of Unicode code points, which are numbers from 0 through 0x10FFFF (1,114,111 decimal). To be stored in memory, that sequence has to be represented as a sequence of bytes (values from 0 to 255). The rules for translating a Unicode string into a sequence of bytes are called an encoding. Encoding plays a very important role in natural language processing because people increasingly use characters such as emojis or emoticons, which replace whole words and express emotions. Moreover, many languages use accented characters that go beyond the regular English alphabet.
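As a rough illustration of the relationship between code points and bytes described above, the following sketch uses only Python 3 built-ins (`ord`, `str.encode`, `bytes.decode`) and the standard `unicodedata` module. The sample strings are made up for this example and do not come from the book.

```python
import unicodedata

# "café" followed by a space and a grinning-face emoji (written with escapes)
text = "caf\u00e9 \U0001F600"

# Each character is one Unicode code point (an integer up to 0x10FFFF).
code_points = [hex(ord(ch)) for ch in text]

# Encoding turns code points into bytes; UTF-8 uses 1 to 4 bytes per code point.
utf8_bytes = text.encode("utf-8")   # 6 characters become 10 bytes

# Decoding with the same encoding restores the original string.
round_trip = utf8_bytes.decode("utf-8")

# The same visible text can be built from different code points,
# so normalize before comparing strings.
composed = "caf\u00e9"      # 'é' as a single code point
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent
same_before = composed == decomposed
same_after = (unicodedata.normalize("NFC", composed)
              == unicodedata.normalize("NFC", decomposed))
```

Here `same_before` is `False` and `same_after` is `True`, which is one reason to apply a single normalization form, as well as a single encoding, to every input.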
To deal with the processing problems that such characters can cause, we have to use the right encoding, because comparing two strings with different encodings is like comparing apples and oranges. The most common encoding is UTF-8, the default in Python 3, which can represent any Unicode character. As a rule of thumb, always normalize your data to UTF-8.
Structure of social media data
Another question we'll encounter is: what is the right structure for our data? The most natural choice is a list, which can store a sequence of data points (verbatims, numbers, and so on). However, lists are not efficient on large datasets, and they constrain us to sequential processing of the data. A much better solution is to store the data in a tabular format in a pandas DataFrame, which has multiple advantages for further processing. First of all, rows are indexed, so search operations become much faster. There are also many optimized methods for different kinds of processing and, above all, it allows you to optimize your own processing by using functional programming. Moreover, a row can contain multiple fields with metadata about verbatims, which are very often used in our analysis. It is worth remembering that a dataset in pandas must fit into RAM. For bigger datasets, we suggest the use of SFrames.
Pre-processing and text normalization
Preprocessing is one of the most important parts of the analysis process. It reformats the unstructured data into a uniform, standardized form. The characters, words, and sentences identified at this stage are the fundamental units passed to all further processing stages. The quality of the preprocessing has a big impact on the final result of the whole process.
There are several stages in the process: from simple text cleaning, such as removing white space, punctuation, HTML tags, and special characters, up to more sophisticated normalization techniques such as tokenization, stemming, or lemmatization. In general, the main aim is to keep all the characters and words that are important for the analysis and, at the same time, get rid of all others, while maintaining the text corpus in one uniform format.
We import all necessary libraries:
import re, itertools
import nltk
from nltk.corpus import stopwords
When dealing with raw text, we usually have a set of words including many details we are not interested in, such as whitespace, line breaks, and blank lines. Moreover, many words contain capital letters, so programs treat, for example, "go" and "Go" as two different words. To handle such distinctions, we clean the text and convert all words to lowercase with the following steps:
1. Perform basic text mining cleaning.
2. Remove leading and trailing whitespace:
verbatim = verbatim.strip()
Many text processing tasks can be done via pattern matching. We can find words containing a character and replace it with another one or just remove it. Regular expressions give us a powerful and flexible method for describing the character patterns we are interested in. They are commonly used for cleaning punctuation, HTML tags, and URLs.
3. Remove punctuation:
verbatim = re.sub(r'[^\w\s]', '', verbatim)
4. Remove HTML tags:
verbatim = re.sub(r'<[^<]+?>', '', verbatim)
5. Remove URLs:
verbatim = re.sub(r'^https?://.*[\r\n]*', '', verbatim, flags=re.MULTILINE)
If you chain these steps, run steps 4 and 5 before step 3: removing punctuation first deletes the <, >, : and / characters that the tag and URL patterns need.
Depending on the quality of the text corpus, sometimes there is a need to implement some corrections. This applies to text sources such as Twitter or forums, where emotions can play a role and comments contain words with repeated letters, for example, "happpppy" instead of "happy".
6. Standardize words (collapse repeated letters):
verbatim = ''.join(''.join(s)[:2] for _, s in itertools.groupby(verbatim))
After removal of punctuation or white space, words can end up attached to each other. This happens especially when deleting the periods at the end of sentences. The corpus might look like: "the brown dog is lostEverybody is looking for him". So there is a need to split "lostEverybody" into two separate words.
7. Split attached words (insert a space wherever a lowercase letter is directly followed by an uppercase one):
verbatim = re.sub(r'([a-z])([A-Z])', r'\1 \2', verbatim)
Stop words are basically a set of commonly used words in any language: mainly determiners, prepositions, and coordinating conjunctions. By removing the words that are very commonly used in a given language, we can focus only on the important words instead, and improve the accuracy of the text processing.
8. Convert text to lowercase with lower():
verbatim = verbatim.lower()
9. Stop word removal:
verbatim = ' '.join([word for word in verbatim.split() if word not in (stopwords.words('english'))])
10. Stemming and lemmatization: The main aim of stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. Stemming reduces word forms to so-called stems, whereas lemmatization reduces word forms to linguistically valid lemmas. For example, stemming maps cars -> car, while lemmatization also handles irregular forms such as men -> man and went -> go. Such text processing can give added value in some domains, and may improve the accuracy of practical information extraction tasks.
11. Tokenization: Tokenization is the process of breaking a text corpus up into words (most commonly), phrases, or other meaningful elements, which are then called tokens. The tokens become the basic units for further text processing.
tokens = nltk.word_tokenize(verbatim)
Other techniques are spelling correction, domain knowledge, and grammar checking.
Duplicate removal
Depending on the data source, we might notice multiple duplicates in our dataset.
The decision to remove duplicates should be based on an understanding of the domain. In most cases, duplicates come from errors in the data collection process, and it is recommended to remove them in order to reduce bias in our analysis:
df = df.drop_duplicates(subset=['column_name'])
Knowing basic text cleaning techniques, we can now learn how to store the data in an efficient way. For this purpose, we will explain how to use one of the most convenient NoSQL databases: MongoDB.
Capture: Once you have made a connection to your API, you need to make a request and receive the data at your end. This step requires you to go through the data to be able to understand it. Often the data is received in a format called JavaScript Object Notation (JSON). JSON was created to enable lightweight data interchange between programs. It plays a role similar to the older XML format and is built from key-value pairs.
Normalization: The data received from platforms is not in an ideal format for analysis. With textual data there are many different approaches to normalization. One can be stripping the whitespace surrounding verbatims, or converting all verbatims to lowercase, or changing the encoding to UTF-8. The point is that if we do not maintain a standard protocol for normalization, we will introduce many unintended errors. The goal of normalization is to transform all your data in a consistent manner. It is recommended that you create wrapper functions for your normalization techniques, and then apply these wrappers at all your data input points so as to ensure that all the data in your analysis goes through exactly the same normalization process.
In general, one should always perform the following cleaning steps:
1. Normalize the textual content. Normalization generally contains at least the following steps: stripping surrounding whitespace, lowercasing the verbatim, and applying a universal encoding (UTF-8).
2. Remove special characters (for example, punctuation).
3. Remove stop words: Irrespective of the language, stop words add no informative value to the analysis, except in the case of deep parsing, where stop words can be bridge connectors between targeted words.
4. Split attached words.
5. Remove URLs and hyperlinks: URLs and hyperlinks can be studied separately, but due to their lack of grammatical structure they are by convention removed from verbatims.
6. Slang lookups: This is a relatively difficult task, because it requires a predefined vocabulary of slang words and their proper reference words, for example: luv maps to love. Such dictionaries are available on the open web, but there is always a risk of them being outdated.
In the case of studying words and not phrases (or n-grams), it is very important to do the following:
Tokenize the verbatim.
Stemming and lemmatization (optional): This applies where different written forms of the same word do not hold additional meaning for your study.
Some advanced cleaning procedures are:
Grammar checking: Grammar checking is mostly learning-based: models are trained on a huge amount of well-formed text and then used for grammar correction. There are many online tools available for grammar correction. This is a very tricky cleaning technique because language style and structure can change from source to source (for example, language on Twitter will not correspond with the language of published books). Wrongly correcting grammar can have negative effects on the analysis.
Spelling correction: Natural language text often contains misspellings. Companies such as Google and Microsoft have achieved a decent accuracy level in automated spell correction. One can use techniques such as Levenshtein distance or dictionary lookup, or other modules and packages, to fix these errors.
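To make the Levenshtein distance mentioned above concrete, here is a minimal pure-Python sketch. The function names `levenshtein` and `suggest`, the `max_distance` parameter, and the tiny vocabulary in the usage line are hypothetical and chosen only for illustration; a real spell checker would need a far larger dictionary and word frequency information.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def suggest(word, vocabulary, max_distance=2):
    """Return the closest vocabulary word, or the word unchanged when
    nothing is within max_distance edits (hypothetical helper)."""
    best = min(vocabulary, key=lambda candidate: levenshtein(word, candidate))
    return best if levenshtein(word, best) <= max_distance else word


corrected = suggest("hapy", ["happy", "love", "great"])
```

The `max_distance` cut-off is one simple way to leave a word alone when no dictionary entry is close enough to it.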
Again, take spell correction with a grain of salt, because false positives can affect the results.
Storing: Once the data is received, normalized, and/or cleaned, we need to store it in an efficient database. In this book we have chosen MongoDB, as it's a modern and scalable database that is also relatively easy to get started with. However, other databases such as Cassandra or HBase could also be used, depending on expertise and objectives.
Data cleaning and preprocessing, although tedious, can simplify your data analysis work. With effective Python packages such as NumPy, SciPy, and pandas, these tasks become much easier and save you a lot of time. If you found this piece of information useful, make sure to check out our book Python Social Media Analytics, which will help you draw actionable insights from mining social media portals such as GitHub, Twitter, YouTube, and more!
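The recommendation to wrap the normalization steps in a single function could be sketched as follows, using only the standard library. The function name `normalize_verbatim` and the small `STOP_WORDS` set are assumptions made for this example (the article itself uses NLTK's English stop word list), and the URL pattern is a simplified one. In this sketch, tags and URLs are removed before punctuation, since stripping punctuation first would delete the `<`, `>`, and `://` characters those patterns rely on.

```python
import itertools
import re

# Tiny illustrative stop word set (an assumption); use a full list in practice.
STOP_WORDS = {"a", "an", "and", "for", "is", "the", "to"}


def normalize_verbatim(verbatim):
    """Apply the cleaning steps in one fixed order and return a list of tokens."""
    text = verbatim.strip()                            # surrounding whitespace
    text = re.sub(r"<[^<]+?>", "", text)               # HTML tags
    text = re.sub(r"https?://\S+", "", text)           # URLs (simplified pattern)
    text = re.sub(r"[^\w\s]", "", text)                # punctuation
    text = re.sub(r"([a-z])([A-Z])", r"\1 \2", text)   # split attached words
    text = text.lower()
    # Truncate every run of identical characters to at most two
    text = "".join("".join(group)[:2] for _, group in itertools.groupby(text))
    # Whitespace tokenization followed by stop word removal
    return [word for word in text.split() if word not in STOP_WORDS]


tokens = normalize_verbatim(
    "  <b>The brown dog is lost.</b>Everybody is looking for him!! https://example.com/x  "
)
```

Calling the same wrapper at every data input point is what keeps the normalization consistent; the order of the steps inside it is a design choice, not something the book prescribes.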



Natasha Mathur

23 Nov 2018

10 min read

Recode Decode #GoogleWalkout interview shows why data and evidence don’t always lead to right decisions in even the world’s most data-driven company


Earlier this month, 20,000 Google employees along with temps, Vendors, and Contractors walked out of their respective Google offices to protest against the discrimination, racism, and sexual harassment that they encountered at Google’s workplace. As a part of the walkout, Google employees had laid out five demands urging Google to bring about structural changes within the workplace. In the latest episode of Recode Decode with Kara Swisher, yesterday, six of the Google walkout organizers, namely, Erica Anderson, Claire Stapleton, Meredith Whittaker, Stephanie Parker, Cecilia O’Neil-Hart and Amr Gaber spoke out about Google’s dismissive approach towards the five demands laid out by the Google employees. A day after the Walkout, Google addressed these demands in a note written by Sundar Pichai, where he admitted that they have “not always gotten everything right in the past” and are “sincerely sorry”. Pichai also mentioned that “It’s clear that to live up to the high bar we set for Google, we need to make some changes. Going forward, we will provide more transparency into how you raise concerns and how we handle them”. The 'walkout for real change' was a response to the New York Times report, published last month, that exposed how Google has protected its senior executives (Andy Rubin, Android Founder being one of them) that had been accused of sexual misconduct in the recent past. We’ll now have a look at the major highlights from the podcast. Key Takeaways The podcast talks about how the organizers formulated their demands, the rights of contractors at Google, post walkout town hall meeting, and what steps will be taken next by the Google employees. How the walkout mobilized collective action and the formulation of demands As per the Google employees, collating demands was a collective effort from the very beginning. They were inspired by stories of sexual harassment at Google that were floating around in an internal email chain. 
This urged the organizers of the walkout to send out an email to a large group of women stating that they need to do something about it, to which a lot of employees suggested that they should put out their demands. A doc was prepared in Google Doc Live that listed all the suggested demands by the fellow Googlers. “it was just this truly collective action, living, moving in a Google Document that we were all watching and participating in” said Cecelia O’Neil Hart, a marketer at YouTube. Cecelia also pointed out that the demands that were being collected were not new and had represented the voices of a lot of groups at Google. “It was just completely a process of defining what we wanted in solidarity with each other. I think it showed me the power of collective action, writing the demands quite literally as a collective” said Cecelia. Rights of Contractors One of the demands laid out by the Google employees as a part of the walkout, states, “commitment to ending pay and opportunity inequity for all levels of the organization”. They expected a change that is applicable to not just full-time employees, but also contract workers as well as subcontract workers, as they are the ones who work at Google with rights that are restricted and different than those of the full-time employees. “We have contractors that manage teams of upwards of 10, 20, even more, other people but left in this second-class state where they don’t have healthcare benefits, they don’t have paid sick leave and they definitely don’t get access to the same well-being resources: Counseling, professional development, any of that”, adds Stephanie Parker, a policy specialist on Trust and Safety, YouTube. Other examples of discrimination against contractors at Google include the shooting at YouTube Headquarters in April where contractor workers (security guards, cafeteria workers, etc) were excluded from the post-shooting town hall meeting conducted by Susan Wojcicki, CEO, YouTube. 
Also, while the shooting was taking place, all the employees were being updated on the Security via texts, except the contractors. Similarly, the contractors were not allowed in the town hall meeting that was conducted six days post walkout, although the demands applied to them just as much as it did to full-time employees. There’s also systemic racism in hiring and promotion for certain job ladders like engineering, versus other job ladders, versus contract work. Parker mentioned that by including contractors in the five demands, they wanted to bring it to everyone’s attention that despite Google striving to be a company with the best workplace that offers the best benefits, it’s quite far-off from leading in that space. “The solution is to convert them to full-time or to treat them fairly with respect. Not to throw up our hands and say, “Oh well” said Parker. Post walkout town hall meeting Six days after the walkout, a mail was sent over to the employees regarding the town hall meeting, which Google said was accidentally “leaked”. Stapleton, a marketing manager at YouTube, says that the “the town hall was really tough to watch” and that the Google executives “did not ever address, acknowledge, the list of demands nor did they adequately provide solutions to all the five. They did drop forced arbitration, but for sexual harassment only, not discrimination, which was a key omission”. As per the employees, Google seemed to use the same old methods to get the situation under control. Google said that they’ll be focusing on committing to the OKRs (Objective and Key Result) i.e. the main goal for the company as a whole. Moreover, they also tried to play down the other concerns and core issues such as discrimination (apart from sexual), racism, and the abuse of power while only focussing on one kind of behavior i.e. sexual assault. 
They mentioned how Google refused to address any issues surrounding the TVCs (temps, vendors, and contractors), despite being asked about it in the town hall. Also, Google did not acknowledge that the HR processes and systems within the company are not working. Instead, Google decided to conduct a survey to ensure how people really feel about the HR teams within the workplace. “They heard loud and clear from 20,000 of us that these processes and reporting lines that are in place are set up the wrong way and need to be redesigned so that we normal employees have more of a say and more of a look into the decision-making processes, and they didn’t even acknowledge that as a valid sentiment or idea”, said Parker. All in all, there wasn’t much “leadership”, and there wasn’t an understanding that “accountability was necessary”. Employees want their demands to be met Employees want an employee representative on board to speak on behalf of all the employees. They want accountability systems in place and for Google to begin analyzing the cultures within companies that use racism, discrimination, abuse of power, sexism, the kind that excludes many from power and accrue resources to only a few. The employees acknowledge that Google is continuing to discuss and talk about the issue, but that the employees would have to keep pushing the conversation forward every step of the way. “I think we need to not be afraid to say the real words. I want to hear our execs say the real words like “discrimination,” which was erased from their response to the demands. Like ‘systemic racism’.I want to hear those real words” said Cecelia. Employees also want the demand no. 2 i.e. ending pay inequity specifically to be addressed by Google as all they keep getting in response is that Google is “looking into it” and “studying” about it. 
“I think that what they have to do is embrace the tough critique that they’ve gotten and try to understand where we’re coming from and make these changes, and make them in collaboration with us, which has not happened,” said Stapleton. Employees continue to be cautiously hopeful Employees believe that Google has incredible people at the company. Thousands of people came together and worked on their vision for the world altogether on something that really mattered. “You know, we’ve called this the ‘Walkout for Real Change’ for a reason. Even if all of our optimism comes true and the best outcome and our demands are met, real change happens over time and we’re going to hold people accountable to that real change actually going down, and hold us accountable for demanding it also, because we’ve got to get the rest of the demands met”, says Cecelia. Our thoughts on this topic Just as history has proven time and again, information and data can be used to drive a narrative that benefits the storyteller and their agendas. Based on collecting feedback from workers across the company, the Google walkout organizers pointed out systemic issues within the company that enabled the sexual predatory behavior. They pointed out that sexual harassment is one of the symptoms and not the cause. They demanded that the root causes be addressed holistically through their set of five demands. To extinguish a movement or dissension in its infancy, regimes and corporations throughout history have used the following tactics: Be the benevolent ruler Divide and conquer the crowd by appealing to individual group needs but never to everyone’s collective demands Find a middle ground by agreeing to some demands while signaling that the other side also takes a few steps forward thereby disengaging those whose demands aren’t met. This would weaken the movement’s leadership Use the information to support the status quo. 
Promote the influencers into top management roles
It appears that Google is using many of these approaches to appease the walkout participants. The Google management adopted classic labor negotiation tactics by sanctioning the protest, encouraging managers to participate, and then agreeing to adopt the easiest item on the list of demands, one that some other tech companies had already implemented, while restricting it to only its own employees. By restricting the reforms to only their employees, and creating a larger distance from TVCs, they seem to be thinning out the protesting crowd. By not engaging in open dialog on all key issues highlighted and by keeping key decision makers at the top out of the town hall, they have created room for deniability. Lastly, by going back to surveying sentiments on key issues, they are relying not only on time to subdue the anger felt but also on the grassroots voice to dissipate. Will this be the tipping point for Google employees to unionize?
BuzzFeed Report: Google’s sexual misconduct policy “does not apply retroactively to claims already compelled to arbitration”
OK Google, why are you ok with mut(at)ing your ethos for Project DragonFly?
Following Google, Facebook changes its forced arbitration policy for sexual harassment claims



Richard Gall

16 Dec 2019

6 min read

Artificial intelligence, data science, and big data in 2019: what really mattered


The techlash hasn’t died down - it’s just become normalized. Barely a day passes without a new scandal emerging, from questionable surveillance to racist AI algorithms. But it hasn’t all been bad: while negatives get a lot of attention (and so they should - the consequences of tech can be lethal, both societally and literally), there was still plenty to get excited about. And for those working in the data profession - as analysts, scientists, and engineers - there were several important trends that really helped to define where we are now from a purely practical perspective, as well as hinting at where we might go in the future. With just a few weeks left to go of the year (and the decade!), let’s look at some of the key things that defined this year in the field of data science and data engineering.
The growth of PyTorch
TensorFlow is undoubtedly the most popular deep learning framework. You might even say that its role in popularizing deep learning and artificial intelligence has been understated. But while TensorFlow has held its place for some time, 2019 was the year when things started to change. Look, for example, at this Google Trends graph (and yes, I know it’s not in any way scientific): As you can see, TensorFlow hit its stride pretty early on. It’s only in the last 12 months or so that PyTorch has been narrowing the gap. One of the reasons for this is the fact that PyTorch 1.0 was released at the end of last year. This has been the foundation that has spurred its growth over the last 12 months, effectively announcing its ‘official’ arrival on the scene, with Facebook (PyTorch’s creator) building on it throughout the year with a few small but important releases. PyTorch 1.3, for example, which was released at the PyTorch Developer Conference in October, included a number of ‘experimental’ new features, including named tensors and PyTorch Mobile. Another reason for PyTorch’s growth this year is that it is finding traction in the research field.
This article provides some hard data that proves that PyTorch is starting to grow in this area, citing the tool’s comparable simplicity, API and performance as the reasons that it’s undermining TensorFlow’s utter dominance of the field. Find our PyTorch bundle, and other data bundles, here. Grab 5 titles for just $25. TensorFlow 2.0 While PyTorch has grown significantly in 2019, TensorFlow is nevertheless still holding its place at the top of the deep learning rankings. And TensorFlow 2.0 has undoubtedly cemented its position. With the alpha release getting developers excited since March, the full launch of 2.0 marked an important milestone for the project. The key difference between TensorFlow 2.0 and 1.0 is ultimately accessibility and ease of use. Despite its massive popularity, TensorFlow 1.0 always had a reputation for being a little more difficult to use than many other deep learning tools. The team were clearly aware of this and have done a lot to make life easier for TensorFlow developers. “With tight integration of Keras into TensorFlow, eager execution by default, and Pythonic function execution,” the team write in the release notes, “TensorFlow 2.0 makes the experience of developing applications as familiar as possible for Python developers.” When placed alongside the exciting development of PyTorch, it’s clear that these two tools are going to be defining deep learning in the year - or years - to come. Get up to date with what's new in TensorFlow 2.0 with TensorFlow 2.0 Quick Start Guide. Stream processing with Kafka, Flink, and others Dealing with large quantities of data in real-time is now the cutting-edge of big data. It’s for this reason that this year we’ve started to see stream processing gain headway in the mainstream. Although it’s been an important technique for organizations with data-intensive needs, the use of cloud and hybrid solutions - as well as an overall awareness of the opportunities of real-time data - has become truly mainstream. 
In turn, this is giving new prominence to a range of stream-processing platforms. Kafka, Spark, and Flink are just three of the most well-known names in this space, but the market is undoubtedly growing. Another key driver here is Nvidia - as one of the leading hardware companies, it deserves a lot of credit for helping to make massive processing power accessible to organizations that wouldn’t have had a chance just a few years ago. With CUDA, Nvidia’s parallel computing platform and programming model for GPUs, the company is helping all sorts of users to leverage stream processing in different ways. Get started with Apache Kafka with Apache Kafka Quick Start Guide.
Data analysis on the cloud
Although I've already mentioned how influential TensorFlow was in popularizing deep learning, today public cloud is going even further. It’s making artificial intelligence and analytics accessible to new roles (thinking here about tools like Azure Machine Learning Studio and Amazon SageMaker), as well as making it easier to build and deploy machine learning models in applications and products. In recent weeks, Microsoft has made another step in its bid to eat into AWS’s market share with Azure Synapse. Essentially a next generation Azure SQL Data Warehouse, Synapse is designed to bridge the gap between data lake and data warehouse, offering massive scale and improving analytical speed. It will be interesting to see how this plays with the wider market. AWS might respond with something similar - but the onus remains on Microsoft to shift mindshare; AWS will want to consolidate its powerful position.
Security
It would be wrong to suggest that security is a new issue in the world of data engineering and analytics. But in 2019 it’s become almost impossible to think about the two domains as separate from one another. This cuts two different ways: on the one hand the emphasis on securing data and protecting privacy has never been greater.
On the other hand, artificial intelligence and machine learning have started to play a critical part in the way that we monitor and identify threats to our systems. To a certain extent this expresses the double bind that data poses: the amount of data at our disposal is a nightmare from a governance and architectural perspective, but it is, at the same time, a way of mitigating that very nightmare. All in all, then, a bit of a vicious cycle, but nevertheless a reminder that however big our data gets, and however much we try to automate, there will always be a need for humans to think creatively and strategically about how we actually go about solving problems. Explore Packt's security bundles now. For more technology eBooks and videos to prepare you for 2020, head to the Packt store.



Aaron Lazar

30 Nov 2017

7 min read

10 machine learning algorithms every engineer needs to know


When it comes to machine learning, it's all about the algorithms. But although machine learning algorithms are the bread and butter of a data scientist's job, it's not always as straightforward as simply picking up an algorithm and running with it. Algorithm selection is incredibly important and often very challenging. There are always a number of things you have to take into consideration, such as:
Accuracy: While accuracy is important, maximum accuracy is not always necessary. In many cases an approximation is sufficient, and then there is no point in chasing extra accuracy at the cost of processing time.
Training time: This goes hand in hand with accuracy and is not the same for all algorithms. The training time might also go up if there are more parameters. When time is a big constraint, you should choose an algorithm wisely.
Linearity: Algorithms that assume linearity expect the data to follow a linear trend. While this is good for some problems, for others it can result in lowered accuracy.
Once you've taken those 3 considerations on board you can start to dig a little deeper. Kaggle did a survey in 2017 asking respondents which algorithms - or 'data science methods' more broadly - they were most likely to use at work. Below is a screenshot of the results. Kaggle's research offers a useful insight into the algorithms actually being used by data scientists and data analysts today. But we've brought together the types of machine learning algorithms that are most important. Every algorithm is useful in different circumstances - the skill is knowing which one to use and when.
10 machine learning algorithms
Linear regression
This is clearly one of the most interpretable ML algorithms. It requires minimal tuning and is easy to explain, which is the key reason for its popularity. It shows the relationship between two or more variables and how a change in one of the independent variables impacts the dependent variable.
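As a small illustration of how linear regression relates variables, the sketch below fits a straight line y = intercept + slope * x by ordinary least squares in plain Python. The function name `fit_line` and the sales figures are invented for this example.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-deviations and squared deviations (unnormalized
    # covariance and variance; the normalizing factor cancels in the ratio)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical monthly sales: the month number is the independent variable,
# the units sold are the dependent variable.
months = [1, 2, 3, 4, 5]
units = [12, 14, 16, 18, 20]
intercept, slope = fit_line(months, units)
forecast_month_6 = intercept + slope * 6
```

With this made-up data the slope is 2 units per month, so the fitted line predicts 22 units for month 6; the slope is the directly interpretable quantity that makes the method easy to explain.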
It is used for forecasting sales based on trends, as well as for risk assessment. Although its accuracy is relatively low, the few parameters it needs and its short training times make it quite popular among beginners.

Logistic regression

Logistic regression is typically viewed as a special form of linear regression, where the output variable is categorical. It's generally used to predict a binary outcome, i.e. True or False, 1 or 0, Yes or No, from a set of independent variables. As you would have already guessed, this algorithm is generally used when the dependent variable is binary. Like linear regression, logistic regression has a low level of accuracy, few parameters and short training times. It goes without saying that it's quite popular among beginners too.

Decision trees

These algorithms are mainly decision support tools that use tree-like graphs or models of decisions and their possible consequences, including chance-event outcomes, utilities, etc. To put it in simple words, a decision tree asks the smallest number of yes/no questions needed to reach the right decision as often as possible. It lets you tackle the problem at hand in a structured, systematic way to logically deduce the outcome. Decision trees are excellent when it comes to accuracy, but their training times are a bit longer compared to other algorithms. They also require a moderate number of parameters, so arriving at a good combination is not too complicated.

Naive Bayes

This is a type of classification ML algorithm that's based on the popular probability theorem by Bayes. It is one of the most popular learning algorithms. It groups similar items together and is usually used for document classification, facial recognition software or for predicting diseases. It generally works well when you have a medium to large data set to train your models. These have moderate training times and make use of linearity.
While this is good, linearity might also bring down accuracy for certain problems. They also do not bank on too many parameters, making it easy to arrive at a good combination, although at the cost of accuracy.

Random forest

Without a doubt, this one is a popular go-to machine learning algorithm that creates a group of decision trees from random subsets of the data. It can be used for both classification and regression. It is simple to use, as just a few lines of code are enough to implement the algorithm. It is used by banks to predict high-risk loan applicants, or by hospitals to predict whether a particular patient is likely to develop a chronic disease. With a high accuracy level and moderate training time, it is quite efficient to implement. Moreover, it has a moderate number of parameters.

K-Means

K-Means is a popular unsupervised algorithm that is used for cluster analysis and is an iterative and non-deterministic method. It operates on a given dataset through a predefined number of clusters. The output of a K-Means algorithm will be k clusters, with the input data partitioned among these clusters. Large companies like Google use K-Means to cluster pages by similarity and discover the relevance of search results. This algorithm has a moderate training time and good accuracy. It doesn't have many parameters, meaning that it's easy to arrive at the best possible combination.

K nearest neighbors

K nearest neighbors is a very popular machine learning algorithm which can be used for both regression and classification, although it's mostly used for the latter. Although it is simple, it is extremely effective. It takes little to no time to train, although its accuracy can be heavily degraded by high-dimensional data, since there is then not much of a difference between the nearest neighbor and the farthest one.

Support vector machines

SVMs are one of several examples of supervised ML algorithms dealing with classification.
They can be used for either regression or classification, in situations where the training dataset teaches the algorithm about specific classes, so that it can then classify newly included data. What sets them apart from other machine learning algorithms is that they are able to separate classes more quickly and with less overfitting than several other classification algorithms. A few of the biggest pain points that have been resolved using SVMs are display advertising, image-based gender detection and image classification with large feature sets. They are moderate in their accuracy, as well as their training times, mostly because they assume a linear approximation. On the other hand, they require an average number of parameters to get the work done.

Ensemble methods

Ensemble methods are techniques that build a set of classifiers and combine their predictions to classify new data points. The original ensemble method is Bayesian averaging, but newer algorithms include error-correcting output coding, etc. Although ensemble methods allow you to devise sophisticated algorithms and produce results with a high level of accuracy, they are less preferred in industries where interpretability of the algorithm is more important. However, with their high level of accuracy, it makes sense to use them in fields like healthcare, where even the smallest improvement can add a lot of value.

Artificial neural networks

Artificial neural networks are so named because they mimic the functioning and structure of biological neural networks. In these algorithms, information flows through the network and, depending on the input and output, the neural network changes in response. One of the most common use cases for ANNs is speech recognition, as in voice-based services. As the information fed to them grows, these algorithms improve. However, artificial neural networks are imperfect. With great power comes longer training times.
They also have several more parameters compared to other algorithms. That being said, they are very flexible and customizable.

If you want to skill up in implementing machine learning algorithms, you can check out the following books from Packt:

Data Science Algorithms in a Week by Dávid Natingga
Machine Learning Algorithms by Giuseppe Bonaccorso
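To make two of the algorithms above more concrete, here is a minimal sketch in plain Python, not taken from the article. It fits a straight line by ordinary least squares (the simplest form of linear regression) and classifies points with a majority vote among the k nearest neighbors. The function names and the tiny datasets are made up for illustration; in practice you would more likely reach for a library implementation.

```python
import math
from collections import Counter


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalised).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def knn_predict(train, query, k=3):
    """Majority vote among the k training points closest to `query`.

    `train` is a list of (features, label) pairs. There is no training
    step: all of the work happens at prediction time.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical monthly sales figures (made-up data).
months = [1, 2, 3, 4, 5]
sales = [12.0, 14.0, 16.0, 18.0, 20.0]
slope, intercept = fit_line(months, sales)
forecast_month_6 = slope * 6 + intercept

# Hypothetical 2-D points in two well-separated groups (made-up data).
points = [
    ((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
    ((8, 8), "b"), ((8, 9), "b"), ((9, 8), "b"),
]
label_near_origin = knn_predict(points, (2, 2))
label_far_away = knn_predict(points, (7, 7))
```

The linear fit has only two parameters and a closed-form solution, which is why its training time is so short. The nearest-neighbor classifier stores the data and defers all computation to prediction, which matches the "little to no time to train" remark above.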


Amey Varangaonkar

14 Jun 2018

4 min read

What are data professionals planning to learn this year? Python, deep learning, yes. But also...


One thing that every data professional absolutely dreads is the day their skills are no longer relevant in the market. In an ever-changing tech landscape, one must be constantly on the lookout for the most relevant, industry-accepted tools and frameworks. This applies to everyone, from application and web developers to cybersecurity professionals. Data professionals are no exception, as new ways to extract actionable insights from raw data are discovered almost every day. Gone are the days when data pros stuck to a single language and framework to work with their data. Frameworks are more flexible now, with multiple dependencies across various tools and languages. Not just that, new domains are being identified where these frameworks can be applied, and how they can be applied varies massively as well. A whole new arena of possibilities has opened up, and with it a new set of skills and toolkits for working in these domains.

What's the next big thing for data professionals?

We recently polled thousands of data professionals as part of our Skill-Up program, and got some very interesting insights into what they think the future of data science looks like. We asked them what they were planning to learn in the next 12 months. The following word cloud is the result of their responses, weighted by frequency of the tools they chose:

What data professionals are planning on learning in the next 12 months

Unsurprisingly, Python comes out on top as the language many data pros want to learn in the coming months. Given its general-purpose nature and innumerable applications across various use cases, it is no surprise that Python's popularity is sky-rocketing and that everybody wants to learn it. Machine learning and AI are finding significant applications in the web development domain today. They are revolutionizing the customers' digital experience through conversational UIs or chatbots.
Not just that, smart machine learning algorithms are being used to personalize websites and their UX. With all these reasons, who wouldn't want to learn JavaScript as an important part of their data science toolkit? Add to that the trending web dev framework Angular, and you have all the tools to build smart, responsive front-end web applications. We also saw data professionals taking an active interest in the mobile and cloud domains. They aim to learn Kotlin and combine its power with data science tools to develop smarter and more intelligent Android apps. When it comes to the cloud, Microsoft's Azure platform has introduced many built-in machine learning capabilities, as well as a workbench for data scientists to develop effective, enterprise-grade models. Data professionals also prefer Docker containers to run their applications seamlessly, so demand for learning Docker seems to be quite high.

Has machine learning with JavaScript caught your interest? Don't worry, we've got you covered - check out Hands-on Machine Learning with JavaScript for practical, hands-on coverage of the essential machine learning concepts using the leading web development language.

With crypto's popularity through the roof (sadly, we can't say the same about Bitcoin's price), data pros see blockchain as a valuable skill. Building secure, decentralized apps is on the agenda for many, perhaps. Cloud, Big Data and Artificial Intelligence are some of the other domains that data pros find interesting and feel are worth skilling up in.

Work-related skills that data pros want to learn

We also asked the data professionals which skills they wanted to learn in the near future to do their daily jobs more effectively.
The following word cloud of their responses paints a pretty clear picture:

Valuable skills data professionals want to learn for their everyday work

As machine learning and AI go mainstream, so do their applications in mainstream domains - often resulting in complex problems. Well, there's deep learning, and specifically neural networks, to tackle these problems, and these are exactly the skills data pros want to master in order to excel at their work.

Data pros want to learn machine learning in Python. Do you? Here's a useful resource to get you started - check out Python Machine Learning, Second Edition today!

So, there it is! What are the tools, languages or frameworks that you are planning to learn in the coming months? Do you agree with the results of the poll? Do let us know.

What are web developers favorite front-end tools? Packt’s Skill Up report reveals all
Data cleaning is the worst part of data analysis, say data scientists
15 Useful Python Libraries to make your Data Science tasks Easier


Sugandha Lahoti

04 Jun 2018

8 min read

Visualizing BigQuery Data with Tableau


Tableau is an interactive data visualization tool that can be used to create business intelligence dashboards. Like most business intelligence tools, it can pull and manipulate data from a number of sources. The difference is its dedication to helping users create insightful data visualizations. Tableau's drag-and-drop interface makes it easy for users to explore data via elegant charts. It also includes an in-memory engine to speed up calculations on extremely large data sets. In today's tutorial, we will be using Tableau Desktop for visualizing BigQuery data.

This article is an excerpt from the book, Learning Google BigQuery, written by Thirukkumaran Haridass and Eric Brown. This book is a comprehensive guide to mastering Google BigQuery to get intelligent insights from your Big Data.

The following section explains how to use Tableau Desktop Edition to connect to BigQuery and get the data from BigQuery to create visuals:

After opening Tableau Desktop, select Google BigQuery under the Connect To a Server section on the left; then enter your login credentials for BigQuery.
At this point, all the tables in your dataset should be displayed on the left.
You can drag and drop the table you are interested in using to the middle section labeled Drop Tables Here. In this case, we want to query the Google Analytics BigQuery test data, so we will click where it says New Custom SQL and enter the following query in the dialog:

    SELECT trafficsource.medium as Medium, COUNT(visitId) as Visits
    FROM `google.com:analytics-bigquery.LondonCycleHelmet.ga_sessions_20130910`
    GROUP BY Medium

Now we can click on Update Now to view the first 10,000 rows of our data. We can also do some simple transformations on our columns, such as changing string values to dates and many others.
At the bottom, click on the tab titled Sheet 1 to enter the worksheet view.
Tableau's interface allows users to simply drag and drop dimensions and metrics from the left side of the report into the central part to create simple text charts, with a feel much like Excel's pivot chart functionality. This makes Tableau easy to transition to for Excel users.
From the Dimensions section on the left-hand-side navigation, drag and drop the Medium dimension into the sheet section. Then drag the Visits metric in the Metric section on the left-hand-side navigation to the Text sub-section in the Marks section. This will create a simple text chart with data from the original query.
On the right, click on the button marked Show Me. This should bring up a screen with icons for each graph type that can be created in Tableau. Tableau helps by shading graph types that are not available based on the data that is currently selected in the report. It will also make suggestions based on the data available. In this case, a bar chart has been preselected for us as our data is a text dimension and a numeric metric.
Click on the bar chart. Once clicked, the default sideways bar chart will appear with the data we have selected.
Click on Swap Rows and Columns in the icon bar at the top of the screen to flip the chart from horizontal to vertical.

Map charts in Tableau

One of Tableau's strengths is its ease of use when creating a number of different types of charts. This is especially true of maps, which can be very painful to create using other tools. Here is the way to create a simple map in Tableau using BigQuery public data. The first few steps are the same as in the preceding example:

After opening Tableau Desktop, select Google BigQuery under the Connect To a Server section on the left; then enter your login credentials for BigQuery.
At this point, all the tables in your dataset should be displayed on the left-hand side.
Click where it says New Custom SQL and enter the following query in the dialog:

    SELECT zipcode, SUM(population) AS population
    FROM `bigquery-public-data.census_bureau_usa.population_by_zip_2010`
    GROUP BY zipcode
    ORDER BY population desc

This data is from the 2010 United States Census. The query returns all zip codes in the USA, sorted from most populous to least populous.
At the bottom, click on the tab titled Sheet 1 to enter the worksheet view.
Double-click on the zipcode dimension in the dimensions section on the left navigation. Clicking on a dimension of zip codes (or any other formatted location dimension such as latitude/longitude, country names, state names, and so on) will automatically create a map in Tableau.
Drag the population metric from the metrics section on the left navigation and drop it on the color tab in the marks section. The map will now show the most populous zip codes shaded darker than the less populous zip codes.
The map chart also includes zoom features to make dealing with large maps easy. In the top-left corner of the map, there is a magnifying glass icon. This icon holds the map zoom features. Clicking on the arrow at the bottom of this icon opens more features. The icon with a rectangle and a magnifying glass is the selection tool (the first icon to the right of the arrow when hovering over the arrow).
Click on this icon and then on the map to select a section of the map to be zoomed into. The image shown here was taken after zooming into the California area of the United States. The map now shows the areas of the state that are the most populous.

Create a word cloud in Tableau

Word clouds are great visualizations for finding the words that are most referenced in books, publications, and social media. This section will cover creating a word cloud in Tableau using BigQuery public data.
The first few steps are the same as in the preceding example:

After opening Tableau Desktop, select Google BigQuery under the Connect To a Server section on the left; then enter your login credentials for BigQuery.
At this point, all the tables in your dataset should be displayed on the left.
Click where it says New Custom SQL and enter the following query in the dialog:

    SELECT word, SUM(word_count) word_count
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY word
    ORDER BY word_count desc

The dataset is from the works of William Shakespeare. The query returns a list of all words in his works, along with a count of the times each word appears.
At the bottom, click on the tab titled Sheet 1 to enter the worksheet view.
In the dimensions section, drag and drop the word dimension into the text tab in the marks section.
In the dimensions section, drag and drop the word_count measure to the size tab in the marks section. There will now be two tabs in use in the marks section.
Right-click on the size tab labeled word and select Measure | Count. This will create what is called a tree map. In this example, there are far too many words in the list for the visualization to be useful.
Drag and drop the word_count measure from the measures section to the filters section. When prompted with How do you want to filter on word_count, select Sum and click on Next.
Select At Least for your condition and type 2000 in the dialog. Click on OK. This will return only those words that have a word count of at least 2,000.
Use the dropdown in the marks card to select Text.
Drag and drop the word_count measure from the measures section to the color tab in the marks section. This will color each word based on the count for that word.

You should be left with a color-coded word cloud. Other charts can now be created as individual worksheet tabs. Tabs can then be combined to make what Tableau calls a dashboard.
The process of creating a dashboard here is a bit more cumbersome than creating a dashboard in Google Data Studio, but Tableau offers a great deal more customization for its dashboards. This, coupled with all the other features it offers, makes Tableau a much more attractive option, especially for enterprise users.

We learnt various features of Tableau and how to use it for visualizing BigQuery data. To learn about other third-party tools for reporting and visualization, such as R and Google Data Studio, check out the book Learning Google BigQuery.

Tableau is the most powerful and secure end-to-end analytics platform - Interview Insights
Tableau 2018.1 brings new features to help organizations easily scale analytics
Getting started with Data Visualization in Tableau
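The word-cloud steps filter on the summed word count inside Tableau, but the same "at least 2,000" condition could also be pushed into the query itself with a HAVING clause. The sketch below illustrates that aggregation pattern only. It is an assumption-laden stand-in, not the article's method: it uses Python's built-in sqlite3 module with an in-memory table instead of BigQuery, the rows are made up, and the `total` alias replaces the `word_count` alias used in the query above to avoid a name clash in SQLite.

```python
import sqlite3

# Made-up rows shaped like the public Shakespeare sample table
# (word, corpus, word_count); the numbers are NOT real counts.
rows = [
    ("the", "hamlet", 1200),
    ("the", "macbeth", 900),
    ("and", "hamlet", 1100),
    ("and", "macbeth", 950),
    ("thou", "hamlet", 300),
    ("thou", "macbeth", 250),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE shakespeare (word TEXT, corpus TEXT, word_count INTEGER)"
)
conn.executemany("INSERT INTO shakespeare VALUES (?, ?, ?)", rows)

# GROUP BY sums the per-work counts for each word; HAVING then keeps
# only words whose total is at least 2000, like the Tableau filter.
query = """
SELECT word, SUM(word_count) AS total
FROM shakespeare
GROUP BY word
HAVING SUM(word_count) >= 2000
ORDER BY total DESC
"""
frequent_words = conn.execute(query).fetchall()
conn.close()

for word, total in frequent_words:
    print(word, total)
```

Filtering in the query reduces the number of rows Tableau has to pull, while filtering in Tableau keeps the threshold adjustable from the worksheet. Which is preferable depends on how often the threshold is expected to change.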


Melisha Dsouza

09 Nov 2018

4 min read

#GoogleWalkout demanded a ‘truly equitable culture for everyone’; Pichai shares a “comprehensive” plan for employees to safely report sexual harassment


Last week, 20,000 Google employees, along with temps, vendors, and contractors, walked out to protest the discrimination, racism, and sexual harassment that they encountered at Google's workplace. This global walkout by Google workers was a response to the New York Times report, published last month, on Google shielding senior executives accused of sexual misconduct. Yesterday, Google addressed these demands in a note written by Sundar Pichai to employees. He admits that they have “not always gotten everything right in the past” and that they are “sincerely sorry” for it. This supposedly ‘comprehensive’ plan will provide more transparency into how employees raise concerns and how Google will handle them. Here are some of the major changes that caught our attention:

Following suit after Uber and Microsoft, Google has eliminated forced arbitration in cases of sexual harassment.
To make reporting a sexual harassment case more transparent, employees can now be accompanied by support persons in meetings with HR.
Google is planning to update and expand its mandatory sexual harassment training, which will now be conducted annually instead of once every two years. Employees who fail to complete the training will be docked one rating in the performance review system. This applies to senior management as well, who could be downgraded from ‘exceeds expectation’ to ‘meets expectation’.
Google will increase its focus on diversity, equity and inclusion in 2019, through hiring, progression and retention, in order to create a more inclusive culture for everyone.
Google found that one of the most common factors among the harassment complaints is that the perpetrator was under the influence of alcohol (~20% of cases).
Restating the policy, the plan mentions that excessive consumption of alcohol is not permitted when an employee is at work, performing Google business, or attending a Google-related event, whether onsite or offsite. Going forward, all leaders at the company will be expected to create teams, events, offsites and environments in which excessive alcohol consumption is strongly discouraged. They will be expected to follow the two-drink rule.

Although the plan is a step towards making workplace conditions stable, it leaves out some of the deeper structural concerns raised by the organizers of the Google Walkout, such as the structural inequity that separates ‘full time’ employees from contract workers. Contract workers make up more than half of Google’s workforce, and perform essential roles across the company. However, they receive few of the benefits associated with tech company employment. They are also largely women, people of color, immigrants, and people from working class backgrounds.

“We demand a truly equitable culture, and Google leadership can achieve this by putting employee representation on the board and giving full rights and protections to contract workers, our most vulnerable workers, many of whom are Black and Brown women.” - Google Walkout Organizer Stephanie Parker

Google’s plan to bring transparency to the workplace looks like a positive step towards improving its workplace culture. It will be interesting to see how the plan works out for Google’s employees, and whether other organizations use it as an example for maintaining a peaceful workplace environment for their workers. You can head over to Medium.com to read the #GoogleWalkout organizers’ response to the update. Head over to Pichai’s blog post for details on the announcement itself.
Technical and hidden debts in machine learning – Google engineers’ give their perspective
90% Google Play apps contain third-party trackers, share user data with Alphabet, Facebook, Twitter, etc: Oxford University Study
OK Google, why are you ok with mut(at)ing your ethos for Project DragonFly?


Amey Varangaonkar

18 Dec 2017

6 min read

9 Useful R Packages for NLP & Text Mining


The following excerpt is taken from the book Mastering Text Mining with R, co-authored by Ashish Kumar and Avinash Paul. This book lists various techniques to extract useful and high-quality information from your textual data.

There is a wide range of packages available in R for natural language processing and text mining. In the article below, we present some of the popular and widely used R packages for NLP:

OpenNLP

OpenNLP is an R package which provides an interface to Apache OpenNLP, a machine-learning-based toolkit written in Java for natural language processing activities. Apache OpenNLP is widely used for most common tasks in NLP, such as tokenization, POS tagging, named entity recognition (NER), chunking, parsing, and so on. The package provides wrappers for maximum entropy (Maxent) models using the Maxent Java package. It provides functions for sentence annotation, word annotation, POS tag annotation, and annotation parsing using the Apache OpenNLP chunking parser. The Maxent Chunk annotator function computes the chunk annotation using the Maxent chunker provided by OpenNLP. The Maxent entity annotator function in the R package utilizes the Apache OpenNLP Maxent name finder for entity annotation. Model files can be downloaded from http://opennlp.sourceforge.net/models-1.5/. These language models can be used in R by installing the OpenNLPmodels.language package from the repository at http://datacube.wu.ac.at. Get the OpenNLP package here.

RWeka

The RWeka package in R provides an interface to Weka. Weka is open source software developed by the machine learning group at the University of Waikato. It provides a wide range of machine learning algorithms which can either be applied directly to a dataset or called from Java code.
Different data-mining activities, such as data processing, supervised and unsupervised learning, association mining, and so on, can be performed using the RWeka package. For natural language processing, RWeka provides tokenization and stemming functions. The RWeka package provides an interface to the AlphabeticTokenizer, NGramTokenizer, and WordTokenizer functions, which perform tokenization into contiguous alphabetic sequences, splitting of strings into n-grams, and simple word tokenization, respectively. Get started with RWeka here.

RcmdrPlugin.temis

The RcmdrPlugin.temis package in R provides a graphical integrated text-mining solution. This package can be leveraged for many text-mining tasks, such as importing and cleaning a corpus, term and document counts, term co-occurrences, correspondence analysis, and so on. Corpora can be imported from different sources and analysed using the importCorpusDlg function. The package provides flexible data source options to import corpora from different sources, such as text files, spreadsheet files, XML, HTML files, Alceste format and Twitter search. The import function in this package processes the corpus and generates a term-document matrix. The package provides different functions to summarize and visualize the corpus statistics. Correspondence analysis and hierarchical clustering can be performed on the corpus. The corpusDissimilarity function helps analyse and create a cross-dissimilarity table between term-documents present in the corpus. This package provides many functions to help users explore the corpus, for example frequentTerms to list the most frequent terms of a corpus, specificTerms to list the terms most associated with each document, and subsetCorpusByTermsDlg to create a subset of the corpus.
Term frequency, term co-occurrence, term dictionary, temporal evolution of occurrences or term time series, term metadata variables, and corpus temporal evolution are among the other very useful functions available in this package for text mining. Download the package from its CRAN page.

tm

The tm package is a text-mining framework which provides some powerful functions to aid in text-processing steps. It has methods for importing data, handling corpora, metadata management, creation of term-document matrices, and preprocessing. To manage documents using the tm package, we create a corpus, which is a collection of text documents. There are two types of implementation: volatile corpus (VCorpus) and permanent corpus (PCorpus). A VCorpus is held completely in memory, and when the R object is destroyed the corpus is gone. A PCorpus is stored in the filesystem and is still present after the R object is destroyed. These corpora can be created using the VCorpus and PCorpus functions, respectively. This package provides a few predefined sources which can be used to import text, such as DirSource, VectorSource, or DataframeSource. The getSources method lists available sources, and users can create their own sources. The tm package ships with several reader options, including readPlain, readPDF, and readDOC. We can execute the getReaders method for an up-to-date list of available readers. To write a corpus to the filesystem, we can use writeCorpus. For inspecting a corpus, there are methods such as inspect and print. For transformations of text, such as stop-word removal, stemming, whitespace removal, and so on, we can use the tm_map, content_transformer, tolower, and stopwords("english") functions. For metadata management, meta comes in handy. The tm package provides various quantitative functions for text analysis, such as DocumentTermMatrix, findFreqTerms, findAssocs, and removeSparseTerms. Download the tm package here.
languageR

languageR provides data sets and functions for statistical analysis of text data. This package contains functions for vocabulary richness, vocabulary growth, frequency spectrum, mixed-effects models, and so on. There are simulation functions available for simple regression, quasi-F factor, and Latin-square designs. Apart from that, this package can also be used for correlation, collinearity diagnostics, diagnostic visualization of logistic models, and so on.

koRpus

The koRpus package is a versatile tool for text mining which implements many functions for text readability and lexical variation. Apart from that, it can also be used for basic-level functions such as tokenization and POS tagging. You can find more information about its current version and dependencies here.

RKEA

The RKEA package provides an interface to KEA, which is a tool for keyword extraction from texts. RKEA requires a keyword extraction model, which can be created by manually indexing a small set of texts and is then used to extract keywords from a document.

maxent

The maxent package in R provides tools for a low-memory implementation of multinomial logistic regression, which is also called the maximum entropy model. This package is quite helpful for classification processes involving sparse term-document matrices, and for keeping memory consumption low on huge datasets. Download and get started with maxent.

lsa

Truncated singular value decomposition can help overcome the variability in a term-document matrix by deriving the latent features statistically. The lsa package in R provides an implementation of latent semantic analysis.

The ease of use and efficiency of R packages can be very handy when carrying out even the trickiest of text mining tasks. As a result, they have grown to become very popular in the community. If you found this post useful, you should definitely refer to our book Mastering Text Mining with R.
It will give you ample techniques for effective text mining and analytics using the above-mentioned packages.
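Several of the packages above revolve around the same two building blocks: tokenization and the term-document matrix. The sketch below shows the idea in plain Python. It is only a conceptual analog, not the API of tm, RWeka or any other R package; the function names, the stop-word list and the two sample sentences are all made up for illustration.

```python
import re
from collections import Counter

# A tiny, hypothetical stop-word list; real lists are much longer.
STOPWORDS = {"the", "a", "is", "of", "and"}


def tokenize(text):
    """Lower-case the text, keep alphabetic runs, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower())
            if t not in STOPWORDS]


def document_term_matrix(docs):
    """Return (vocabulary, matrix) with one row per document and one
    column per term, each cell holding that term's count."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix


def frequent_terms(vocab, matrix, min_count):
    """Terms whose total count across all documents is >= min_count."""
    totals = [sum(column) for column in zip(*matrix)]
    return [term for term, n in zip(vocab, totals) if n >= min_count]


# Two made-up documents.
docs = [
    "Text mining is the mining of text.",
    "R is a language for text analysis.",
]
vocab, matrix = document_term_matrix(docs)
common = frequent_terms(vocab, matrix, min_count=2)

print(vocab)
for row in matrix:
    print(row)
print(common)
```

Most cells in a real term-document matrix are zero, which is why the packages above store it in sparse form and why low-memory classifiers and truncated decompositions are useful on top of it.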


Richard Gall

02 Sep 2019

8 min read

Data science vs. machine learning: understanding the difference and what it means today


One of the things that I really love about the tech industry is how often different terms - buzzwords especially - can cause confusion. It isn’t hard to see this in the wild. Quora is replete with confused people asking about the difference between a ‘developer’ and an ‘engineer’ and how ‘infrastructure’ is different from ‘architecture’. One of the biggest points of confusion is the difference between data science and machine learning. The two terms refer to different but related domains - given their popularity it isn’t hard to see how some people might be a little perplexed. This might seem like a purely semantic problem, but in the context of people’s careers, as they make decisions about the resources they use and the courses they pay for, the distinction becomes much more important. Indeed, it can be perplexing for developers thinking about their career - with ‘machine learning engineer’ starting to appear across job boards, it’s not always clear where that role ends and ‘data scientist’ begins.

Tl;dr: To put it simply - and if you can’t be bothered to read further - data science is a discipline or job role that’s all about answering business questions through data. Machine learning, meanwhile, is a technique that can be used to analyze or organize data. So, data scientists might well use machine learning to find something out, but it would only be one aspect of their job.

But what are the implications of this distinction between machine learning and data science? What can the relationship between the two terms tell us about how technology trends evolve? And how can it help us better understand them both?

What’s causing confusion about the difference between machine learning and data science?

The data science v machine learning confusion comes from the fact that both terms have a significant grip on the collective imagination of the tech and business world.
Back in 2012 the Harvard Business Review declared data scientist to be the ‘sexiest job of the 21st century’. This was before the machine learning and artificial intelligence boom, but it’s the point we need to go back to understand how data has shaped the tech industry as we know it today. Data science v machine learning on Google Trends Take a look at this Google trends graph: Both terms broadly received a similar level of interest. ‘Machine learning’ was slightly higher throughout the noughties and a larger gap has emerged more recently. However, despite that, it’s worth looking at the period around 2014 when ‘data science’ managed to eclipse machine learning. Today, that feels remarkable given how machine learning is a term that’s extended out into popular consciousness. It suggests that the HBR article was incredibly timely, identifying the emergence of the field. But more importantly, it’s worth noting that this spike for ‘data science’ comes at the time that both terms surge in popularity. So, although machine learning eventually wins out, ‘data science’ was becoming particularly important at a time when these twin trends were starting to grow. This is interesting, and it’s contrary to what I’d expect. Typically, I’d imagine the more technical term to take precedence over a more conceptual field: a technical trend emerges, for a more abstract concept to gain traction afterwards. But here the concept - the discipline - spikes just at the point before machine learning can properly take off. This suggests that the evolution and growth of machine learning begins with the foundations of data science. This is important. It highlights that the obsession with data science - which might well have seemed somewhat self-indulgent - was, in fact, an integral step for business to properly make sense of what the ‘big data revolution’ (a phrase that sounds eighty years old) meant in practice. 
Insofar as ‘data science’ is a term that really just refers to a role that’s performed, its growth was ultimately evidence of a space being carved out inside modern businesses that gave a domain expert the freedom to explore and invent in the service of business objectives. If that was the baseline, then the continued rise of machine learning feels inevitable. Once it had moved out of academic computer science departments and into business, thanks to the emergence of the data scientist job role, we started to see a whole suite of tools and use cases that were about much more than analytics and insight. Machine learning became a practical tool that had practical applications everywhere. From cybersecurity to mobile applications, from marketing to accounting, machine learning couldn’t be contained within the data science discipline. This wasn’t just a conceptual point - practically speaking, a data scientist simply couldn’t provide support to all the different ways in which business functions wanted to use machine learning. So, the confusion around the relationship between machine learning and data science stems from the fact that the two trends go hand in hand - or at least they used to. To properly understand how they’re different, let’s look at what a data scientist actually does.

What is data science, exactly?

I know you’re not supposed to use Wikipedia as a reference, but the opening sentence in the entry for ‘data science’ is instructive: “Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data.” The word that deserves your attention is multi-disciplinary, as this underlines what makes data science unique and why it stands outside of the more specific taxonomy of machine learning terms.
Essentially, it’s a human activity as much as a technical one - it’s about arranging, organizing, interpreting, and communicating data. To a certain extent it shares a common thread of DNA with statistics. But although Nate Silver said that ‘data scientist’ was “a sexed up term for statistician”, I think there are some important distinctions. To do data science well you need to be deeply engaged with how your work integrates with the wider business strategy and processes. The term ‘statistics’ - like ‘machine learning’ - doesn’t quite do this. Indeed, to a certain extent this has made data science a challenging field to work in. It isn’t hard to find evidence that data scientists are trying to leave their jobs, frustrated with how their roles are being used and how they integrate into existing organisational structures. How do data scientists use machine learning? As a data scientist, your job is to answer questions. These are questions like: What might happen if we change the price of a product in this way? What do our customers think of our products? How often do customers purchase products? How are customers using our products? How can we understand the existing market? How might we tackle it? Where could we improve efficiencies in our processes? That’s just a small set. The types of questions data scientists will be tackling will vary depending on the industry, their company - everything. Every data science job is unique. But whatever questions data scientists are asking, it’s likely that at some point they’ll be using machine learning. Whether it’s to analyze customer sentiment (grouping and sorting) or predicting outcomes, a data scientist will have a number of algorithms up their proverbial sleeves ready to tackle whatever the business throws at them. Machine learning beyond data science The machine learning revolution might have started in data science, but it has rapidly expanded far beyond that strict discipline. 
Indeed, one of the reasons that some people are confused about the relationship between the two concepts is that machine learning is today touching just about everything, like water spilling out of its neat data science container.

Machine learning is for everyone

Machine learning is being used in everything from mobile apps to cybersecurity. And although data scientists might sometimes play a part in these domains, we’re also seeing subject-specific developers and engineers taking more responsibility for how machine learning is used. One of the reasons for this is, as I mentioned earlier, the fact that a data scientist - or even a couple of them - can’t do all the things that a business might want them to when it comes to machine learning. But another is the fact that machine learning is getting easier. You no longer need to be an expert to employ machine learning algorithms - instead, you need to have the confidence and foundational knowledge to use existing machine learning tools and products. This ‘productization’ of machine learning is arguably what’s having the biggest impact on how we understand the topic. It’s even shrinking data science, making it a more specific role. That might sound like data science is less important today than it was in 2014, but it can only be a good thing for data scientists - it means they are no longer being asked to spread themselves so thinly.

So, if you've been googling 'data science v machine learning', you now know the answer. The two terms are distinct but they both come out of the 'big data revolution' which we're still living through. Both trends and terms are likely to evolve in the future, but they're certainly not going to disappear - as the data at our disposal grows, making effective use of it is only going to become more important.


Sugandha Lahoti

18 Apr 2018

8 min read

Performing Vehicle Telemetry job analysis with Azure Stream Analytics tools


This tutorial is a step-by-step blueprint for a Vehicle Telemetry job analysis on Azure using Stream Analytics tools for Visual Studio.

For connected car and real-time predictive Vehicle Telemetry Analysis, there is a need to identify opportunities for new solutions. These opportunities include:

- How a car could be shipped globally with the smart hardware required to connect to the internet within the next few years
- How the embedded connections could define Vehicle Telemetry predictive health status, so that automotive companies can collect data on the performance of cars
- How to send interactive updates and patches to a car's instrumentation remotely
- How to avoid car equipment damage through precautionary measures with prior notification

All of these require an intelligent vehicle health telemetry analysis, which you can implement using Azure Stream Analytics.

Stream Analytics tools for Visual Studio

The Stream Analytics tools for Visual Studio help you prepare, build, and deploy real-time event processing on Azure. Optionally, the tools let you test the streaming job using local sample data as job input, and they provide real-time monitoring, job metrics, a diagram view, and so on. The tools provide a complete development setup for implementing and deploying real-world Azure Stream Analytics jobs using Visual Studio.

Developing a Stream Analytics job using Visual Studio

After installing the Stream Analytics tools, you can create a new Stream Analytics job in Visual Studio.

1. In the Visual Studio IDE, go to File | New Project. Under Templates, select Stream Analytics and choose Azure Stream Analytics Application.
2. Next, provide the job name, project, and solution location. Under the Solution menu, you may also select options such as Add to solution or Create new instance, apart from Create new solution, from the drop-down menu during job creation.
3. Once the ASA job is created, the job topology folder structure can be viewed in Solution Explorer: Inputs (job input), Outputs (job output), JobConfig.json, Script.asaql (the Stream Analytics query file), Azure Functions (optional), and so on.
4. Next, provide the job topology data input and output event source settings by selecting Input.json and Output.json from the Inputs and Outputs directories, respectively.
5. For a Vehicle Telemetry Predictive Analysis demo using an Azure Stream Analytics job, we need two different job data streams. One should be of the Stream type, for an unbounded sequence of real-time events processed through Azure Event Hub, configured with the Hub policy name, policy key, event serialization format, and so on.

Defining a Stream Analytics query for Vehicle Telemetry job analysis using Stream Analytics tools

To assign the streaming analytics query definition, select the Script.asaql file from the ASA project and specify the join between the data stream and reference stream inputs, along with sending the analyzed output to Blob storage as configured in the job properties.

Query to define Vehicle Telemetry (Connected Car) engine health status and pollution index over cities
This demo uses the solution architecture of the Connected Car-Vehicle Telemetry Analysis case study, with Azure Stream Analytics for real-time predictive analysis.

Testing Stream Analytics queries locally or in the cloud

Azure Stream Analytics tools in Visual Studio offer the flexibility to execute queries either locally or directly in the cloud.

1. In the Script.asaql file, provide the query for your streaming job and test it against local input stream/reference data before processing in Azure.
2. To run the Stream Analytics job query locally, right-click the ASA project in VS Solution Explorer and choose Add Local Input.
3. Define the local input for each Event Hub data stream and Blob storage data, and execute the job query locally before publishing it to Azure.
4. After adding each local input test data set, you can test the Stream Analytics job query locally in the VS editor by clicking the Run Locally button in the top-left corner of the VS IDE.

The connected car scenario covers use cases such as:

- Vehicle diagnostic
- Usage-based insurance
- Engine emission control
- Engine performance remapping
- Eco-driving
- Roadside assistance call
- Fleet management

So, specify the following schema when designing a connected car streaming job query with Stream Analytics, using parameters such as vehicle index number, model, outside temperature, engine speed, fuel meter, tire pressure, and brake status, and defining an INNER join between the Event Hub data stream and the Blob storage reference stream containing vehicle model information:

    SELECT input.vin, BlobSource.Model, input.timestamp, input.outsideTemperature,
           input.engineTemperature, input.speed, input.fuel, input.engineoil,
           input.tirepressure, input.odometer, input.city,
           input.accelerator_pedal_position, input.parking_brake_status,
           input.headlamp_status, input.brake_pedal_status,
           input.transmission_gear_position, input.ignition_status,
           input.windshield_wiper_status, input.abs
    INTO output
    FROM input
    JOIN BlobSource ON input.vin = BlobSource.VIN

The query can be further customized for complex event processing by using windowing concepts such as the Tumbling window function, which assigns the events in a stream to a series of equal-length, non-overlapping, fixed time slices. The following Vehicle Telemetry analytics query computes a smart car health index over a two-second interval, in the form of a fixed-length series of events:

    SELECT BlobSource.Model, input.city, count(vin) AS cars,
           avg(input.engineTemperature) AS engineTemperature,
           avg(input.speed) AS Speed, avg(input.fuel) AS Fuel,
           avg(input.engineoil) AS EngineOil,
           avg(input.tirepressure) AS TirePressure,
           avg(input.odometer) AS Odometer
    INTO EventHubOut
    FROM input
    JOIN BlobSource ON input.vin = BlobSource.VIN
    GROUP BY BlobSource.model, input.city, TumblingWindow(second,2)

5. The query can be executed locally or submitted to Azure. While the job runs locally, a Command Prompt appears showing the local Stream Analytics job's running status and the output data folder location.
6. If run locally, the job output folder, named with the current date timestamp, is created in the project disk location within the ASALocalRun directory. It contains two output files, in .csv and .json formats respectively.

If you submit the job to Azure from the Stream Analytics project in Visual Studio, you get a job dashboard with an interactive job diagram view, a job metrics graph, and errors (if any).
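The join and tumbling-window aggregation in the second query above can be pictured with a small Python sketch. This is only an illustration of the idea, not how Azure Stream Analytics is implemented: the `VIN_TO_MODEL` lookup, the event dictionaries, and the function name are hypothetical stand-ins for the BlobSource reference stream and the Event Hub input, only speed is averaged, and window boundary handling is simplified and may differ from the service.

```python
from collections import defaultdict

WINDOW_SECONDS = 2  # mirrors TumblingWindow(second,2) in the query

# Hypothetical reference data standing in for the BlobSource stream (VIN -> model).
VIN_TO_MODEL = {"VIN1": "Sedan", "VIN2": "SUV"}

def tumbling_window_averages(events, window=WINDOW_SECONDS):
    """Group events into fixed, non-overlapping windows and average speed.

    Each event is a dict with 'vin', 'city', 'timestamp' (seconds) and 'speed'.
    Events whose VIN has no reference row are dropped, as an inner join would.
    """
    buckets = defaultdict(list)
    for event in events:
        model = VIN_TO_MODEL.get(event["vin"])
        if model is None:
            continue  # no matching reference row: the inner join discards it
        # Every timestamp falls into exactly one window of fixed length.
        window_start = int(event["timestamp"] // window) * window
        buckets[(window_start, model, event["city"])].append(event["speed"])
    # 'cars' counts events per group, mirroring count(vin) in the query.
    return {
        key: {"cars": len(speeds), "avg_speed": sum(speeds) / len(speeds)}
        for key, speeds in sorted(buckets.items())
    }

events = [
    {"vin": "VIN1", "city": "Seattle", "timestamp": 0.5, "speed": 60},
    {"vin": "VIN1", "city": "Seattle", "timestamp": 1.5, "speed": 80},
    {"vin": "VIN2", "city": "Seattle", "timestamp": 2.1, "speed": 50},
    {"vin": "VIN9", "city": "Seattle", "timestamp": 2.2, "speed": 99},
]
result = tumbling_window_averages(events)
```

Because the windows do not overlap, each event contributes to exactly one output row, which is what distinguishes a tumbling window from hopping or sliding windows.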
The Vehicle Telemetry Predictive Health Analytics job dashboard in Visual Studio provides a nice job diagram with Real-Time Insights of events, with a display refreshed at a minimum rate of every 30 minutes: The Stream Analytics job metrics graph provides interactive insights on input and output events, out of order events, late events, runtime errors, and data conversion errors related to the job as appropriate: For Connected Car-Predictive Vehicle Telemetry Analytics, you may configure the data input streams processed with complex events by using a definite timestamp interval in a non-overlapping mode such as Tumbling window over a two-second time slicer. The output sink should be configured as Service Bus Event Hub in a data partitioning unit of 32, with a maximum message retention period of 7 days. The output job sink processed events in Event Hub could be archived as well in Azure blob storage for a long-term infrequent access perspective: The Azure Service Bus, Event Hub job output metrics dashboard view configured for vehicle telemetry analysis is as follows: On the left side of the job dashboard, the Job Summary provides a comprehensive view controller of the job parameters such as job status, creation time, job output start time, start mode, last output timestamp, output error handling mechanism provided for quick reference logs, late event arrival tolerance windows, and so on. The job can be stopped and started, deleted, or even refreshed by selecting icons from the top left menu of the job view dashboard in VS: Optionally, a Stream Analytics complete project clone can also be generated by clicking on the Generate Project icon from the top menu of the job dashboard. This article is an excerpt from the book, Stream Analytics with Microsoft Azure, written by Anindita Basak, Krishna Venkataraman, Ryan Murphy, and Manpreet Singh. This book provides lessons on Real-time data processing for quick insights using Azure Stream Analytics. 


Packt

02 Sep 2013

6 min read

OAuth Authentication


Understanding OAuth

OAuth has the concept of Providers and Clients. An OAuth Provider is like a SAML Identity Provider: it is the place where the user enters their authentication credentials. Typical OAuth Providers include Facebook and Google. OAuth Clients are applications that want to protect resources, much like a SAML Service Provider. If you have ever been to a site that asked you to log in using your Twitter or LinkedIn credentials, then odds are that site was using OAuth. The advantage of OAuth is that a user’s authentication credentials (username and password, for instance) are never passed to the OAuth Client; the Client only receives a range of tokens that it requested from the Provider and that the user has authorized.

OpenAM can act as both an OAuth Provider and an OAuth Client. This chapter will focus on using OpenAM as an OAuth Client and using Facebook as an OAuth Provider.

Preparing Facebook as an OAuth Provider

Head to https://developers.facebook.com/apps/ and create a Facebook App. Once this is created, your Facebook App will have an App ID and an App Secret. We’ll use these later on when configuring OpenAM. Facebook won’t allow a redirect to a URL (such as our OpenAM installation) unless it has been made aware of that URL. The steps for preparing Facebook as an OAuth provider are as follows:

1. Under the settings for the App, in the section Website with Facebook Login, add a Site URL. This is a special OpenAM OAuth Proxy URL, which for me was http://openam.kenning.co.nz:8080/openam/oauth2c/OAuthProxy.jsp, as shown in the following screenshot:
2. Click on the Save Changes button on Facebook.

My OpenAM installation for this chapter was directly available on the Internet, just in case Facebook checked for a valid URL destination.
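To see how the App ID and the registered Site URL come together, here is a minimal Python sketch of the kind of authorization request an OAuth 2.0 Client sends the user's browser to. It is not OpenAM's implementation: the endpoint `https://provider.example/oauth/authorize`, the `YOUR_APP_ID` value, the `state` value, and the function name are all placeholders, and the scope values are only illustrative. The parameter names follow the OAuth 2.0 authorization code flow.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_authorization_url(auth_endpoint, client_id, redirect_uri, scopes, state):
    """Build the URL that sends the user to the Provider to log in and consent."""
    params = {
        "response_type": "code",       # ask for an authorization code
        "client_id": client_id,        # the App ID issued by the Provider
        "redirect_uri": redirect_uri,  # must match the URL registered with the Provider
        "scope": " ".join(scopes),     # permissions requested on the user's behalf
        "state": state,                # opaque value echoed back to guard against CSRF
    }
    return f"{auth_endpoint}?{urlencode(params)}"

url = build_authorization_url(
    auth_endpoint="https://provider.example/oauth/authorize",  # hypothetical endpoint
    client_id="YOUR_APP_ID",                                    # placeholder
    redirect_uri="http://openam.kenning.co.nz:8080/openam/oauth2c/OAuthProxy.jsp",
    scopes=["email", "read_stream"],
    state="random-opaque-value",                                # placeholder
)
query = parse_qs(urlsplit(url).query)
```

The user's password never appears in this request; the Provider only returns a code or token to the redirect URL, which is why that URL has to be registered in advance.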
Configuring an OAuth authentication module

OpenAM has the concept of authentication modules, which support different ways of authenticating, such as OAuth, its Data Store, LDAP, or a Web Service. We need to create a new Module Instance for our Facebook OAuth Client.

Log in to the OpenAM console. Click on the Access Control tab, and click on the link to the realm / (Top Level Realm).

Click on the Authentication tab and scroll down to the Module Instances section. Click on the New button.

Enter a name for the New Module Instance, select OAuth 2.0 as the Type, and click on the OK button. I used the name Facebook. You will then see a screen as shown:

For Client Id, use the App ID value provided by Facebook. For the Client Secret, use the App Secret value provided by Facebook, as shown in the preceding screenshot.

Since we’re using Facebook as our OAuth Provider, we can leave the Authentication Endpoint URL, Access Token Endpoint URL, and User Profile Service URL values as their defaults.

Scope defines the permissions we’re requesting from the OAuth Provider on behalf of the user. These values will be provided by the OAuth Provider, but we’ll use the default values of email and read_stream, as shown in the preceding screenshot.

Proxy URL is the URL we copied to Facebook as the Site URL. This needs to be replaced with your OpenAM installation value.

The Account Mapper Configuration allows you to map values from your OAuth Provider to values that OpenAM recognizes. For instance, Facebook calls emails email, while OpenAM references values from the directory it is connected to, such as mail in the case of the embedded LDAP server. The same goes for the Attribute Mapper Configuration. We’ll leave all these sections as their defaults, as shown in the preceding screenshot.

OpenAM allows attributes passed from the OAuth Provider to be saved to the OpenAM session. We’ll make sure this option is Enabled, as shown in the preceding screenshot.
When a user authenticates against an OAuth Provider, they are likely not to have an account with OpenAM already. If they do not have a valid OpenAM account, they will not be allowed access to resources protected by OpenAM. We should make sure that the option to Create account if it does not exist is Enabled, as shown in the preceding screenshot.

Forcing authentication against particular authentication modules

While writing this book, I disabled the Create account if it does not exist option during testing. When I then tried to log in to OpenAM, I was redirected to Facebook, which passed my credentials to OpenAM. Since there was no valid OpenAM account that matched my Facebook credentials, I could not log in. Thankfully, you can force a login using a particular authentication module by adjusting the login URL. By using http://openam.kenning.co.nz:8080/openam/UI/Login?module=DataStore, I was able to use the Data Store rather than the OAuth authentication module, and log in successfully. For your own testing, it is advisable to use http://openam.kenning.co.nz:8080/openam/UI/Login?module=Facebook rather than changing your authentication chain.

For our newly created accounts, we can choose to prompt the user to create a password and enter an activation code. For our prototype we’ll leave this option as Disabled.

The flip side of Single Sign On is Single Log Out. Your OAuth Provider should provide a logout URL, which we could call to log out a user when they log out of OpenAM. The options we have when a user logs out of OpenAM are to not log them out of the OAuth Provider, to log them out of the OAuth Provider, or to ask the user.

If we had set earlier that we wanted to enforce password and activation token policies, we would need to enter details of an SMTP server, which would be used to email the activation token to the user. For the purposes of our prototype we’ll leave all these options blank.

Click on the Save button.
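The module-forcing trick described above only changes the query string of the login URL. A small Python helper makes the pattern explicit; the helper name is hypothetical, while the /UI/Login path and the module parameter are taken from the URLs quoted in this article.

```python
from urllib.parse import urlencode

def login_url(base_url, module=None):
    """Return the OpenAM login URL, optionally forcing one authentication module."""
    url = base_url.rstrip("/") + "/UI/Login"
    if module:
        # The 'module' query parameter selects a specific module instance by name.
        url += "?" + urlencode({"module": module})
    return url

base = "http://openam.kenning.co.nz:8080/openam"
datastore_login = login_url(base, "DataStore")  # bypasses the OAuth module
facebook_login = login_url(base, "Facebook")    # exercises the new module directly
default_login = login_url(base)                 # uses the configured authentication chain
```

Keeping a Data Store login URL to hand is a useful safety net while experimenting, because it still works when the OAuth module is misconfigured.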
Summary

This article served as a quick primer on what OAuth is and how to set it up with OpenAM. It covered preparing Facebook as an OAuth Provider and configuring an OAuth authentication module, with OpenAM acting as the OAuth Client. This approach is useful whenever we want to allow authentication against Facebook or Google.


Richard Gall

18 Dec 2018

11 min read

Quantum computing, edge analytics, and meta learning: key trends in data science and big data in 2019



Packt

20 Feb 2018

17 min read

What makes Hadoop so revolutionary?


In this article, Sourav Gulati and Sumit Kumar, authors of the book Apache Spark 2.x for Java Developers, explain that Hadoop in the classical sense comprises two components: a storage layer called HDFS and a processing layer called MapReduce. Prior to Hadoop 2.x, resource management was handled by the MapReduce framework of Hadoop itself; that changed with the introduction of YARN. In Hadoop 2.0, YARN was introduced as the third component of Hadoop to manage the resources of the Hadoop cluster and make it less dependent on MapReduce.

HDFS

The Hadoop Distributed File System, as the name suggests, is a distributed file system written in Java and modeled on the Google File System. In practice, HDFS closely resembles any other UNIX file system, with support for common file operations such as ls, cp, rm, du, cat, and so on. What makes HDFS stand out, despite its simplicity, is its mechanism for handling node failure in a Hadoop cluster without effectively changing the seek time for accessing stored files. An HDFS cluster consists of two major components: Data Nodes and the Name Node.

HDFS has a distinctive way of storing data on HDFS clusters (networks of cheap commodity computers). It splits a regular file into smaller chunks called blocks and then makes a fixed number of copies of each chunk, depending on the replication factor for that file. It then copies those chunks to different Data Nodes of the cluster.

Name Node

The Name Node is responsible for managing the metadata of the HDFS cluster, such as the list of files and folders that exist in the cluster, the number of splits each file is divided into, and their replication and storage on different Data Nodes. It also maintains and manages the namespace and file permissions of all the files available in the HDFS cluster.
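The block splitting and the metadata kept by the Name Node can be illustrated with a toy Python sketch. It is a simplified model, not HDFS code: the function name and node names are hypothetical, and the round-robin placement is an assumption made for brevity, whereas real HDFS placement also takes rack topology into account.

```python
import math

def plan_blocks(file_size_mb, block_size_mb, replication, data_nodes):
    """Return Name Node style metadata: block id -> Data Nodes holding a replica."""
    if replication > len(data_nodes):
        raise ValueError("replication factor exceeds the number of Data Nodes")
    # A file is cut into fixed-size blocks; the last block may be smaller.
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    placement = {}
    for block_id in range(num_blocks):
        # Round-robin placement (an assumption): each replica of a block
        # lands on a different Data Node.
        placement[block_id] = [
            data_nodes[(block_id + r) % len(data_nodes)] for r in range(replication)
        ]
    return placement

# Example values: a 300 MB file, 128 MB blocks, replication factor 3.
metadata = plan_blocks(300, 128, 3, ["dn1", "dn2", "dn3", "dn4"])
```

With this mapping, losing any single Data Node still leaves two replicas of every block, which is what allows the missing copies to be re-created elsewhere.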
Apart from bookkeeping, the Name Node also has a supervisory role: it keeps watch on the replication factor of all files and, if a block goes missing, issues commands to replicate the missing block of data. It also generates reports to ascertain cluster health. It is important to note that all communication for supervisory tasks is initiated from Data Node to Name Node; that is, a Data Node sends reports (block reports) to the Name Node, and the Name Node then responds by issuing commands or instructions as needed.

HDFS I/O

An HDFS read operation from a client involves:

1. The client asks the Name Node where the actual data blocks are stored for a given file.
2. The Name Node obliges by providing the block IDs and the locations of the hosts (Data Nodes) where the data can be found.
3. The client contacts the Data Nodes with the respective block IDs and fetches the data, preserving the order of the blocks.

An HDFS write operation from a client involves:

1. The client contacts the Name Node to update the namespace with the file name and verify the necessary permissions.
2. If the file already exists, the Name Node throws an error; otherwise it returns to the client an FSDataOutputStream, which points to the data queue.
3. The data queue negotiates with the Name Node to allocate new blocks on suitable Data Nodes.
4. The data is copied to that Data Node and, as per the replication strategy, copied further from that Data Node to the rest of the Data Nodes.

It’s important to note that data never passes through the Name Node, as that would cause a performance bottleneck.

YARN

The simplest way to understand YARN (Yet Another Resource Negotiator) is to think of it as an operating system for a cluster: provisioning resources, scheduling jobs, and maintaining nodes. With Hadoop 2.x, the MapReduce model, which had both processed the data and managed the cluster (job tracker/task tracker), was split in two.
While data processing was still left to MapReduce, the cluster’s resource allocation (or rather, scheduling) task was assigned to a new component called YARN. Another objective that YARN met was to make MapReduce one of several techniques for processing data, rather than the only technology for processing data on HDFS, as was the case in Hadoop 1.x systems. This paradigm shift opened the floodgates for the development of interesting applications around Hadoop, and a new ecosystem evolved that went beyond the classical MapReduce processing system. It didn’t take long after that for Apache Spark to break the hegemony of classical MapReduce and become arguably the most popular processing framework for parallel computing as far as active development and adoption are concerned. To provide multi-tenancy, fault tolerance, and resource isolation, YARN introduced the following components to manage the cluster seamlessly.

ResourceManager: It negotiates resources for different compute programs on a Hadoop cluster while guaranteeing the following: resource isolation, data locality, fault tolerance, task prioritization, and effective cluster capacity utilization. A configurable scheduler gives the ResourceManager the flexibility to schedule and prioritize different applications as needed.

Tasks served by RM while serving clients: Using a client or APIs, a user can submit or terminate an application. The user can also gather statistics on submitted applications, the cluster, and queue information. The RM also gives ADMIN tasks priority over any other task, in order to perform clean-up or maintenance activities on a cluster, such as refreshing the node list or the queue configuration.

Tasks served by RM while serving Cluster Nodes: Provisioning and de-provisioning of nodes forms an important task of the RM. Each node sends heartbeats at a configured interval; a node whose heartbeat is not received within the expiry period (10 minutes by default) is treated as a dead node.
As a clean-up activity, all the processes supposedly running on that node, including containers, are marked dead too.

Tasks served by the RM while serving ApplicationMasters: The RM registers new ApplicationMasters (AMs) and deregisters those that have finished successfully. Just as with cluster nodes, if the heartbeat of an AM is not received within a preconfigured duration, 10 minutes by default, the AM is marked dead and all its containers are marked dead too. Because YARN aims to be reliable as far as application execution is concerned, a new AM is then scheduled in a new container to attempt the execution again, until the configurable maximum number of attempts is reached (yarn.resourcemanager.am.max-attempts, which defaults to 2; the limit of 4 often quoted is the default for individual map and reduce task attempts).

Scheduling and other miscellaneous tasks served by the RM: The RM maintains a list of running, submitted and executed applications along with statistics such as execution time and status. The privileges of users and applications are maintained and checked when serving the various user requests made during an application's life cycle. The RM scheduler oversees resource allocation for applications, such as memory allocation. Two common schedulers used in YARN are the Fair Scheduler and the Capacity Scheduler.

NodeManager: A NodeManager (NM) runs on each node of the cluster, in much the same way as slave nodes do in a master-slave architecture. When an NM starts, it informs the RM that it is available to share its resources for upcoming jobs. From then on, the NM sends a periodic signal, also called a heartbeat, to the RM to report that it is alive in the cluster. The NM is primarily responsible for launching the containers requested by an AM with specific resource requirements such as memory, disk and so on. Once the containers are up and running, the NM watches not the status of the containers' tasks but their resource utilization, and kills a container if it starts using more resources than it was provisioned with.
Apart from managing the life cycle of containers, the NM also keeps the RM informed about the node's health.

ApplicationMaster: An AM is launched for each submitted application and manages that application's life cycle. The first and foremost task of the AM is to negotiate resources from the RM in order to launch task-specific containers on different nodes. Once the containers are launched, the AM keeps track of the status of every container's task. If a node goes down, or a container is killed for using excess resources or for any other reason, the AM renegotiates resources from the RM and launches the pending tasks again. The AM also reports the status of the submitted application directly to the user, and other statistics to the RM. The ApplicationMaster implementation is framework specific; for this reason, application- or framework-specific code is transferred to the AM, and it is the AM that distributes it further. This important feature also makes YARN technology agnostic, as any framework can implement its own ApplicationMaster and then use the resources of a YARN cluster seamlessly.

Container: In an abstract sense, a container is a set of minimal resources such as CPU, RAM, disk I/O and disk space that are required to run a task independently on a node. The first container after a job is submitted is launched by the RM to host the ApplicationMaster. It is the AM that then negotiates resources from the RM in the form of containers, which are hosted on different nodes across the Hadoop cluster.

Process flow of application submission in YARN:

Step 1: Using a client or the APIs, the user submits the application, let's say a Spark job JAR. The ResourceManager, whose primary task is to gather and report on all the applications running on the Hadoop cluster and the resources available on the respective nodes, accepts the newly submitted task depending on the privileges of the user submitting the job.

Step 2: After this, the RM delegates the task to the scheduler.
The scheduler then searches for a container that can host the application-specific ApplicationMaster. While the scheduler does take into consideration parameters such as availability of resources, task priority and data locality before scheduling or launching an ApplicationMaster, it has no role in monitoring or restarting a failed job. It is the responsibility of the RM to keep track of each AM and restart it in a new container if it fails.

Step 3: Once the ApplicationMaster is launched, it becomes the AM's prerogative to oversee the negotiation of resources with the RM for launching task-specific containers. Negotiations with the RM are typically over:

The priority of the tasks at hand.
The number of containers to be launched to complete the tasks.
The resources needed to execute the tasks, i.e. RAM and CPU (since Hadoop 3.x).
The available nodes where job containers can be launched with the required resources.

Depending on priority and availability of resources, the RM grants containers, each represented by a container ID and the hostname of the node on which it can be launched.

Step 4: The AM then requests the NMs of the respective hosts to launch the containers with the specific IDs and resource configuration. The NM launches the containers and keeps watch on the resource usage of the tasks. If, for example, a container starts using more resources than it was provisioned with, the NM kills it. This greatly strengthens the job isolation and fair sharing of resources that YARN guarantees, as the excess usage would otherwise affect the execution of other containers. However, it is important to note that the job status, and the application status as a whole, is managed by the AM. It falls to the AM to continuously monitor for delayed or dead containers while negotiating with the RM to launch new containers to take over the tasks of the dead ones.
Step 5: The containers executing on different nodes send application-specific statistics to the AM at specific intervals.

Step 6: The AM also reports the status of the application directly to the client that submitted it, in our case a Spark job.

Step 7: The NM monitors the resources being used by all the containers on its node and keeps sending periodic updates to the RM.

Step 8: The AM sends periodic statistics such as application status, task failures and log information to the RM.

Overview of MapReduce

Before delving deep into the MapReduce implementation in Hadoop, let's first understand MapReduce as a concept in parallel computing and why it is a preferred way of computing. MapReduce comprises two mutually exclusive but dependent phases, each capable of running on a different machine or node:

Map: The transformation of data takes place in the Map phase. It turns the data into key-value pairs by splitting it on a delimiter. Suppose we have a text file and we want to count the total number of words, or the frequency with which each word occurs in the file. This is the classical word count problem of MapReduce. To address it, we first have to identify the delimiter to split on, so that the data can be split and converted into key-value pairs. Let's begin with John Lennon's song Imagine.
Sample text:

Imagine there's no heaven
It's easy if you try
No hell below us
Above us only sky
Imagine all the people living for today

After running the Map phase on the sample text and splitting it on <space>, it is converted to key-value pairs as follows:

[<imagine, 1> <there's, 1> <no, 1> <heaven, 1> <it's, 1> <easy, 1> <if, 1> <you, 1> <try, 1> <no, 1> <hell, 1> <below, 1> <us, 1> <above, 1> <us, 1> <only, 1> <sky, 1> <imagine, 1> <all, 1> <the, 1> <people, 1> <living, 1> <for, 1> <today, 1>]

The key here represents the word and the value represents the count. Note that we have converted all the keys to lowercase to avoid any complexity arising from matching case-sensitive keys.

Reduce: The Reduce phase deals with aggregation of the Map phase result, so all the key-value pairs are aggregated by key. The Map output of the text is therefore aggregated as follows:

[<imagine, 2> <there's, 1> <no, 2> <heaven, 1> <it's, 1> <easy, 1> <if, 1> <you, 1> <try, 1> <hell, 1> <below, 1> <us, 2> <above, 1> <only, 1> <sky, 1> <all, 1> <the, 1> <people, 1> <living, 1> <for, 1> <today, 1>]

As we can see, the Map and Reduce phases can run independently of each other and hence can use separate nodes in the cluster to process the data. This separation of tasks into smaller units called Map and Reduce has revolutionized general-purpose distributed/parallel computing, and is what we now know as MapReduce.

Apache Hadoop's MapReduce is implemented in much the same way as discussed, with extra features governing how the data from the Map phase of each node is transferred to its designated Reduce phase node. Hadoop's implementation of MapReduce enriches the Map and Reduce phases by adding a few more concrete steps in between to make it fault tolerant and truly distributed. We can describe MR jobs on YARN in five stages.

Job submission stage: When a client submits an MR job, the following things happen:

The RM is requested for an application ID.
The input data location is checked and, if it is present, the input splits are computed.
The job's output location is checked as well; it must not already exist.

If all three conditions are met, the MR job JAR, along with its configuration and the details of the input splits, is copied to HDFS in a directory named after the application ID provided by the RM. The job is then submitted to the RM to launch a job-specific ApplicationMaster, MRAppMaster.

Map stage: Once the RM receives the client's request to launch MRAppMaster, a call is made to the YARN scheduler to assign a container. The container is granted as per resource availability, and MRAppMaster is launched on the designated node with the provisioned resources. MRAppMaster then fetches the input split information from the HDFS path submitted by the client and computes the number of Mapper tasks to launch based on the splits. It also determines the required number of Reducers as per the configuration. If MRAppMaster finds the number of Mappers and Reducers and the size of the input files small enough to run in the same JVM, it does so; such tasks are called uber tasks. Otherwise, MRAppMaster negotiates container resources from the RM for running these tasks, with Mapper tasks given higher priority. This is because Mapper tasks must finish before the sorting phase can start. Data locality is another concern for containers hosting Mappers: data-local nodes are preferred over rack-local ones, with the least preference given to data hosted on remote nodes. For the Reduce phase, no such data locality preference exists. Containers hosting a Mapper first copy the MapReduce JAR and configuration files locally and then launch a class called YarnChild in the JVM. The mapper then starts reading the input files, processes them into key-value pairs and writes them to a circular buffer.
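As an illustration of what a mapper produces, the word-count mapping from the Imagine example can be sketched in a few lines of Python. This is only a minimal sketch and not Hadoop's API: the names word_count_map, sample_text and map_output are hypothetical, and a real mapper would be written against Hadoop's Mapper class and would write into the in-memory buffer rather than a Python list.

```python
def word_count_map(line):
    """Yield a (word, 1) pair for every whitespace-separated token.

    Keys are lowercased so that "Imagine" and "imagine" count as one word.
    """
    for word in line.lower().split():
        yield (word, 1)


# The sample text, one input record per line (hypothetical input split).
sample_text = [
    "Imagine there's no heaven",
    "It's easy if you try",
    "No hell below us",
    "Above us only sky",
    "Imagine all the people living for today",
]

# Stand-in for the mapper's output buffer: every pair, in input order.
map_output = [pair for line in sample_text for pair in word_count_map(line)]

print(len(map_output), "pairs emitted")
print(map_output[:4])
```

Every value is 1 at this point; no counting happens in the Map phase. Aggregation is deferred to the combiner and the reducer.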
Shuffle and sort phase: Because the circular buffer has a size constraint, once it is filled to a certain percentage (80 by default), a thread is spawned that spills the data from the buffer. Before the spilled data is copied to disk, it is first partitioned by its Reducer; the background thread then sorts each partition by key and, if a combiner is specified, combines the data too. This optimizes the data before it is written to its respective partition. The process continues until all the data from the circular buffer has been written to disk. A background thread then checks whether the number of spill files in each partition is within the limit of a configurable parameter; if not, the files are merged and the combiner is run over them until the number falls within the limit.

The Map task keeps reporting its status to the ApplicationMaster throughout its life cycle. Only when 5 percent of the Map tasks have completed do the Reduce tasks start. An auxiliary service in the NodeManager serves the partitioned Map output files over a Netty web server, and each Reduce task asks MRAppMaster for the Mapper hosts holding the partition files meant for it. All the partition files that pertain to a Reducer are copied to its node in this fashion. Since multiple files are copied as the data for that Reducer is collected from various nodes, a background thread merges the sorted Map files, sorts them again and, if a combiner is configured, combines the result too.

Reduce stage: It is important to note that at this stage every input file of each Reducer must already be sorted by key. This is the assumption with which the Reducer starts processing the records and converts the key-value pairs into aggregated lists. Once the Reducer has processed the data, it writes the result to the output folder specified during job submission.
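The partition, sort, combine, merge and reduce steps described above can be simulated in memory to see how they fit together. The following is a rough sketch under stated assumptions, not Hadoop code: there are two hypothetical mappers and two hypothetical reducers, zlib.crc32 stands in for the partitioner's hash function only because it is deterministic, and lists take the place of spill files and network copies.

```python
import heapq
import zlib
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

NUM_REDUCERS = 2  # hypothetical reducer count for this sketch


def partition(key, num_reducers=NUM_REDUCERS):
    # Stand-in for a hash partitioner: a stable hash of the key,
    # modulo the number of reducers.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def combine(sorted_pairs):
    # Sum the values of adjacent equal keys; input must be sorted by key.
    return [(key, sum(value for _, value in group))
            for key, group in groupby(sorted_pairs, key=itemgetter(0))]


def spill(map_output, num_reducers=NUM_REDUCERS):
    # Map side: partition by reducer, sort each partition by key, combine.
    partitions = defaultdict(list)
    for key, value in map_output:
        partitions[partition(key, num_reducers)].append((key, value))
    return {p: combine(sorted(pairs, key=itemgetter(0)))
            for p, pairs in partitions.items()}


def reduce_partition(sorted_runs):
    # Reduce side: merge the already-sorted runs, then aggregate per key.
    merged = heapq.merge(*sorted_runs, key=itemgetter(0))
    return combine(merged)


# Two hypothetical mappers, each holding part of the sample text.
mapper_inputs = [
    "Imagine there's no heaven It's easy if you try No hell below us",
    "Above us only sky Imagine all the people living for today",
]
spills = [spill((word, 1) for word in text.lower().split())
          for text in mapper_inputs]

# Each reducer fetches its own partition from every mapper and reduces it.
word_counts = {}
for reducer_id in range(NUM_REDUCERS):
    runs = [s.get(reducer_id, []) for s in spills]
    word_counts.update(reduce_partition(runs))

print(sorted(word_counts.items()))
```

Because a key always hashes to the same partition, every occurrence of a word ends up at a single reducer, which is what allows the reducers to work independently. The reduce side can use a streaming merge only because each incoming run is already sorted by key.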
Clean-up stage: Each Reducer sends periodic updates to MRAppMaster about task completion. Once the Reduce tasks are over, the ApplicationMaster starts the clean-up activity. The status of the submitted job is changed from running to successful, and all temporary and intermediate files and folders are deleted. The application statistics are archived to the job history server.

Summary

In this article we saw what HDFS and YARN are, along with MapReduce, and we looked at the different functions of MapReduce and at HDFS I/O.
