Data | 0 articles | Tech News, Tutorials & Expert Insights

article-image-configuring-and-deploying-hbase-tutorial

02 Jul 2018

21 min read

Configuring and deploying HBase [Tutorial]

02 Jul 2018

0
0
8894

article-image-administration-rights-for-power-bi-users

Pravin Dhandre

02 Jul 2018

8 min read

Administration rights for Power BI users

Pravin Dhandre

02 Jul 2018

8 min read

In this tutorial, you will understand and learn administration rights/rules for Power BI users. This includes setting and monitoring rules like; who in the organization can utilize which feature, how Power BI Premium capacity is allocated and by whom, and other settings such as embed codes and custom visuals. This article is an excerpt from a book written by Brett Powell titled Mastering Microsoft Power BI. The admin portal is accessible to Office 365 Global Administrators and users mapped to the Power BI service administrator role. To open the admin portal, log in to the Power BI service and select the Admin portal item from the Settings (Gear icon) menu in the top right, as shown in the following screenshot: All Power BI users, including Power BI free users, are able to access the Admin portal. However, users who are not admins can only view the Capacity settings page. The Power BI service administrators and Office 365 global administrators have view and edit access to the following seven pages: Administrators of Power BI most commonly utilize the Tenant settings and Capacity settings as described in the Tenant Settings and Power BI Premium Capacities sections later in this tutorial. However, the admin portal can also be used to manage any approved custom visuals for the organization. Usage metrics The Usage metrics page of the Admin portal provides admins with a Power BI dashboard of several top metrics, such as the most consumed dashboards and the most consumed dashboards by workspace. However, the dashboard cannot be modified and the tiles of the dashboard are not linked to any underlying reports or separate dashboards to support further analysis. Given these limitations, alternative monitoring solutions are recommended, such as the Office 365 audit logs and usage metric datasets specific to Power BI apps. Details of both monitoring options are included in the app usage metrics and Power BI audit log activities sections later in this chapter. Users and Audit logs The Users and Audit logs pages only provide links to the Office 365 admin center. In the admin center, Power BI users can be added, removed and managed. If audit logging is enabled for the organization via the Create audit logs for internal activity and auditing and compliance tenant setting, this audit log data can be retrieved from the Office 365 Security & Compliance Center or via PowerShell. This setting is noted in the following section regarding the Tenant settings tab of the Power BI admin portal. An Office 365 license is not required to utilize the Office 365 admin center for Power BI license assignments or to retrieve Power BI audit log activity. Tenant settings The Tenant settings page of the Admin portal allows administrators to enable or disable various features of the Power BI web service. Likewise, the administrator could allow only a certain security group to embed Power BI content in SaaS applications such as SharePoint Online. The following diagram identifies the 18 tenant settings currently available in the admin portal and the scope available to administrators for configuring each setting: From a data security perspective, the first seven settings within the Export and Sharing and Content packs and apps groups are most important. For example, many organizations choose to disable the Publish to web feature for the entire organization. Additionally, only certain security groups may be allowed to export data or to print hard copies of reports and dashboards. As shown in the Scope column of the previous table and the following example, granular security group configurations are available to minimize risk and manage the overall deployment. Currently, only one tenant setting is available for custom visuals and this setting (Custom visuals settings) can be enabled or disabled for the entire organization only. For organizations that wish to restrict or prohibit custom visuals for security reasons, this setting can be used to eliminate the ability to add, view, share, or interact with custom visuals. More granular controls to this setting are expected later in 2018, such as the ability to define users or security groups of users who are allowed to use custom visuals. In the following screenshot from the Tenant settings page of the Admin portal, only the users within the BI Admin security group who are not also members of the BI Team security group are allowed to publish apps to the entire organization: For example, a report author who also helps administer the On-premises data gateway via the BI Admin security group would be denied the ability to publish apps to the organization given membership in the BI Team security group. Many of the tenant setting configurations will be more simple than this example, particularly for smaller organizations or at the beginning of Power BI deployments. However, as adoption grows and the team responsible for Power BI changes, it's important that the security groups created to help administer these settings are kept up to date. Embed Codes Embed Codes are created and stored in the Power BI service when the Publish to web feature is utilized. As described in the Publish to web section of the previous chapter, this feature allows a Power BI report to be embedded in any website or shared via URL on the public internet. Users with edit rights to the workspace of the published to web content are able to manage the embed codes themselves from within the workspace. However, the admin portal provides visibility and access to embed codes across all workspaces, as shown in the following screenshot: Via the Actions commands on the far right of the Embed Codes page, a Power BI Admin can view the report in a browser (diagonal arrow) or remove the embed code. The Embed Codes page can be helpful to periodically monitor the usage of the Publish to web feature and for scenarios in which data was included in a publish to web report that shouldn't have been, and thus needs to be removed. As shown in the Power BI Tenant settings table referenced in the previous section, this feature can be enabled or disabled for the entire organization or for specific users within security groups. Organizational Custom visuals The Custom Visuals page allows admins to upload and manage custom visuals (.pbiviz files) that have been approved for use within the organization. For example, an organization may have proprietary custom visuals developed internally, which it wishes to expose to business users. Alternatively, the organization may wish to define a set of approved custom visuals, such as only the custom visuals that have been certified by Microsoft. In the following screenshot, the Chiclet Slicer custom visual is added as an organizational custom visual from the Organizational visuals page of the Power BI admin portal: The Organizational visuals page provides a link (Add a custom visual) to launch the form and identifies all uploaded visuals, as well as their last update. Once a visual has been uploaded, it can be deleted but not updated or modified. Therefore, when a new version of an organizational visual becomes available, this visual can be added to the list of organizational visuals with a descriptive title (Chiclet Slicer v2.0). Deleting an organizational custom visual will cause any reports that use this visual to stop rendering. The following screenshot reflects the uploaded Chiclet Slicer custom visual on the Organization visuals page: Once the custom visual has been uploaded as an organizational custom visual, it will be accessible to users in Power BI Desktop. In the following screenshot from Power BI Desktop, the user has opened the MARKETPLACE of custom visuals and selected MY ORGANIZATION: In this screenshot, rather than searching through the MARKETPLACE, the user can go directly to visuals defined by the organization. The marketplace of custom visuals can be launched via either the Visualizations pane or the From Marketplace icon on the Home tab of the ribbon. Organizational custom visuals are not supported for reports or dashboards shared with external users. Additionally, organizational custom visuals used in reports that utilize the publish to web feature will not render outside the Power BI tenant. Moreover, Organizational custom visuals are currently a preview feature. Therefore, users must enable the My organization custom visuals feature via the Preview features tab of the Options window in Power BI Desktop. With this, we got you acquainted with features and processes applicable in administering Power BI for an organization. This includes the configuration of tenant settings in the Power BI admin portal, analyzing the usage of Power BI assets, and monitoring overall user activity via the Office 365 audit logs. If you found this tutorial useful, do check out the book Mastering Microsoft Power BI to develop visually rich, immersive, and interactive Power BI reports and dashboards. Unlocking the secrets of Microsoft Power BI A tale of two tools: Tableau and Power BI Building a Microsoft Power BI Data Model

0
0
7578

article-image-build-and-train-rnn-chatbot-using-tensorflow

Sunith Shetty

28 Jun 2018

21 min read

Build and train an RNN chatbot using TensorFlow [Tutorial]

Sunith Shetty

28 Jun 2018

21 min read

Chatbots are increasingly used as a way to provide assistance to users. Many companies, including banks, mobile/landline companies and large e-sellers now use chatbots for customer assistance and for helping users in pre and post sales queries. They are a great tool for companies which don't need to provide additional customer service capacity for trivial questions: they really look like a win-win situation! In today’s tutorial, we will understand how to train an automatic chatbot that will be able to answer simple and generic questions, and how to create an endpoint over HTTP for providing the answers via an API. This article is an excerpt from a book written by Luca Massaron, Alberto Boschetti, Alexey Grigorev, Abhishek Thakur, and Rajalingappaa Shanmugamani titled TensorFlow Deep Learning Projects. There are mainly two types of chatbot: the first is a simple one, which tries to understand the topic, always providing the same answer for all questions about the same topic. For example, on a train website, the questions Where can I find the timetable of the City_A to City_B service? and What's the next train departing from City_A? will likely get the same answer, that could read Hi! The timetable on our network is available on this page: <link>. This types of chatbots use classification algorithms to understand the topic (in the example, both questions are about the timetable topic). Given the topic, they always provide the same answer. Usually, they have a list of N topics and N answers; also, if the probability of the classified topic is low (the question is too vague, or it's on a topic not included in the list), they usually ask the user to be more specific and repeat the question, eventually pointing out other ways to do the question (send an email or call the customer service number, for example). The second type of chatbots is more advanced, smarter, but also more complex. For those, the answers are built using an RNN, in the same way, that machine translation is performed. Those chatbots are able to provide more personalized answers, and they may provide a more specific reply. In fact, they don't just guess the topic, but with an RNN engine, they're able to understand more about the user's questions and provide the best possible answer: in fact, it's very unlikely you'll get the same answers with two different questions using these types of chatbots. The input corpus Unfortunately, we haven't found any consumer-oriented dataset that is open source and freely available on the Internet. Therefore, we will train the chatbot with a more generic dataset, not really focused on customer service. Specifically, we will use the Cornell Movie Dialogs Corpus, from the Cornell University. The corpus contains the collection of conversations extracted from raw movie scripts, therefore the chatbot will be able to answer more to fictional questions than real ones. The Cornell corpus contains more than 200,000 conversational exchanges between 10+ thousands of movie characters, extracted from 617 movies. The dataset is available here: https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html. We would like to thank the authors for having released the corpus: that makes experimentation, reproducibility and knowledge sharing easier. The dataset comes as a .zip archive file. After decompressing it, you'll find several files in it: README.txt contains the description of the dataset, the format of the corpora files, the details on the collection procedure and the author's contact. Chameleons.pdf is the original paper for which the corpus has been released. Although the goal of the paper is strictly not around chatbots, it studies the language used in dialogues, and it's a good source of information to understanding more movie_conversations.txt contains all the dialogues structure. For each conversation, it includes the ID of the two characters involved in the discussion, the ID of the movie and the list of sentences IDs (or utterances, to be more precise) in chronological order. For example, the first line of the file is: u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197'] That means that user u0 had a conversation with user u2 in the movie m0 and the conversation had 4 utterances: 'L194', 'L195', 'L196' and 'L197' movie_lines.txt contains the actual text of each utterance ID and the person who produced it. For example, the utterance L195 is listed here as: L195 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ Well, I thought we'd start with pronunciation, if that's okay with you. So, the text of the utterance L195 is Well, I thought we'd start with pronunciation, if that's okay with you. And it was pronounced by the character u2 whose name is CAMERON in the movie m0. movie_titles_metadata.txt contains information about the movies, including the title, year, IMDB rating, the number of votes in IMDB and the genres. For example, the movie m0 here is described as: m0 +++$+++ 10 things i hate about you +++$+++ 1999 +++$+++ 6.90 +++$+++ 62847 +++$+++ ['comedy', 'romance'] So, the title of the movie whose ID is m0 is 10 things i hate about you, it's from 1999, it's a comedy with romance and it received almost 63 thousand votes on IMDB with an average score of 6.9 (over 10.0) movie_characters_metadata.txt contains information about the movie characters, including the name the title of the movie where he/she appears, the gender (if known) and the position in the credits (if known). For example, the character “u2” appears in this file with this description: u2 +++$+++ CAMERON +++$+++ m0 +++$+++ 10 things i hate about you +++$+++ m +++$+++ 3 The character u2 is named CAMERON, it appears in the movie m0 whose title is 10 things i hate about you, his gender is male and he's the third person appearing in the credits. raw_script_urls.txt contains the source URL where the dialogues of each movie can be retrieved. For example, for the movie m0 that's it: m0 +++$+++ 10 things i hate about you +++$+++ http://www.dailyscript.com/scripts/10Things.html As you will have noticed, most files use the token +++$+++ to separate the fields. Beyond that, the format looks pretty straightforward to parse. Please take particular care while parsing the files: their format is not UTF-8 but ISO-8859-1. Creating the training dataset Let's now create the training set for the chatbot. We'd need all the conversations between the characters in the correct order: fortunately, the corpora contains more than what we actually need. For creating the dataset, we will start by downloading the zip archive, if it's not already on disk. We'll then decompress the archive in a temporary folder (if you're using Windows, that should be C:Temp), and we will read just the movie_lines.txt and the movie_conversations.txt files, the ones we really need to create a dataset of consecutive utterances. Let's now go step by step, creating multiple functions, one for each step, in the file corpora_downloader.py. The first function we need is to retrieve the file from the Internet, if not available on disk. def download_and_decompress(url, storage_path, storage_dir): import os.path directory = storage_path + "/" + storage_dir zip_file = directory + ".zip" a_file = directory + "/cornell movie-dialogs corpus/README.txt" if not os.path.isfile(a_file): import urllib.request import zipfile urllib.request.urlretrieve(url, zip_file) with zipfile.ZipFile(zip_file, "r") as zfh: zfh.extractall(directory) return This function does exactly that: it checks whether the “README.txt” file is available locally; if not, it downloads the file (thanks for the urlretrieve function in the urllib.request module) and it decompresses the zip (using the zipfile module). The next step is to read the conversation file and extract the list of utterance IDS. As a reminder, its format is: u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197'], therefore what we're looking for is the fourth element of the list after we split it on the token +++$+++ . Also, we'd need to clean up the square brackets and the apostrophes to have a clean list of IDs. For doing that, we shall import the re module, and the function will look like this. import re def read_conversations(storage_path, storage_dir): filename = storage_path + "/" + storage_dir + "/cornell movie-dialogs corpus/movie_conversations.txt" with open(filename, "r", encoding="ISO-8859-1") as fh: conversations_chunks = [line.split(" +++$+++ ") for line in fh] return [re.sub('[[]']', '', el[3].strip()).split(", ") for el in conversations_chunks] As previously said, remember to read the file with the right encoding, otherwise, you'll get an error. The output of this function is a list of lists, each of them containing the sequence of utterance IDS in a conversation between characters. Next step is to read and parse the movie_lines.txt file, to extract the actual utterances texts. As a reminder, the file looks like this line: L195 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ Well, I thought we'd start with pronunciation, if that's okay with you. Here, what we're looking for are the first and the last chunks. def read_lines(storage_path, storage_dir): filename = storage_path + "/" + storage_dir + "/cornell movie-dialogs corpus/movie_lines.txt" with open(filename, "r", encoding="ISO-8859-1") as fh: lines_chunks = [line.split(" +++$+++ ") for line in fh] return {line[0]: line[-1].strip() for line in lines_chunks} The very last bit is about tokenization and alignment. We'd like to have a set whose observations have two sequential utterances. In this way, we will train the chatbot, given the first utterance, to provide the next one. Hopefully, this will lead to a smart chatbot, able to reply to multiple questions. Here's the function: def get_tokenized_sequencial_sentences(list_of_lines, line_text): for line in list_of_lines: for i in range(len(line) - 1): yield (line_text[line[i]].split(" "), line_text[line[i+1]].split(" ")) Its output is a generator containing a tuple of the two utterances (the one on the right follows temporally the one on the left). Also, utterances are tokenized on the space character. Finally, we can wrap up everything into a function, which downloads the file and unzip it (if not cached), parse the conversations and the lines, and format the dataset as a generator. As a default, we will store the files in the /tmp directory: def retrieve_cornell_corpora(storage_path="/tmp", storage_dir="cornell_movie_dialogs_corpus"): download_and_decompress("http://www.cs.cornell.edu/~cristian/data/cornell_movie_dialogs_corpus.zip", storage_path, storage_dir) conversations = read_conversations(storage_path, storage_dir) lines = read_lines(storage_path, storage_dir) return tuple(zip(*list(get_tokenized_sequencial_sentences(conversations, lines)))) At this point, our training set looks very similar to the training set used in the translation project. We can, therefore, use some pieces of code we've developed in the machine learning translation article. For example, the corpora_tools.py file can be used here without any change (also, it requires the data_utils.py). Given that file, we can dig more into the corpora, with a script to check the chatbot input. To inspect the corpora, we can use the corpora_tools.py, and the file we've previously created. Let's retrieve the Cornell Movie Dialog Corpus, format the corpora and print an example and its length: from corpora_tools import * from corpora_downloader import retrieve_cornell_corpora sen_l1, sen_l2 = retrieve_cornell_corpora() print("# Two consecutive sentences in a conversation") print("Q:", sen_l1[0]) print("A:", sen_l2[0]) print("# Corpora length (i.e. number of sentences)") print(len(sen_l1)) assert len(sen_l1) == len(sen_l2) This code prints an example of two tokenized consecutive utterances, and the number of examples in the dataset, that is more than 220,000: # Two consecutive sentences in a conversation Q: ['Can', 'we', 'make', 'this', 'quick?', '', 'Roxanne', 'Korrine', 'and', 'Andrew', 'Barrett', 'are', 'having', 'an', 'incredibly', 'horrendous', 'public', 'break-', 'up', 'on', 'the', 'quad.', '', 'Again.'] A: ['Well,', 'I', 'thought', "we'd", 'start', 'with', 'pronunciation,', 'if', "that's", 'okay', 'with', 'you.'] # Corpora length (i.e. number of sentences) 221616 Let's now clean the punctuation in the sentences, lowercase them and limits their size to 20 words maximum (that is examples where at least one of the sentences is longer than 20 words are discarded). This is needed to standardize the tokens: clean_sen_l1 = [clean_sentence(s) for s in sen_l1] clean_sen_l2 = [clean_sentence(s) for s in sen_l2] filt_clean_sen_l1, filt_clean_sen_l2 = filter_sentence_length(clean_sen_l1, clean_sen_l2) print("# Filtered Corpora length (i.e. number of sentences)") print(len(filt_clean_sen_l1)) assert len(filt_clean_sen_l1) == len(filt_clean_sen_l2) This leads us to almost 140,000 examples: # Filtered Corpora length (i.e. number of sentences) 140261 Then, let's create the dictionaries for the two sets of sentences. Practically, they should look the same (since the same sentence appears once on the left side, and once in the right side) except there might be some changes introduced by the first and last sentences of a conversation (they appear only once). To make the best out of our corpora, let's build two dictionaries of words and then encode all the words in the corpora with their dictionary indexes: dict_l1 = create_indexed_dictionary(filt_clean_sen_l1, dict_size=15000, storage_path="/tmp/l1_dict.p") dict_l2 = create_indexed_dictionary(filt_clean_sen_l2, dict_size=15000, storage_path="/tmp/l2_dict.p") idx_sentences_l1 = sentences_to_indexes(filt_clean_sen_l1, dict_l1) idx_sentences_l2 = sentences_to_indexes(filt_clean_sen_l2, dict_l2) print("# Same sentences as before, with their dictionary ID") print("Q:", list(zip(filt_clean_sen_l1[0], idx_sentences_l1[0]))) print("A:", list(zip(filt_clean_sen_l2[0], idx_sentences_l2[0]))) That prints the following output. We also notice that a dictionary of 15 thousand entries doesn't contain all the words and more than 16 thousand (less popular) of them don't fit into it: [sentences_to_indexes] Did not find 16823 words [sentences_to_indexes] Did not find 16649 words # Same sentences as before, with their dictionary ID Q: [('well', 68), (',', 8), ('i', 9), ('thought', 141), ('we', 23), ("'", 5), ('d', 83), ('start', 370), ('with', 46), ('pronunciation', 3), (',', 8), ('if', 78), ('that', 18), ("'", 5), ('s', 12), ('okay', 92), ('with', 46), ('you', 7), ('.', 4)] A: [('not', 31), ('the', 10), ('hacking', 7309), ('and', 23), ('gagging', 8761), ('and', 23), ('spitting', 6354), ('part', 437), ('.', 4), ('please', 145), ('.', 4)] As the final step, let's add paddings and markings to the sentences: data_set = prepare_sentences(idx_sentences_l1, idx_sentences_l2, max_length_l1, max_length_l2) print("# Prepared minibatch with paddings and extra stuff") print("Q:", data_set[0][0]) print("A:", data_set[0][1]) print("# The sentence pass from X to Y tokens") print("Q:", len(idx_sentences_l1[0]), "->", len(data_set[0][0])) print("A:", len(idx_sentences_l2[0]), "->", len(data_set[0][1])) And that, as expected, prints: # Prepared minibatch with paddings and extra stuff Q: [0, 68, 8, 9, 141, 23, 5, 83, 370, 46, 3, 8, 78, 18, 5, 12, 92, 46, 7, 4] A: [1, 31, 10, 7309, 23, 8761, 23, 6354, 437, 4, 145, 4, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0] # The sentence pass from X to Y tokens Q: 19 -> 20 A: 11 -> 22 Training the chatbot After we're done with the corpora, it's now time to work on the model. This project requires again a sequence to sequence model, therefore we can use an RNN. Even more, we can reuse part of the code from the previous project: we'd just need to change how the dataset is built, and the parameters of the model. We can then copy the training script, and modify the build_dataset function, to use the Cornell dataset. Mind that the dataset used in this article is bigger than the one used in the machine learning translation article, therefore you may need to limit the corpora to a few dozen thousand lines. On a 4 years old laptop with 8GB RAM, we had to select only the first 30 thousand lines, otherwise, the program ran out of memory and kept swapping. As a side effect of having fewer examples, even the dictionaries are smaller, resulting in less than 10 thousands words each. def build_dataset(use_stored_dictionary=False): sen_l1, sen_l2 = retrieve_cornell_corpora() clean_sen_l1 = [clean_sentence(s) for s in sen_l1][:30000] ### OTHERWISE IT DOES NOT RUN ON MY LAPTOP clean_sen_l2 = [clean_sentence(s) for s in sen_l2][:30000] ### OTHERWISE IT DOES NOT RUN ON MY LAPTOP filt_clean_sen_l1, filt_clean_sen_l2 = filter_sentence_length(clean_sen_l1, clean_sen_l2, max_len=10) if not use_stored_dictionary: dict_l1 = create_indexed_dictionary(filt_clean_sen_l1, dict_size=10000, storage_path=path_l1_dict) dict_l2 = create_indexed_dictionary(filt_clean_sen_l2, dict_size=10000, storage_path=path_l2_dict) else: dict_l1 = pickle.load(open(path_l1_dict, "rb")) dict_l2 = pickle.load(open(path_l2_dict, "rb")) dict_l1_length = len(dict_l1) dict_l2_length = len(dict_l2) idx_sentences_l1 = sentences_to_indexes(filt_clean_sen_l1, dict_l1) idx_sentences_l2 = sentences_to_indexes(filt_clean_sen_l2, dict_l2) max_length_l1 = extract_max_length(idx_sentences_l1) max_length_l2 = extract_max_length(idx_sentences_l2) data_set = prepare_sentences(idx_sentences_l1, idx_sentences_l2, max_length_l1, max_length_l2) return (filt_clean_sen_l1, filt_clean_sen_l2), data_set, (max_length_l1, max_length_l2), (dict_l1_length, dict_l2_length) By inserting this function into the train_translator.py file and rename the file as train_chatbot.py, we can run the training of the chatbot. After a few iterations, you can stop the program and you'll see something similar to this output: [sentences_to_indexes] Did not find 0 words [sentences_to_indexes] Did not find 0 words global step 100 learning rate 1.0 step-time 7.708967611789704 perplexity 444.90090078460474 eval: perplexity 57.442316329639176 global step 200 learning rate 0.990234375 step-time 7.700247814655302 perplexity 48.8545568311572 eval: perplexity 42.190180314697045 global step 300 learning rate 0.98046875 step-time 7.69800933599472 perplexity 41.620538109894945 eval: perplexity 31.291903031786116 ... ... ... global step 2400 learning rate 0.79833984375 step-time 7.686293318271639 perplexity 3.7086356605442767 eval: perplexity 2.8348589631663046 global step 2500 learning rate 0.79052734375 step-time 7.689657487869262 perplexity 3.211876894960698 eval: perplexity 2.973809378544393 global step 2600 learning rate 0.78271484375 step-time 7.690396382808681 perplexity 2.878854805600354 eval: perplexity 2.563583924617356 Again, if you change the settings, you may end up with a different perplexity. To obtain these results, we set the RNN size to 256 and 2 layers, the batch size of 128 samples, and the learning rate to 1.0. At this point, the chatbot is ready to be tested. Although you can test the chatbot with the same code as in the test_translator.py, here we would like to do a more elaborate solution, which allows exposing the chatbot as a service with APIs. Chatbox API First of all, we need a web framework to expose the API. In this project, we've chosen Bottle, a lightweight simple framework very easy to use. To install the package, run pip install bottle from the command line. To gather further information and dig into the code, take a look at the project webpage, https://bottlepy.org. Let's now create a function to parse an arbitrary sentence provided by the user as an argument. All the following code should live in the test_chatbot_aas.py file. Let's start with some imports and the function to clean, tokenize and prepare the sentence using the dictionary: import pickle import sys import numpy as np import tensorflow as tf import data_utils from corpora_tools import clean_sentence, sentences_to_indexes, prepare_sentences from train_chatbot import get_seq2seq_model, path_l1_dict, path_l2_dict model_dir = "/home/abc/chat/chatbot_model" def prepare_sentence(sentence, dict_l1, max_length): sents = [sentence.split(" ")] clean_sen_l1 = [clean_sentence(s) for s in sents] idx_sentences_l1 = sentences_to_indexes(clean_sen_l1, dict_l1) data_set = prepare_sentences(idx_sentences_l1, [[]], max_length, max_length) sentences = (clean_sen_l1, [[]]) return sentences, data_set The function prepare_sentence does the following: Tokenizes the input sentence Cleans it (lowercase and punctuation cleanup) Converts tokens to dictionary IDs Add markers and paddings to reach the default length Next, we will need a function to convert the predicted sequence of numbers to an actual sentence composed of words. This is done by the function decode, which runs the prediction given the input sentence and with softmax predicts the most likely output. Finally, it returns the sentence without paddings and markers: def decode(data_set): with tf.Session() as sess: model = get_seq2seq_model(sess, True, dict_lengths, max_sentence_lengths, model_dir) model.batch_size = 1 bucket = 0 encoder_inputs, decoder_inputs, target_weights = model.get_batch( {bucket: [(data_set[0][0], [])]}, bucket) _, _, output_logits = model.step(sess, encoder_inputs, decoder_inputs, target_weights, bucket, True) outputs = [int(np.argmax(logit, axis=1)) for logit in output_logits] if data_utils.EOS_ID in outputs: outputs = outputs[1:outputs.index(data_utils.EOS_ID)] tf.reset_default_graph() return " ".join([tf.compat.as_str(inv_dict_l2[output]) for output in outputs]) Finally, the main function, that is, the function to run in the script: if __name__ == "__main__": dict_l1 = pickle.load(open(path_l1_dict, "rb")) dict_l1_length = len(dict_l1) dict_l2 = pickle.load(open(path_l2_dict, "rb")) dict_l2_length = len(dict_l2) inv_dict_l2 = {v: k for k, v in dict_l2.items()} max_lengths = 10 dict_lengths = (dict_l1_length, dict_l2_length) max_sentence_lengths = (max_lengths, max_lengths) from bottle import route, run, request @route('/api') def api(): in_sentence = request.query.sentence _, data_set = prepare_sentence(in_sentence, dict_l1, max_lengths) resp = [{"in": in_sentence, "out": decode(data_set)}] return dict(data=resp) run(host='127.0.0.1', port=8080, reloader=True, debug=True) Initially, it loads the dictionary and prepares the inverse dictionary. Then, it uses the Bottle API to create an HTTP GET endpoint (under the /api URL). The route decorator sets and enriches the function to run when the endpoint is contacted via HTTP GET. In this case, the api() function is run, which first reads the sentence passed as HTTP parameter, then calls the prepare_sentence function, described above, and finally runs the decoding step. What's returned is a dictionary containing both the input sentence provided by the user and the reply of the chatbot. Finally, the webserver is turned on, on the localhost at port 8080. Isn't very easy to have a chatbot as a service with Bottle? It's now time to run it and check the outputs. To run it, run from the command line: $> python3 –u test_chatbot_aas.py Then, let's start querying the chatbot with some generic questions, to do so we can use CURL, a simple command line; also all the browsers are ok, just remember that the URL should be encoded, for example, the space character should be replaced with its encoding, that is, %20. Curl makes things easier, having a simple way to encode the URL request. Here are a couple of examples: $> curl -X GET -G http://127.0.0.1:8080/api --data-urlencode "sentence=how are you?" {"data": [{"out": "i ' m here with you .", "in": "where are you?"}]} $> curl -X GET -G http://127.0.0.1:8080/api --data-urlencode "sentence=are you here?" {"data": [{"out": "yes .", "in": "are you here?"}]} $> curl -X GET -G http://127.0.0.1:8080/api --data-urlencode "sentence=are you a chatbot?" {"data": [{"out": "you ' for the stuff to be right .", "in": "are you a chatbot?"}]} $> curl -X GET -G http://127.0.0.1:8080/api --data-urlencode "sentence=what is your name ?" {"data": [{"out": "we don ' t know .", "in": "what is your name ?"}]} $> curl -X GET -G http://127.0.0.1:8080/api --data-urlencode "sentence=how are you?" {"data": [{"out": "that ' s okay .", "in": "how are you?"}]} If the system doesn't work with your browser, try encoding the URL, for example: $> curl -X GET http://127.0.0.1:8080/api?sentence=how%20are%20you? {"data": [{"out": "that ' s okay .", "in": "how are you?"}]}. Replies are quite funny; always remember that we trained the chatbox on movies, therefore the type of replies follow that style. To turn off the webserver, use Ctrl + C. To summarize, we've learned to implement a chatbot, which is able to respond to questions through an HTTP endpoint and a GET API. To know more how to design deep learning systems for a variety of real-world scenarios using TensorFlow, do checkout this book TensorFlow Deep Learning Projects. Facebook’s Wit.ai: Why we need yet another chatbot development framework? How to build a chatbot with Microsoft Bot framework Top 4 chatbot development frameworks for developers

0
1
44455

article-image-interact-with-hbase-using-hbase-shell-tutorial

Amey Varangaonkar

25 Jun 2018

9 min read

How to interact with HBase using HBase shell [Tutorial]

Amey Varangaonkar

25 Jun 2018

9 min read

0
0
9711

article-image-use-tensorflow-and-nlp-to-detect-duplicate-quora-questions-tutorial

Sunith Shetty

21 Jun 2018

32 min read

Use TensorFlow and NLP to detect duplicate Quora questions [Tutorial]

Sunith Shetty

21 Jun 2018

32 min read

0
1
40330

article-image-splunk-how-to-work-with-multiple-indexes-tutorial

Pravin Dhandre

20 Jun 2018

12 min read

Splunk: How to work with multiple indexes [Tutorial]

Pravin Dhandre

20 Jun 2018

12 min read

An index in Splunk is a storage pool for events, capped by size and time. By default, all events will go to the index specified by defaultDatabase, which is called main but lives in a directory called defaultdb. In this tutorial, we put focus to index structures, need of multiple indexes, how to size an index and how to manage multiple indexes in a Splunk environment. This article is an excerpt from a book written by James D. Miller titled Implementing Splunk 7 - Third Edition. Directory structure of an index Each index occupies a set of directories on the disk. By default, these directories live in $SPLUNK_DB, which, by default, is located in $SPLUNK_HOME/var/lib/splunk. Look at the following stanza for the main index: [main] homePath = $SPLUNK_DB/defaultdb/db coldPath = $SPLUNK_DB/defaultdb/colddb thawedPath = $SPLUNK_DB/defaultdb/thaweddb maxHotIdleSecs = 86400 maxHotBuckets = 10 maxDataSize = auto_high_volume If our Splunk installation lives at /opt/splunk, the index main is rooted at the path /opt/splunk/var/lib/splunk/defaultdb. To change your storage location, either modify the value of SPLUNK_DB in $SPLUNK_HOME/etc/splunk-launch.conf or set absolute paths in indexes.conf. splunk-launch.conf cannot be controlled from an app, which means it is easy to forget when adding indexers. For this reason, and for legibility, I would recommend using absolute paths in indexes.conf. The homePath directories contain index-level metadata, hot buckets, and warm buckets. coldPath contains cold buckets, which are simply warm buckets that have aged out. See the upcoming sections The lifecycle of a bucket and Sizing an index for details. When to create more indexes There are several reasons for creating additional indexes. If your needs do not meet one of these requirements, there is no need to create more indexes. In fact, multiple indexes may actually hurt performance if a single query needs to open multiple indexes. Testing data If you do not have a test environment, you can use test indexes for staging new data. This then allows you to easily recover from mistakes by dropping the test index. Since Splunk will run on a desktop, it is probably best to test new configurations locally, if possible. Differing longevity It may be the case that you need more history for some source types than others. The classic example here is security logs, as compared to web access logs. You may need to keep security logs for a year or more, but need the web access logs for only a couple of weeks. If these two source types are left in the same index, security events will be stored in the same buckets as web access logs and will age out together. To split these events up, you need to perform the following steps: Create a new index called security, for instance Define different settings for the security index Update inputs.conf to use the new index for security source types For one year, you might make an indexes.conf setting such as this: [security] homePath = $SPLUNK_DB/security/db coldPath = $SPLUNK_DB/security/colddb thawedPath = $SPLUNK_DB/security/thaweddb #one year in seconds frozenTimePeriodInSecs = 31536000 For extra protection, you should also set maxTotalDataSizeMB, and possibly coldToFrozenDir. If you have multiple indexes that should age together, or if you will split homePath and coldPath across devices, you should use volumes. See the upcoming section, Using volumes to manage multiple indexes, for more information. Then, in inputs.conf, you simply need to add an index to the appropriate stanza as follows: [monitor:///path/to/security/logs/logins.log] sourcetype=logins index=security Differing permissions If some data should only be seen by a specific set of users, the most effective way to limit access is to place this data in a different index, and then limit access to that index by using a role. The steps to accomplish this are essentially as follows: Define the new index. Configure inputs.conf or transforms.conf to send these events to the new index. Ensure that the user role does not have access to the new index. Create a new role that has access to the new index. Add specific users to this new role. If you are using LDAP authentication, you will need to map the role to an LDAP group and add users to that LDAP group. To route very specific events to this new index, assuming you created an index called sensitive, you can create a transform as follows: [contains_password] REGEX = (?i)password[=:] DEST_KEY = _MetaData:Index FORMAT = sensitive You would then wire this transform to a particular sourcetype or source index in props.conf. Using more indexes to increase performance Placing different source types in different indexes can help increase performance if those source types are not queried together. The disks will spend less time seeking when accessing the source type in question. If you have access to multiple storage devices, placing indexes on different devices can help increase the performance even more by taking advantage of different hardware for different queries. Likewise, placing homePath and coldPath on different devices can help performance. However, if you regularly run queries that use multiple source types, splitting those source types across indexes may actually hurt performance. For example, let's imagine you have two source types called web_access and web_error. We have the following line in web_access: 2012-10-19 12:53:20 code=500 session=abcdefg url=/path/to/app And we have the following line in web_error: 2012-10-19 12:53:20 session=abcdefg class=LoginClass If we want to combine these results, we could run a query like the following: (sourcetype=web_access code=500) OR sourcetype=web_error | transaction maxspan=2s session | top url class If web_access and web_error are stored in different indexes, this query will need to access twice as many buckets and will essentially take twice as long. The life cycle of a bucket An index is made up of buckets, which go through a specific life cycle. Each bucket contains events from a particular period of time. The stages of this lifecycle are hot, warm, cold, frozen, and thawed. The only practical difference between hot and other buckets is that a hot bucket is being written to, and has not necessarily been optimized. These stages live in different places on the disk and are controlled by different settings in indexes.conf: homePath contains as many hot buckets as the integer value of maxHotBuckets, and as many warm buckets as the integer value of maxWarmDBCount. When a hot bucket rolls, it becomes a warm bucket. When there are too many warm buckets, the oldest warm bucket becomes a cold bucket. Do not set maxHotBuckets too low. If your data is not parsing perfectly, dates that parse incorrectly will produce buckets with very large time spans. As more buckets are created, these buckets will overlap, which means all buckets will have to be queried every time, and performance will suffer dramatically. A value of five or more is safe. coldPath contains cold buckets, which are warm buckets that have rolled out of homePath once there are more warm buckets than the value of maxWarmDBCount. If coldPath is on the same device, only a move is required; otherwise, a copy is required. Once the values of frozenTimePeriodInSecs, maxTotalDataSizeMB, or maxVolumeDataSizeMB are reached, the oldest bucket will be frozen. By default, frozen means deleted. You can change this behavior by specifying either of the following: coldToFrozenDir: This lets you specify a location to move the buckets once they have aged out. The index files will be deleted, and only the compressed raw data will be kept. This essentially cuts the disk usage by half. This location is unmanaged, so it is up to you to watch your disk usage. coldToFrozenScript: This lets you specify a script to perform some action when the bucket is frozen. The script is handed the path to the bucket that is about to be frozen. thawedPath can contain buckets that have been restored. These buckets are not managed by Splunk and are not included in all time searches. To search these buckets, their time range must be included explicitly in your search. I have never actually used this directory. Search https://splunk.com for restore archived to learn the procedures. Sizing an index To estimate how much disk space is needed for an index, use the following formula: (gigabytes per day) * .5 * (days of retention desired) Likewise, to determine how many days you can store an index, the formula is essentially: (device size in gigabytes) / ( (gigabytes per day) * .5 ) The .5 represents a conservative compression ratio. The log data itself is usually compressed to 10 percent of its original size. The index files necessary to speed up search brings the size of a bucket closer to 50 percent of the original size, though it is usually smaller than this. If you plan to split your buckets across devices, the math gets more complicated unless you use volumes. Without using volumes, the math is as follows: homePath = (maxWarmDBCount + maxHotBuckets) * maxDataSize coldPath = maxTotalDataSizeMB - homePath For example, say we are given these settings: [myindex] homePath = /splunkdata_home/myindex/db coldPath = /splunkdata_cold/myindex/colddb thawedPath = /splunkdata_cold/myindex/thaweddb maxWarmDBCount = 50 maxHotBuckets = 6 maxDataSize = auto_high_volume #10GB on 64-bit systems maxTotalDataSizeMB = 2000000 Filling in the preceding formula, we get these values: homePath = (50 warm + 6 hot) * 10240 MB = 573440 MB coldPath = 2000000 MB - homePath = 1426560 MB If we use volumes, this gets simpler and we can simply set the volume sizes to our available space and let Splunk do the math. Using volumes to manage multiple indexes Volumes combine pools of storage across different indexes so that they age out together. Let's make up a scenario where we have five indexes and three storage devices. The indexes are as follows: Name Data per day Retention required Storage needed web 50 GB no requirement ? security 1 GB 2 years 730 GB * 50 percent app 10 GB no requirement ? chat 2 GB 2 years 1,460 GB * 50 percent web_summary 1 GB 1 years 365 GB * 50 percent Now let's say we have three storage devices to work with, mentioned in the following table: Name Size small_fast 500 GB big_fast 1,000 GB big_slow 5,000 GB We can create volumes based on the retention time needed. Security and chat share the same retention requirements, so we can place them in the same volumes. We want our hot buckets on our fast devices, so let's start there with the following configuration: [volume:two_year_home] #security and chat home storage path = /small_fast/two_year_home maxVolumeDataSizeMB = 300000 [volume:one_year_home] #web_summary home storage path = /small_fast/one_year_home maxVolumeDataSizeMB = 150000 For the rest of the space needed by these indexes, we will create companion volume definitions on big_slow, as follows: [volume:two_year_cold] #security and chat cold storage path = /big_slow/two_year_cold maxVolumeDataSizeMB = 850000 #([security]+[chat])*1024 - 300000 [volume:one_year_cold] #web_summary cold storage path = /big_slow/one_year_cold maxVolumeDataSizeMB = 230000 #[web_summary]*1024 - 150000 Now for our remaining indexes, whose timeframe is not important, we will use big_fast and the remainder of big_slow, like so: [volume:large_home] #web and app home storage path = /big_fast/large_home maxVolumeDataSizeMB = 900000 #leaving 10% for pad [volume:large_cold] #web and app cold storage path = /big_slow/large_cold maxVolumeDataSizeMB = 3700000 #(big_slow - two_year_cold - one_year_cold)*.9 Given that the sum of large_home and large_cold is 4,600,000 MB, and a combined daily volume of web and app is 60,000 MB approximately, we should retain approximately 153 days of web and app logs with 50 percent compression. In reality, the number of days retained will probably be larger. With our volumes defined, we now have to reference them in our index definitions: [web] homePath = volume:large_home/web coldPath = volume:large_cold/web thawedPath = /big_slow/thawed/web [security] homePath = volume:two_year_home/security coldPath = volume:two_year_cold/security thawedPath = /big_slow/thawed/security coldToFrozenDir = /big_slow/frozen/security [app] homePath = volume:large_home/app coldPath = volume:large_cold/app thawedPath = /big_slow/thawed/app [chat] homePath = volume:two_year_home/chat coldPath = volume:two_year_cold/chat thawedPath = /big_slow/thawed/chat coldToFrozenDir = /big_slow/frozen/chat [web_summary] homePath = volume:one_year_home/web_summary coldPath = volume:one_year_cold/web_summary thawedPath = /big_slow/thawed/web_summary thawedPath cannot be defined using a volume and must be specified for Splunk to start. For extra protection, we specified coldToFrozenDir for the indexes' security and chat. The buckets for these indexes will be copied to this directory before deletion, but it is up to us to make sure that the disk does not fill up. If we allow the disk to fill up, Splunk will stop indexing until space is made available. This is just one approach to using volumes. You could overlap in any way that makes sense to you, as long as you understand that the oldest bucket in a volume will be frozen first, no matter what index put the bucket in that volume. With this, we learned to operate multiple indexes and how we can get effective business intelligence out of the data without hurting system performance. If you found this tutorial useful, do check out the book Implementing Splunk 7 - Third Edition and start creating advanced Splunk dashboards. Splunk leverages AI in its monitoring tools Splunk’s Input Methods and Data Feeds Splunk Industrial Asset Intelligence (Splunk IAI) targets Industrial IoT marketplace

0
0
17592

article-image-handling-backup-and-recovery-in-postgresql-10

Savia Lobo

18 Jun 2018

11 min read

Handling backup and recovery in PostgreSQL 10 [Tutorial]

Savia Lobo

18 Jun 2018

11 min read

Performing backups should be a regular task and every administrator is supposed to keep an eye on this vital stuff. Fortunately, PostgreSQL provides an easy means to create backups. In this tutorial, you will learn how to backup data by performing some simple dumps and also recover them using PostgreSQL 10. This article is an excerpt taken from, 'Mastering PostgreSQL 10' written by Hans-Jürgen Schönig. This book highlights the newly introduced features in PostgreSQL 10, and shows you how to build better PostgreSQL applications. Performing simple dumps If you are running a PostgreSQL setup, there are two major methods to perform backups: Logical dumps (extract an SQL script representing your data) Transaction log shipping The idea behind transaction log shipping is to archive binary changes made to the database. Most people claim that transaction log shipping is the only real way to do backups. However, in my opinion, this is not necessarily true. Many people rely on pg_dump to simply extract a textual representation of the data. pg_dump is also the oldest method of creating a backup and has been around since the very early days of the project (transaction log shipping was added much later). Every PostgreSQL administrator will become familiar with pg_dump sooner or later, so it is important to know how it really works and what it does. Running pg_dump The first thing we want to do is to create a simple textual dump: [hs@linuxpc ~]$ pg_dump test > /tmp/dump.sql This is the most simplistic backup you can imagine. pg_dump logs into the local database instance connects to a database test and starts to extract all the data, which will be sent to stdout and redirected to the file. The beauty is that standard output gives you all the flexibility of a Unix system. You can easily compress the data using a pipe or do whatever you want. In some cases, you might want to run pg_dump as a different user. All PostgreSQL client programs support a consistent set of command-line parameters to pass user information. If you just want to set the user, use the -U flag: [hs@linuxpc ~]$ pg_dump -U whatever_powerful_user test > /tmp/dump.sql The following set of parameters can be found in all PostgreSQL client programs: ... Connection options: -d, --dbname=DBNAME database to dump -h, --host=HOSTNAME database server host or socket directory -p, --port=PORT database server port number -U, --username=NAME connect as specified database user -w, --no-password never prompt for password -W, --password force password prompt (should happen automatically) --role=ROLENAME do SET ROLE before dump ... Just pass the information you want to pg_dump, and if you have enough permissions, PostgreSQL will fetch the data. The important thing here is to see how the program really works. Basically, pg_dump connects to the database and opens a large repeatable read transaction that simply reads all the data. Remember, repeatable read ensures that PostgreSQL creates a consistent snapshot of the data, which does not change throughout the transactions. In other words, a dump is always consistent—no foreign keys will be violated. The output is a snapshot of data as it was when the dump started. Consistency is a key factor here. It also implies that changes made to the data while the dump is running won't make it to the backup anymore. A dump simply reads everything—therefore, there are no separate permissions to be able to dump something. As long as you can read it, you can back it up. Also, note that the backup is by default in a textual format. This means that you can safely extract data from say, Solaris, and move it to some other CPU architecture. In the case of binary copies, that is clearly not possible as the on-disk format depends on your CPU architecture. Passing passwords and connection information If you take a close look at the connection parameters shown in the previous section, you will notice that there is no way to pass a password to pg_dump. You can enforce a password prompt, but you cannot pass the parameter to pg_dump using a command-line option. The reason for that is simple: the password might show up in the process table and be visible to other people. Therefore, this is not supported. The question now is: if pg_hba.conf on the server enforces a password, how can the client program provide it? There are various means of doing that: Making use of environment variables Making use of .pgpass Using service files In this section, you will learn about all three methods. Using environment variables One way to pass all kinds of parameters is to use environment variables. If the information is not explicitly passed to pg_dump, it will look for the missing information in predefined environment variables. A list of all potential settings can be found here: https://www.postgresql.org/docs/10/static/libpq-envars.html. The following overview shows some environment variables commonly needed for backups: PGHOST: It tells the system which host to connect to PGPORT: It defines the TCP port to be used PGUSER: It tells a client program about the desired user PGPASSWORD: It contains the password to be used PGDATABASE: It is the name of the database to connect to The advantage of these environments is that the password won't show up in the process table. However, there is more. Consider the following example: psql -U ... -h ... -p ... -d ... Suppose you are a system administrator: do you really want to type a long line like that a couple of times every day? If you are working with the very same host again and again, just set those environment variables and connect with plain SQL: [hs@linuxpc ~]$ export PGHOST=localhost [hs@linuxpc ~]$ export PGUSER=hs [hs@linuxpc ~]$ export PGPASSWORD=abc [hs@linuxpc ~]$ export PGPORT=5432 [hs@linuxpc ~]$ export PGDATABASE=test [hs@linuxpc ~]$ psql psql (10.1) Type "help" for help. As you can see, there are no command-line parameters anymore. Just type psql and you are in. All applications based on the standard PostgreSQLC-language client library (libpq) will understand these environment variables, so you cannot only use them for psql and pg_dump, but for many other applications. Making use of .pgpass A very common way to store login information is via the use of .pgpass files. The idea is simple: put a file called .pgpass into your home directory and put your login information there. The format is simple: hostname:port:database:username:password An example would be: 192.168.0.45:5432:mydb:xy:abc PostgreSQL offers some nice additional functionality: most fields can contain *. Here is an example: *:*:*:xy:abc This means that on every host, on every port, for every database, the user called xy will use abc as the password. To make PostgreSQL use the .pgpass file, make sure that the right file permissions are in place: chmod 0600 ~/.pgpass .pgpass can also be used on a Windows system. In this case, the file can be found in the %APPDATA%postgresqlpgpass.conf path. Using service files However, there is not just the .pgpass file. You can also make use of service files. Here is how it works. If you want to connect to the very same servers over and over again, you can create a .pg_service.conf file. It will hold all the connection information you need. Here is an example of a .pg_service.conf file: Mac:~ hs$ cat .pg_service.conf # a sample service [hansservice] host=localhost port=5432 dbname=test user=hs password=abc [paulservice] host=192.168.0.45 port=5432 dbname=xyz user=paul password=cde To connect to one of the services, just set the environment and connect: iMac:~ hs$ export PGSERVICE=hansservice A connection can now be established without passing parameters to psql: iMac:~ hs$ psql psql (10.1) Type "help" for help. test=# Alternatively, you can use: psql service=hansservice Extracting subsets of data Up to now, you have seen how to dump an entire database. However, this is not what you might wish for. In many cases, you might just want to extract a subset of tables or schemas. pg_dump can do that and provides a number of switches: -a: It dumps only the data and does not dump the data structure -s: It dumps only the data structure but skips the data -n: It dumps only a certain schema -N: It dumps everything but excludes certain schemas -t: It dumps only certain tables -T: It dumps everything but certain tables (this can make sense if you want to exclude logging tables and so on) Partial dumps can be very useful to speed things up considerably. Handling various formats So far, you have seen that pg_dump can be used to create text files. The problem is that a text file can only be replayed completely. If you have saved an entire database, you can only replay the entire thing. In many cases, this is not what you want. Therefore, PostgreSQL has additional formats that also offer more functionality. At this point, four formats are supported: -F, --format=c|d|t|p output file format (custom, directory, tar, plain text (default)) You have already seen plain, which is just normal text. On top of that, you can use a custom format. The idea behind a custom format is to have a compressed dump, including a table of contents. Here are two ways to create a custom format dump: [hs@linuxpc ~]$ pg_dump -Fc test > /tmp/dump.fc [hs@linuxpc ~]$ pg_dump -Fc test -f /tmp/dump.fc In addition to the table of contents, the compressed dump has one more advantage: it is a lot smaller. The rule of thumb is that a custom format dump is around 90% smaller than the database instance you are about to back up. Of course, this highly depends on the number of indexes and all that, but for many database applications, this rough estimation will hold true. Once you have created the backup, you can inspect the backup file: [hs@linuxpc ~]$ pg_restore --list /tmp/dump.fc ; ; Archive created at 2017-11-04 15:44:56 CET ; dbname: test ; TOC Entries: 18 ; Compression: -1 ; Dump Version: 1.12-0 ; Format: CUSTOM ; Integer: 4 bytes ; Offset: 8 bytes ; Dumped from database version: 10.1 ; Dumped by pg_dump version: 10.1 ; ; Selected TOC Entries: ; 3103; 1262 16384 DATABASE - test hs 3; 2615 2200 SCHEMA - public hs 3104; 0 0 COMMENT - SCHEMA public hs 1; 3079 13350 EXTENSION - plpgsql 3105; 0 0 COMMENT - EXTENSION plpgsql 187; 1259 16391 TABLE public t_test hs ... pg_restore --list will return the table of contents of the backup. Using a custom format is a good idea as the backup will shrink in size. However, there is more; the -Fd command will create a backup in a directory format. Instead of a single file, you will now get a directory containing a couple of files: [hs@linuxpc ~]$ mkdir /tmp/backup [hs@linuxpc ~]$ pg_dump -Fd test -f /tmp/backup/ [hs@linuxpc ~]$ cd /tmp/backup/ [hs@linuxpc backup]$ ls -lh total 86M -rw-rw-r--. 1 hs hs 85M Jan 4 15:54 3095.dat.gz -rw-rw-r--. 1 hs hs 107 Jan 4 15:54 3096.dat.gz -rw-rw-r--. 1 hs hs 740K Jan 4 15:54 3097.dat.gz -rw-rw-r--. 1 hs hs 39 Jan 4 15:54 3098.dat.gz -rw-rw-r--. 1 hs hs 4.3K Jan 4 15:54 toc.dat One advantage of the directory format is that you can use more than one core to perform the backup. In the case of a plain or custom format, only one process will be used by pg_dump. The directory format changes that rule. The following example shows how you can tell pg_dump to use four cores (jobs): [hs@linuxpc backup]$ rm -rf * [hs@linuxpc backup]$ pg_dump -Fd test -f /tmp/backup/ -j 4 Note that the more objects you have in your database, the more potential speedup there will be. To summarize, you learned about creating backups in general. If you've enjoyed reading this post, do check out 'Mastering PostgreSQL 10' to know how to replay backup and handle global data in PostgreSQL10. You will learn how to use PostgreSQL onboard tools to replicate instances. PostgreSQL 11 Beta 1 is out! How to perform data partitioning in PostgreSQL 10 How to implement Dynamic SQL in PostgreSQL 10

0
0
31136

article-image-data-professionals-planning-to-learn-this-year-python-deep-learning

Amey Varangaonkar

14 Jun 2018

4 min read

What are data professionals planning to learn this year? Python, deep learning, yes. But also...

Amey Varangaonkar

14 Jun 2018

4 min read

One thing that every data professional absolutely dreads is the day their skills are no longer relevant in the market. In an ever-changing tech landscape, one must be constantly on the lookout for the most relevant, industrially-accepted tools and frameworks. This is applicable everywhere - from application and web developers to cybersecurity professionals. Not even the data professionals are excluded from this, as new ways and means to extract actionable insights from raw data are being found out almost every day. Gone are the days when data pros stuck to a single language and a framework to work with their data. Frameworks are more flexible now, with multiple dependencies across various tools and languages. Not just that, new domains are being identified where these frameworks can be applied, and how they can be applied varies massively as well. A whole new arena of possibilities has opened up, and with that new set of skills and toolkits to work on these domains have also been unlocked. What’s the next big thing for data professionals? We recently polled thousands of data professionals as part of our Skill-Up program, and got some very interesting insights into what they think the future of data science looks like. We asked them what they were planning to learn in the next 12 months. The following word cloud is the result of their responses, weighted by frequency of the tools they chose: What data professionals are planning on learning in the next 12 months Unsurprisingly, Python comes out on top as the language many data pros want to learn in the coming months. With its general-purpose nature and innumerable applications across various use-cases, Python’s sky-rocketing popularity is the reason everybody wants to learn it. Machine learning and AI are finding significant applications in the web development domain today. They are revolutionizing the customers’ digital experience through conversational UIs or chatbots. Not just that, smart machine learning algorithms are being used to personalize websites and their UX. With all these reasons, who wouldn’t want to learn JavaScript, as an important tool to have in their data science toolkit? Add to that the trending web dev framework Angular, and you have all the tools to build smart, responsive front-end web applications. We also saw data professionals taking active interest in the mobile and cloud domains as well. They aim to learn Kotlin and combine its power with the data science tools for developing smarter and more intelligent Android apps. When it comes to the cloud, Microsoft’s Azure platform has introduced many built-in machine learning capabilities, as well as a workbench for data scientists to develop effective, enterprise-grade models. Data professionals also prefer Docker containers to run their applications seamlessly, and hence its learning need seems to be quite high. [box type="shadow" align="" class="" width=""]Has machine learning with JavaScript caught your interest? Don’t worry, we got you covered - check out Hands-on Machine Learning with JavaScript for a practical, hands-on coverage of the essential machine learning concepts using the leading web development language. [/box] With Crypto’s popularity off the roof (sadly, we can’t say the same about Bitcoin’s price), data pros see Blockchain as a valuable skill. Building secure, decentralized apps is on the agenda for many, perhaps. Cloud, Big Data, Artificial Intelligence are some of the other domains that the data pros find interesting, and feel worth skilling up in. Work-related skills that data pros want to learn We also asked the data professionals what skills the data pros wanted to learn in the near future that could help them with their daily jobs more effectively. The following word cloud of their responses paints a pretty clear picture: Valuable skills data professionals want to learn for their everyday work As Machine learning and AI go mainstream, so do their applications in mainstream domains - often resulting in complex problems. Well, there’s deep learning and specifically neural networks to tackle these problems, and these are exactly the skills data pros want to master in order to excel at their work. [box type="shadow" align="" class="" width=""]Data pros want to learn Machine Learning in Python. Do you? Here’s a useful resource for you to get started - check out Python Machine Learning, Second Edition today![/box] So, there it is! What are the tools, languages or frameworks that you are planning to learn in the coming months? Do you agree with the results of the poll? Do let us know. What are web developers favorite front-end tools? Packt’s Skill Up report reveals all Data cleaning is the worst part of data analysis, say data scientists 15 Useful Python Libraries to make your Data Science tasks Easier

0
0
42899

article-image-alarming-ways-governments-use-surveillance-tech

Neil Aitken

14 Jun 2018

12 min read

Alarming ways governments are using surveillance tech to watch you

Neil Aitken

14 Jun 2018

12 min read

Mapquest, part of the Verizon company, is the second largest provider of mapping services in the world, after Google Maps. It provides advanced cartography services to companies like Snap and PapaJohns pizza. The company is about to release an app that users can install on their smartphone. Their new application will record and transmit video images of what’s happening in front of your vehicle, as you travel. Data can be sent from any phone with a camera – using the most common of tools – a simple mobile data plan, for example. In exchange, you’ll get live traffic updates, among other things. Mapquest will use the video image data they gather to provide more accurate and up to date maps to their partners. The real world is changing all the time – roads get added, cities re-route traffic from time to time. The new AI based technology Mapquest employ could well improve the reliability of driverless cars, which have to engage with this ever changing landscape, in a safe manner. No-one disagrees with safety improvements. Mapquests solution is impressive technology. The fact that they can use AI to interpret the images they see and upload the information they receive to update maps is incredible. And, in this regard, the company is just one of the myriad daily news stories which excite and astound us. These stories do, however, often have another side to them which is rarely acknowledged. In the wrong hands, Mapquest’s solution could create a surveillance database which tracked people in real time. Surveillance technology involves the use of data and information products to capture details about individuals. The act of surveillance is usually undertaken with a view to achieving a goal. The principle is simple. The more ‘they’ know about you, the easier it will be to influence you towards their ends. Surveillance information can be used to find you, apprehend you or potentially, to change your mind, without even realising that you had been watched. Mapquest’s innovation is just a single example of surveillance technology in government hands which has expanded in capability far beyond what most people realise. Read also: What does the US government know about you? The truth beyond the Facebook scandal Facebook’s share price fell 14% in early 2018 as a result of public outcry related to the Cambridge Analytica announcements the company made. The idea that a private company had allowed detailed information about individuals to be provided to a third party without their consent appeared to genuinely shock and appall people. Technology tools like Mapquest’s tracking capabilities and Facebook’s profiling techniques are being taken and used by police forces and corporate entities around the world. The reality of current private and public surveillance capabilities is that facilities exist, and are in use, to collect and analyse data on most people in the developing world. The known limits of these services may surprise even those who are on the cutting edge of technology. There are so many examples from all over the world listed below that will genuinely make you want to consider going off grid! Innovative, Ingenious overlords: US companies have a flare for surveillance The US is the centre for information based technology companies. Much of what they develop is exported as well as used domestically. The police are using human genome matching to track down criminals and can find ‘any family in the country’ There have been 2 recent examples of police arresting a suspect after using human genome databases to investigate crimes. A growing number of private individuals have now used publicly available services such as 23andme to sequence their genome (DNA) either to investigate further their family tree, or to determine the potential of a pre-disposition to the gene based component of a disease. In one example, The Golden State Killer, an ex cop, was arrested 32 years after the last reported rape in a series of 45 (in addition to 12 murders) which occurred between 1976 and 1986. To track him down, police approached sites like 23andme with DNA found at crime scenes, established a family match and then progressed the investigation using conventional means. More than 12 million Americans have now used a genetic sequencing service and it is believed that investigators could find a family match for the DNA of anyone who has committed a crime in America. In simple terms, whether you want it or not, the law enforcement has the DNA of every individual in the country available to them. Domain Awareness Centers (DAC) bring the Truman Show to life The 400,000 Residents of Oakland, California discovered in 2012, that they had been the subject of an undisclosed mass surveillance project, by the local police force, for many years. Feeds from CCTV cameras installed in Oakland’s suburbs were augmented with weather information feeds, social media feeds and extracted email conversations, as well as a variety of other sources. The scheme began at Oakland’s port with Federal funding as part of a national response to the events of 9.11.2001 but was extended to cover the near half million residents of the city. Hundreds of additional video cameras were installed, along with gunshot recognition microphones and some of the other surveillance technologies provided in this article. The police force conducting the surveillance had no policy on what information was recorded or for how long it was kept. Internet connected toys spy on children The FBI has warned Americans that children’s toys connected to the internet ‘could put the privacy and safety of children at risk.' Children’s toy Hello Barbie was specifically admonished for poor privacy controls as part of the FBI’s press release. Internet connected toys could be used to record video of children at any point in the day or, conceivably, to relay a human voice, making it appear to the child that the toy was talking to them. Oracle suggest Google’s Android operating system routinely tracks users’ position even when maps are turned off In Australia, two American companies have been involved in a disagreement about the potential monitoring of Android phones. Oracle accused Google of monitoring users’ location (including altitude), even when mapping software is turned off on the device. The tracking is performed in the background of their phone. In Australia alone, Oracle suggested that Google’s monitoring could involve around 1GB of additional mobile data every month, costing users nearly half a billion dollars a year, collectively. Amazon facial recognition in real time helps US law enforcement services Amazon are providing facial recognition services which take a feed from public video cameras, to a number of US Police Forces. Amazon can match images taken in real time to a database containing ‘millions of faces.’ Are there any state or Federal rules in place to govern police facial recognition? Wired reported that there are ‘more or less none.’ Amazon’s scheme is a trial taking place in Florida. There are at least 2 other companies offering similar schemes in the US to law enforcement services. Big glass microphone can help agencies keep an ear on the ground Project ‘Big Glass Microphone’ uses the vibrations that the movements of cars (among other things) cause in the buried fiber optic telecommunications links. A successful test of the technology has been undertaken on the fiber optic cables which run underground on the Stanford University Campus, to record vehicle movements. Fiber optic links now make up the backbone of much data transport infrastructure - the way your phone and computer connect to the internet. Big glass microphone as it stands is the first step towards ‘invisible’ monitoring of people and their assets. It appears the FBI now have the ability to crack/access any phone Those in the know suggest that Apple’s iPhone is the most secure smart device against government surveillance. In 2016, this was put to the test. The Justice Department came into possession of an iPhone allegedly belonging to one of the San Bernadino shooters and ultimately sued Apple in an attempt to force the company to grant access to it, as part of their investigation. The case was ultimately dropped leading some to speculate that NAND mirroring techniques were used to gain access to the phone without Apple’s assistance, implying that even the most secure phones can now be accessed by authorities. Cornell University’s lie detecting algorithm Groundbreaking work by Cornell University will provide ‘at a distance’ access to information that previously required close personal access to an accused subject. Cornell’s solution interprets feeds from a number of video cameras on subjects and analyses the results to judge their heart rate. They believe the system can be used to determine if someone is lying from behind a screen. University of Southern California can anticipate social unrest with social media feeds Researchers at the University Of Southern California have developed an AI tool to study Social Media posts and determine whether those writing them are likely to cause Social Unrest. The software claims to have identified an association between both the volume of tweets written / the content of those tweets and protests turning physical. They can now offer advice to law enforcement on the likelihood of a protest turning violent so they can be properly prepared based on this information. The UK, an epicenter of AI progress, is not far behind in tracking people The UK has a similarly impressive array of tools at its disposal to watch the people that representatives of the country feels may be required. Given the close levels of cooperation between the UK and US governments, it is likely that many of these UK facilities are shared with the US and other NATO partners. Project stingray – fake cell phone/mobile phone ‘towers’ to intercept communications Stingray is a brand name for an IMSI (the unique identifier on a SIM card) tracker. They ‘spoof’ real towers, presenting themselves as the closest mobile phone tower. This ‘fools’ phones in to connecting to them. The technology has been used to spy on criminals in the UK but it is not just the UK government which use Stingray or its equivalents. The Washington Post reported in June 2018 that a number of domestically compiled intelligence reports suggest that foreign governments acting on US soil, including China and Russia, have been eavesdropping on the Whitehouse, using the same technology. UK developed Spyware is being used by authoritarian regimes Gamma International is a company based in Hampshire UK, which provided the (notably authoritarian) Egyptian government with a facility to install what was effectively spyware delivered with a virus on to computers in their country. Once installed, the software permitted the government to monitor private digital interactions, without the need to engage the phone company or ISP offering those services. Any internet based technology could be tracked, assisting in tracking down individuals who may have negative feelings about the Egyptian government. Individual arrested when his fingerprint was taken from a WhatsApp picture of his hand A Drug Dealer was pictured holding an assortment of pills in the UK two months ago. The image of his hand was used to extract an image of his fingerprint. From that, forensic scientists used by UK police, confirmed that officers had arrested the correct person and associated him with drugs. AI solutions to speed up evidence processing including scanning laptops and phones UK police forces are trying out AI software to speed up processing evidence from digital devices. A dozen departments around the UK are using software, called Cellebrite, which employs AI algorithms to search through data found on devices, including phones and laptops. Cellbrite can recognize images that contain child abuse, accepts feeds from multiple devices to see when multiple owners were in the same physical location at the same time and can read text from screenshots. Officers can even feed it photos of suspects to see if a picture of them show up on someone’s hard drive. China takes the surveillance biscuit and may show us a glimpse of the future There are 600 million mobile phone users in China, each producing a great deal of information about their users. China has a notorious record of human rights abuses and the ruling Communist Party takes a controlling interest (a board seat) in many of their largest technology companies, to ensure the work done is in the interest of the party as well as profitable for the corporate. As a result, China is on the front foot when it comes to both AI and surveillance technology. China’s surveillance tools could be a harbinger of the future in the Western world. Chinese cities will be run by a private company Alibaba, China’s equivalent of Amazon, already has control over the traffic lights in one Chinese city, Hangzhou. Alibaba is far from shy about it’s ambitions. It has 120,000 developers working on the problem and intends to commercialise and sell the data it gathers about citizens. The AI based product they’re using is called CityBrain. In the future, all Chinese cities could well all be run by AI from the Alibaba corporation the idea is to use this trial as a template for every city. The technology is likely to be placed in Kuala Lumpur next. In the areas under CityBrain’s control, traffic speeds have increased by 15% already. However, some of those observing the situation have expressed concerns not just about (the lack of) oversight on CityBrain’s current capabilities but the potential for future abuse. What to make of this incredible list of surveillance capabilities Facilities like Mapquest’s new mapping service are beguiling. They’re clever ideas which create a better works. Similar technology, however, behind the scenes, is being adopted by law enforcement bodies in an ever growing list of countries. Even for someone who understands cutting edge technology, the sum of those facilities may be surprising. Literally any aspect of your behaviour, from the way you walk, to your face, your heatmap and, of course, the contents of your phone and laptops can now be monitored. Law enforcement can access and review information feeds with Artificial Intelligence software, to process and summarise findings quickly. In some cases, this is being done without the need for a warrant. Concerningly, these advances seem to be coming without policy or, in many cases any form of oversight. We must change how we think about AI, urge AI founding fathers

0
0
25733

article-image-create-connection-qlik-engine-tip

Amey Varangaonkar

13 Jun 2018

8 min read

5 ways to create a connection to the Qlik Engine [Tip]

Amey Varangaonkar

13 Jun 2018

8 min read

With mashups or web apps, the Qlik Engine sits outside of your project and is not accessible and loaded by default. The first step before doing anything else is to create a connection with the Qlik Engine, after which you can continue to open a session and perform further actions on that app, such as: Opening a document/app Making selections Retrieving visualizations and apps For using the Qlik Engine API, open a WebSocket to the engine. There may be a difference in the way you do this, depending on whether you are working with Qlik Sense Enterprise or Qlik Sense Desktop. In this article, we will elaborate on how you can achieve a connection to the Qlik engine and the benefits of doing so. The following excerpt has been taken from the book Mastering Qlik Sense, authored by Martin Mahler and Juan Ignacio Vitantonio. Creating a connection To create a connection using WebSockets, you first need to establish a new web socket communication line. To open a WebSocket to the engine, use one of the following URIs: Qlik Sense Enterprise Qlik Sense Desktop wss://server.domain.com:4747/app/ or wss://server.domain.com[/virtual proxy]/app/ ws://localhost:4848/app Creating a Connection using WebSockets In the case of Qlik Sense Desktop, all you need to do is define a WebSocket variable, including its connection string in the following way: var ws = new WebSocket("ws://localhost:4848/app/"); Once the connection is opened and checking for ws.open(), you can call additional methods to the engine using ws.send(). This example will retrieve the number of available documents in my Qlik Sense Desktop environment, and append them to an HTML list: <html> <body> <ul id='docList'> </ul> </body> </html> <script> var ws = new WebSocket("ws://localhost:4848/app/"); var request = { "handle": -1, "method": "GetDocList", "params": {}, "outKey": -1, "id": 2 } ws.onopen = function(event){ ws.send(JSON.stringify(request)); // Receive the response ws.onmessage = function (event) { var response = JSON.parse(event.data); if(response.method != ' OnConnected'){ var docList = response.result.qDocList; var list = ''; docList.forEach(function(doc){ list += '<li>'+doc.qDocName+'</li>'; }) document.getElementById('docList').innerHTML = list; } } } </script> The preceding example will produce the following output on your browser if you have Qlik Sense Desktop running in the background: All Engine methods and calls can be tested in a user-friendly way by exploring the Qlik Engine in the Dev Hub. A single WebSocket connection can be associated with only one engine session (consisting of the app context, plus the user). If you need to work with multiple apps, you must open a separate WebSocket for each one. If you wish to create a WebSocket connection directly to an app, you can extend the configuration URL to include the application name, or in the case of the Qlik Sense Enterprise, the GUID. You can then use the method from the app class and any other classes as you continue to work with objects within the app. var ws = new WebSocket("ws://localhost:4848/app/MasteringQlikSense.qvf"); Creating Connection to the Qlik Server Engine Connecting to the engine on a Qlik Sense environment is a little bit different as you will need to take care of authentication first. Authentication is handled in different ways, depending on how you have set up your server configuration, with the most common ones being: Ticketing Certificates Header authentication Authentication also depends on where the code that is interacting with the Qlik Engine is running. If your code is running on a trusted computer, authentication can be performed in several ways, depending on how your installation is configured and where the code is running: If you are running the code from a trusted computer, you can use certificates, which first need to be exported via the QMC If the code is running on a web browser, or certificates are not available, then you must authenticate via the virtual proxy of the server Creating a connection using certificates Certificates can be considered as a seal of trust, which allows you to communicate with the Qlik Engine directly with full permission. As such, only backend solutions ever have access to certificates, and you should guard how you distribute them carefully. To connect using certificates, you first need to export them via the QMC, which is a relatively easy thing to do: Once they are exported, you need to copy them to the folder where your project is located using the following code: <html> <body> <h1>Mastering QS</h1> </body> <script> var certPath = path.join('C:', 'ProgramData', 'Qlik', 'Sense', 'Repository', 'Exported Certificates', '.Local Certificates'); var certificates = { cert: fs.readFileSync(path.resolve(certPath, 'client.pem')), key: fs.readFileSync(path.resolve(certPath, 'client_key.pem')), root: fs.readFileSync(path.resolve(certPath, 'root.pem')) }; // Open a WebSocket using the engine port (rather than going through the proxy) var ws = new WebSocket('wss://server.domain.com:4747/app/', { ca: certificates.root, cert: certificates.cert, key: certificates.key, headers: { 'X-Qlik-User': 'UserDirectory=internal; UserId=sa_engine' } }); ws.onopen = function (event) { // Call your methods } </script> Creating a connection using the Mashup API Now, while connecting to the engine is a fundamental step to start interacting with Qlik, it's very low-level, connecting via WebSockets. For advanced use cases, the Mashup API is one way to help you get up to speed with a more developer-friendly abstraction layer. The Mashup API utilizes the qlik interface as an external interface to Qlik Sense, used for mashups and for including Qlik Sense objects in external web pages. To load the qlik module, you first need to ensure RequireJS is available in your main project file. You will then have to specify the URL of your Qlik Sense environment, as well as the prefix of the virtual proxy, if there is one: <html> <body> <h1>Mastering QS</h1> </body> </html> <script src="https://cdnjs.cloudflare.com/ajax/libs/require.js/2.3.5/require.min.js"> <script> //Prefix is used for when a virtual proxy is used with the browser. var prefix = window.location.pathname.substr( 0, window.location.pathname.toLowerCase().lastIndexOf( "/extensions" ) + 1 ); //Config for retrieving the qlik.js module from the Qlik Sense Server var config = { host: window.location.hostname, prefix: prefix, port: window.location.port, isSecure: window.location.protocol === "https:" }; require.config({ baseUrl: (config.isSecure ? "https://" : "http://" ) + config.host + (config.port ? ":" + config.port : "" ) + config.prefix + "resources" }); require(["js/qlik"], function (qlik) { qlik.setOnError( function (error) { console.log(error); }); //Open an App var app = qlik.openApp('MasteringQlikSense.qvf', config); </script> Once you have created the connection to an app, you can start leveraging the full API by conveniently creating HyperCubes, connecting to fields, passing selections, retrieving objects, and much more. The Mashup API is intended for browser-based projects where authentication is handled in the same way as if you were going to open Qlik Sense. If you wish to use the Mashup API, or some parts of it, with a backend solution, you need to take care of authentication first. Creating a connection using enigma.js Enigma is Qlik's open-source promise wrapper for the engine. You can use enigma directly when you're in the Mashup API, or you can load it as a separate module. When you are writing code from within the Mashup API, you can retrieve the correct schema directly from the list of available modules which are loaded together with qlik.js via 'autogenerated/qix/engine-api'. The following example will connect to a Demo App using enigma.js: define(function () { return function () { require(['qlik','enigma','autogenerated/qix/engine-api'], function (qlik, enigma, schema) { //The base config with all details filled in var config = { schema: schema, appId: "My Demo App.qvf", session:{ host:"localhost", port: 4848, prefix: "", unsecure: true, }, } //Now that we have a config, use that to connect to the //QIX service. enigma.getService("qix" , config).then(function(qlik){ qlik.global.openApp(config.appId) //Open App qlik.global.openApp(config.appId).then(function(app){ //Create SessionObject for FieldList app.createSessionObject( { qFieldListDef: { qShowSystem: false, qShowHidden: false, qShowSrcTables: true, qShowSemantic: true, qShowDerivedFields: true }, qInfo: { qId: "FieldList", qType: "FieldList" } } ).then( function(list) { return list.getLayout(); } ).then( function(listLayout) { return listLayout.qFieldList.qItems; } ).then( function(fieldItems) { console.log(fieldItems) } ); }) } })}}) It's essential to also load the correct schema whenever you load enigma.js. The schema is a collection of the available API methods that can be utilized in each version of Qlik Sense. This means your schema needs to be in sync with your QS version. Thus, we see it is fairly easy to create a stable connection with the Qlik Engine API. If you liked the above excerpt, make sure you check out the book Mastering Qlik Sense to learn more tips and tricks on working with different kinds of data using Qlik Sense and extract useful business insights. How Qlik Sense is driving self-service Business Intelligence Overview of a Qlik Sense® Application’s Life Cycle What we learned from Qlik Qonnections 2018

0
0
17925

article-image-how-to-prevent-errors-while-using-utilities-for-loading-data-in-teradata

Pravin Dhandre

11 Jun 2018

9 min read

How to prevent errors while using utilities for loading data in Teradata

Pravin Dhandre

11 Jun 2018

9 min read

In today’s tutorial we will assist you to overcome the errors that arise while loading, deleting or updating large volumes of data using Teradata Utilities. [box type="note" align="" class="" width=""]This article is an excerpt from Teradata Cookbook co-authored by Abhinav Khandelwal and Rajsekhar Bhamidipati. This book provides recipes to simplify the daily tasks performed by database administrators (DBA) along with providing efficient data warehousing solutions in Teradata database system.[/box] Resolving FastLoad error 2652 When data is being loaded via FastLoad, a table lock is placed on the target table. This means that the table is unavailable for any other operation. A lock on a table is only released when FastLoad encounters the END LOADING command, which terminates phase 2, the so-called application phase. FastLoad may get terminated in phase 1 due to any of the following reasons: Load script results in failure (error code 8 or 12) Load script is aborted by admin or some other session FastLoad fails due to bad record or file Forgetting to add end loading statement in script If so, it keeps a lock on the table, which needs to be released manually. In this recipe, we will see the steps to release FastLoad locks. Getting ready Identify the table on which FastLoad is been ended prematurely and tables are in locked state. You need to have valid credentials for the Teradata Database. Execute the dummy FastLoad script from the same user or the user which has write access to the lock table. A user requires the following privileges/rights in order to execute the FastLoad: SELECT and INSERT (CREATE and DROP or DELETE) access to the target or loading table CREATE and DROP TABLE on error tables SELECT, INSERT, UPDATE, and DELETE are required privileges for the user PUBLIC on the restart log table (SYSADMIN.FASTLOG). There will be a row in the FASTLOG table for each FastLoad job that has not completed in the system. How to do it... Open a notepad and create the following script: .LOGON 127.0.0.1/dbc, dbc; /* Vaild system name and credentials to your system */ .DATABASE Database_Name; /* database under which locked table is */ erorfiles errortable_name, uv_tablename /* same error table name as in script */ begin loading locked_table; /* table which is getting 2652 error */ .END LOADING; /* to end pahse 2 and release the lock */ .LOGOFF; Save it as dummy_fl.txt. Open the windows Command Prompt and execute this using the FastLoad command, as shown in the following screenshot: This dummy script with no insert statement should release the lock on the target Table. Execute Select on the locked table to see if the lock is released on the table. How it works... As FastLoad is designed to work only on empty tables, it becomes necessary that the loading of the table finishes in one go. If the load script is errored out prematurely in phase 2, without encountering the END loading command, it leaves a lock on loading the table. Fastload locks can't be released via the HUT utility, as there are no technical lock on the table. To execute FastLoad, the following are some requirements: Log table: FastLoad puts its progress information in the fastlog table. EMPTY TABLE: FastLoad needs the table to be empty before inserting rows into that table. TWO ERROR TABLES: FastLoad requires two error tables to be created; you just need to name them, and no ddl is required. The first error table records any translation or constraint violation error, whereas the second error table captures errors related to the duplication of values for Unique Primary Indexes (UPI). After the completion of FastLoad, you can analyze these error tables as to why the records got rejected. There's more... If this does not fix the issue, you need to drop the target table and error tables associated with it. Before proceeding with dropping tables, check with the administrator to abort any FastLoad sessions associated with this table. Resolving MLOAD error 2571 MLOAD works in five phases, unlike FastLoad, which only works in two phases. MLOAD can fail in either phase three or four. Figure shows 5 stages of MLOAD. Preliminary: Basic setup. Syntax checking, establishing session with the Teradata Database, creation of error tables (two error tables per target table), and the creation of work tables and log tables are done in this phase. DML Transaction phase: Request is parse through PE and a step plan is generated. Steps and DML are then sent to AMP and stored in appropriate work tables for each target table. Input data sent will be stored in these work tables, which will be applied to the target table later on. Acquisition phase: Unsorted data is sent to AMP in blocks of 64K. Rows are hashed by PI and sent to appropriate AMPs. Utility places locks on target tables in preparation for the application phase to apply rows in target tables. Application phase: Changes are applied to target tables and NUSI subtables. Lock on table is held in this phase. Cleanup phase: If the error code of all the steps is 0, MLOAD successfully completes and releases all the locks on the specified table. This being the case, all empty error tables, worktables, and the log table are dropped. Getting ready Identify the table which is getting affected by error 2571. Make sure no host utility is running on this table and the load job is in a failed state for this table. How to do it... Check on viewpoint for any active utility job for this table. If you find any active job, let it complete. If there is a reason that you need to release the lock, first abort all the sessions of the host utility from viewpoint. Ask your administrator to do it. Execute the following command: RELEASE MLOAD <databasename.tablename>; > If you get a Not able to release MLOAD Lock error, execute the following Command: /* Release lock in application phase */ RELEASE MLOAD <databasename.tablename> in apply; Once the locks are released you need to drop all the associated error tables, the log table, and work tables with it. Re-execute MLOAD after correcting the error. How it works... The Mload utility places a lock in table headers to alert other utilities that a MultiLoad is in session for this table. They include: Acquisition lock: DML allows all DDL allows DROP only Application lock: DML allows SELECT with ACCESS only DDL allows DROP only There's more... If the release lock statement still gives an error and does not release the lock on the table, you need to use SELECT with the ACCESS lock to copy the content of the locked table to a new one and drop the locked tables. If you start receiving the error 7446 Mload table %ID cannot be released because NUSI exists, you need to drop all the NUSI on the table and use ALTER Table to nonfallback to accomplish the task. Resolving failure 7547 This error is associated with the UPDATE statement, which could be SQL based or could be in MLOAD. Various times, while updating the set of rows in a table, the update fails on Failure 7547 Target row updated by multiple source rows. This error will happen when you update the target with multiple rows from the source. This means there are duplicated values present in the source tables. Getting ready Let's create sample volatile tables and insert values into them. After that, we will execute the UPDATE command, which will fail to result in 7547: Create a TARGET TABLE with the following DDL and insert values into it: ** TARGET TABLE** create volatile table accounts ( CUST_ID, CUST_NAME, Sal )with data primary index(cust_id) insert values (1,'will',2000); insert values (2,'bekky',2800); insert values (3,'himesh',4000); Create a SOURCE TABLE with the following DDL and insert values into it: ** SOURCE TABLE** create volatile table Hr_payhike ( CUST_ID, CUST_NAME, Sal_hike ) with data primary index(cust_id) insert values (1,'will',2030); insert values (1,'bekky',3800); insert values (3,'himesh',7000); Execute the MLOAD script. Following the snippet from the MLOAD script, only update part (which will fail): /* Snippet from MLOAD update */ UPDATE ACC FROM ACCOUNTS ACC , Hr_payhike SUPD SET Sal= TUPD.Sal_hike WHERE Acc.CUST_ID = SUPD.CUST_ID; Failure: Target row updated by multiple source rows How to do it... Check for duplicate values in the source table using the following: /*Check for duplicate values in source table*/ SELECT cust_id,count(*) from Hr_payhike group by 1 order by 2 desc The output will be generated with CUST_ID =1 and has two values which are causing errors. The reason for this is that while updating the TARGET table, the optimizer won't be able to understand from which row it should update the TARGET row. Who's salary will be updated Will or Bekky? To resolve the error, execute the following update query: /* Update part of MLOAD */ UPDATE ACC FROM ACCOUNTS ACC , ( SELECT CUST_ID, CUST_NAME, SAL_HIKE FROM Hr_payhike QUALIFY ROW_NUMBER() OVER (PARTITION BY CUST_ID ORDER BY CUST_NAME,SAL_HIKE DESC)=1) SUPD SET Sal= SUPD.Sal_hike WHERE Acc.CUST_ID = SUPD.CUST_ID; Now, the update will run without error. How it works... Failure will happen when you update the target with multiple rows from the source. If you defined a primary index column for your target, and if those columns are in an update query condition, this error will occur. To further resolve this, you can delete the duplicate from the source table itself and execute the original update without any modification. But if the source data can't be changed, then you need to change the update statement. To summarize, we have successfully learned how to overcome or prevent errors while using utilities for loading data into database. You could also check out the Teradata Cookbook for more than 100 recipes on enterprise data warehousing solutions. 2018 is the year of graph databases. Here’s why. 6 reasons to choose MySQL 8 for designing database solutions Amazon Neptune, AWS’ cloud graph database, is now generally available

0
0
18203

article-image-3-ways-to-use-indexes-in-teradata-to-improve-database-performance

Pravin Dhandre

11 Jun 2018

15 min read

3 ways to use Indexes in Teradata to improve database performance

Pravin Dhandre

11 Jun 2018

15 min read

0
0
20332

article-image-feedforward-networks-tensorflow

Aarthi Kumaraswamy

07 Jun 2018

12 min read

Implementing feedforward networks with TensorFlow

Aarthi Kumaraswamy

07 Jun 2018

12 min read

0
0
15658

article-image-how-tflearn-makes-building-tensorflow-models-easier

Savia Lobo

04 Jun 2018

7 min read

How TFLearn makes building TensorFlow models easier

Savia Lobo

04 Jun 2018

7 min read

Today, we will introduce you to TFLearn, and will create layers and models which are directly beneficial in any model implementation with Tensorflow. TFLearn is a modular library in Python that is built on top of core TensorFlow. [box type="note" align="" class="" width=""]This article is an excerpt taken from the book Mastering TensorFlow 1.x written by Armando Fandango. In this book, you will learn how to build TensorFlow models to work with multilayer perceptrons using Keras, TFLearn, and R.[/box] TIP: TFLearn is different from the TensorFlow Learn package which is also known as TF Learn (with one space in between TF and Learn). It is available at the following link; and the source code is available on GitHub. TFLearn can be installed in Python 3 with the following command: pip3 install tflearn Note: To install TFLearn in other environments or from source, please refer to the following link: http://tflearn.org/installation/ The simple workflow in TFLearn is as follows: Create an input layer first. Pass the input object to create further layers. Add the output layer. Create the net using an estimator layer such as regression. Create a model from the net created in the previous step. Train the model with the model.fit() method. Use the trained model to predict or evaluate. Creating the TFLearn Layers Let us learn how to create the layers of the neural network models in TFLearn: Create an input layer first: input_layer = tflearn.input_data(shape=[None,num_inputs] Pass the input object to create further layers: layer1 = tflearn.fully_connected(input_layer,10, activation='relu') layer2 = tflearn.fully_connected(layer1,10, activation='relu') Add the output layer: output = tflearn.fully_connected(layer2,n_classes, activation='softmax') Create the final net from the estimator layer such as regression: net = tflearn.regression(output, optimizer='adam', metric=tflearn.metrics.Accuracy(), loss='categorical_crossentropy' ) The TFLearn provides several classes for layers that are described in following sub-sections. TFLearn core layers TFLearn offers the following layers in the tflearn.layers.core module: Layer classDescriptioninput_dataThis layer is used to specify the input layer for the neural network.fully_connectedThis layer is used to specify a layer where all the neurons are connected to all the neurons in the previous layer.dropoutThis layer is used to specify the dropout regularization. The input elements are scaled by 1/keep_prob while keeping the expected sum unchanged.Layer classDescriptioncustom_layerThis layer is used to specify a custom function to be applied to the input. This class wraps our custom function and presents the function as a layer.reshapeThis layer reshapes the input into the output of specified shape.flattenThis layer converts the input tensor to a 2D tensor.activationThis layer applies the specified activation function to the input tensor.single_unitThis layer applies the linear function to the inputs.highwayThis layer implements the fully connected highway function.one_hot_encodingThis layer converts the numeric labels to their binary vector one-hot encoded representations.time_distributedThis layer applies the specified function to each time step of the input tensor.multi_target_dataThis layer creates and concatenates multiple placeholders, specifically used when the layers use targets from multiple sources. TFLearn convolutional layers TFLearn offers the following layers in the tflearn.layers.conv module: Layer classDescriptionconv_1dThis layer applies 1D convolutions to the input dataconv_2dThis layer applies 2D convolutions to the input dataconv_3dThis layer applies 3D convolutions to the input dataconv_2d_transposeThis layer applies transpose of conv2_d to the input dataconv_3d_transposeThis layer applies transpose of conv3_d to the input dataatrous_conv_2dThis layer computes a 2-D atrous convolutiongrouped_conv_2dThis layer computes a depth-wise 2-D convolutionmax_pool_1dThis layer computes 1-D max poolingmax_pool_2dThis layer computes 2D max poolingavg_pool_1dThis layer computes 1D average poolingavg_pool_2dThis layer computes 2D average poolingupsample_2dThis layer applies the row and column wise 2-D repeat operationupscore_layerThis layer implements the upscore as specified in http://arxiv. org/abs/1411.4038global_max_poolThis layer implements the global max pooling operationglobal_avg_poolThis layer implements the global average pooling operationresidual_blockThis layer implements the residual block to create deep residual networksresidual_bottleneckThis layer implements the residual bottleneck block for deep residual networksresnext_blockThis layer implements the ResNeXt block TFLearn recurrent layers TFLearn offers the following layers in the tflearn.layers.recurrent module: Layer classDescriptionsimple_rnnThis layer implements the simple recurrent neural network modelbidirectional_rnnThis layer implements the bi-directional RNN modellstmThis layer implements the LSTM modelgruThis layer implements the GRU model TFLearn normalization layers TFLearn offers the following layers in the tflearn.layers.normalization module: Layer classDescriptionbatch_normalizationThis layer normalizes the output of activations of previous layers for each batchlocal_response_normalizationThis layer implements the LR normalizationl2_normalizationThis layer applies the L2 normalization to the input tensors TFLearn embedding layers TFLearn offers only one layer in the tflearn.layers.embedding_ops module: Layer classDescriptionembeddingThis layer implements the embedding function for a sequence of integer IDs or floats TFLearn merge layers TFLearn offers the following layers in the tflearn.layers.merge_ops module: Layer classDescriptionmerge_outputsThis layer merges the list of tensors into a single tensor, generally used to merge the output tensors of the same shapemergeThis layer merges the list of tensors into a single tensor; you can specify the axis along which the merge needs to be done TFLearn estimator layers TFLearn offers only one layer in the tflearn.layers.estimator module: Layer classDescriptionregressionThis layer implements the linear or logistic regression While creating the regression layer, you can specify the optimizer and the loss and metric functions. TFLearn offers the following optimizer functions as classes in the tflearn.optimizers module: SGD RMSprop Adam Momentum AdaGrad Ftrl AdaDelta ProximalAdaGrad Nesterov Note: You can create custom optimizers by extending the tflearn.optimizers.Optimizer base class. TFLearn offers the following metric functions as classes or ops in the tflearn.metrics module: Accuracy or accuracy_op Top_k or top_k_op R2 or r2_op WeightedR2 or weighted_r2_op Binary_accuracy_op Note : You can create custom metrics by extending the tflearn.metrics.Metric base class. TFLearn provides the following loss functions, known as objectives, in the tflearn.objectives module: Softymax_categorical_crossentropy categorical_crossentropy binary_crossentropy Weighted_crossentropy mean_square hinge_loss roc_auc_score Weak_cross_entropy_2d While specifying the input, hidden, and output layers, you can specify the activation functions to be applied to the output. TFLearn provides the following activation functions in the tflearn.activations module: linear tanh Sigmoid softmax softplus Softsign relu relu6 leaky_relu Prelu elu Crelu selu Creating the TFLearn Model Create the model from the net created in the previous step (step 4 in creating the TFLearn layers section): model = tflearn.DNN(net) Types of TFLearn models The TFLearn offers two different classes of the models: DNN (Deep Neural Network) model: This class allows you to create a multilayer perceptron from the network that you have created from the layers SequenceGenerator model: This class allows you to create a deep neural network that can generate sequences Training the TFLearn Model After creating, train the model with the model.fit() method: model.fit(X_train, Y_train, n_epoch=n_epochs, batch_size=batch_size, show_metric=True, run_id='dense_model') Using the TFLearn Model Use the trained model to predict or evaluate: score = model.evaluate(X_test, Y_test) print('Test accuracy:', score[0]) The complete code for the TFLearn MNIST classification example is provided in the notebook ch-02_TF_High_Level_Libraries. The output from the TFLearn MNIST example is as follows: Training Step: 5499 | total loss: 0.42119 | time: 1.817s | Adam | epoch: 010 | loss: 0.42119 - acc: 0.8860 -- iter: 54900/55000 Training Step: 5500 | total loss: 0.40881 | time: 1.820s | Adam | epoch: 010 | loss: 0.40881 - acc: 0.8854 -- iter: 55000/55000 -- Test accuracy: 0.9029 Note: You can get more information about TFLearn from the following link: http://tflearn.org/. To summarize, we got to know about TFLearn and the different TFLearn layers and models. If you found this post useful, do check out this book Mastering TensorFlow 1.x, to explore advanced features of TensorFlow 1.x, and gain insight into TensorFlow Core, Keras, TF Estimators, TFLearn, TF Slim, Pretty Tensor, and Sonnet. TensorFlow.js 0.11.1 releases! How to Build TensorFlow Models for Mobile and Embedded devices Distributed TensorFlow: Working with multiple GPUs and servers

0
0
24333

article-image-data-cleaning-worst-part-of-data-analysis

Amey Varangaonkar

04 Jun 2018

5 min read

Data cleaning is the worst part of data analysis, say data scientists

Amey Varangaonkar

04 Jun 2018

5 min read

The year was 2012. Harvard Business Review had famously declared the role of data scientist as the ‘sexiest job of the 21st century’. Companies were slowly working with more data than ever before. The real actionable value of the data that could be used for commercial purposes was slowly beginning to uncover. Someone who could derive these actionable insights from the data was needed. The demand for data scientists was higher than ever. Fast forward to 2018 - more data has been collected in the last 2 years than ever before. Data scientists are still in high demand, and the need for insights is higher than ever. There has been one significant change, though - the process of deriving insights has become more complex. If you ask the data scientists, the first initial phase of this process, which involves data cleansing, has become a lot more cumbersome. So much so, that it is no longer a myth that data scientists spend almost 80% of their time cleaning and readying the data for analysis. Why data cleaning is a nightmare In the recently conducted Packt Skill-Up survey, we asked data professionals what the worst part of the data analysis process was, and a staggering 50% responded with data cleaning. Source: Packt Skill Up Survey We dived deep into this, and tried to understand why many data science professionals have this common feeling of dislike towards data cleaning, or scrubbing - as many call it. Read the Skill Up report in full. Sign up to our weekly newsletter and download the PDF for free. There is no consistent data format Organizations these days work with a lot of data. Some of it is in a structured, readily understandable format. This kind of data is usually quite easy to clean, parse and analyze. However, some of the data is really messy, and cannot be used as is for analysis. This includes missing data, irregularly formatted data, and irrelevant data which is not worth analyzing at all. There is also the problem of working with unstructured data which needs to be pre-processed to get the data worth analyzing. Audio or video files, email messages, presentations, xml documents and web pages are some classic examples of this. There’s too much data to be cleaned The volume of data that businesses deal with on a day to day basis is in the scale of terabytes or even petabytes. Making sense of all this data, coming from a variety of sources and in different formats is, undoubtedly, a huge task. There are a whole host of tools designed to ease this process today, but it remains an incredibly tricky challenge to sift through the large volumes of data and prepare it for analysis. Data cleaning is tricky and time-consuming Data cleansing can be quite an exhaustive and time-consuming task, especially for data scientists. Cleaning the data requires removal of duplications, removing or replacing missing entries, correcting misfielded values, ensuring consistent formatting and a host of other tasks which take a considerable amount of time. Once the data is cleaned, it needs to be placed in a secure location. Also, a log of the entire process needs to be kept to ensure the right data goes through the right process. All of this requires the data scientists to create a well-designed data scrubbing framework to avoid the risk of repetition. All of this is more of a grunt work and requires a lot of manual effort. Sadly, there are no tools in the market which can effectively automate this process. Outsourcing the process is expensive Given that data cleaning is a rather tedious job, many businesses think of outsourcing the task to third party vendors. While this reduces a lot of time and effort on the company’s end, it definitely increases the cost of the overall process. Many small and medium scale businesses may not be able to afford this, and thus are heavily reliant on the data scientist to do the job for them. You can hate it, but you cannot ignore it It is quite obvious that data scientists need clean, ready-to-analyze data if they are to to extract actionable business insights from it. Some data scientists equate data cleaning to donkey work, suggesting there’s not a lot of innovation involved in this process. However, some believe data cleaning is rather important, and pay special attention to it given once it is done right, most of the problems in data analysis are solved. It is very difficult to take advantage of the intrinsic value offered by the dataset if it does not adhere to the quality standards set by the business, making data cleaning a crucial component of the data analysis process. Now that you know why data cleaning is essential, why not dive deeper into the technicalities? Check out our book Practical Data Wrangling for expert tips on turning your noisy data into relevant, insight-ready information using R and Python. Read more Cleaning Data in PDF Files 30 common data science terms explained How to create a strong data science project portfolio that lands you a job

0
2
62843

How-To Tutorials - Data

Configuring and deploying HBase [Tutorial]

Administration rights for Power BI users

Build and train an RNN chatbot using TensorFlow [Tutorial]

How to interact with HBase using HBase shell [Tutorial]

Use TensorFlow and NLP to detect duplicate Quora questions [Tutorial]

Splunk: How to work with multiple indexes [Tutorial]

Handling backup and recovery in PostgreSQL 10 [Tutorial]

What are data professionals planning to learn this year? Python, deep learning, yes. But also...

Alarming ways governments are using surveillance tech to watch you

5 ways to create a connection to the Qlik Engine [Tip]

Trending Topics

How to prevent errors while using utilities for loading data in Teradata

3 ways to use Indexes in Teradata to improve database performance

Implementing feedforward networks with TensorFlow

How TFLearn makes building TensorFlow models easier

Data cleaning is the worst part of data analysis, say data scientists

Create a Free Account To Continue Reading

SignIn Free Account To Continue Reading