Data | Tech News, Tutorials & Expert Insights

article-image-tensorflow-2-0-released-tighter-keras-integration-eager-execution-enabled-by-default

03 Oct 2019

5 min read

TensorFlow 2.0 released with tighter Keras integration, eager execution enabled by default, and more!

03 Oct 2019

After releasing the beta version of TensorFlow 2.0 in June, Google announced its final release on Monday. This release comes with tighter integration with Keras, eager execution enabled by default, promises three times faster training performance, a cleaned-up API, and more. Key updates in TensorFlow 2.0 Tighter Keras integration for better developer productivity One of the important updates in TensorFlow 2.0 is its tighter integration with Keras, a popular high-level API used for easy and fast prototyping, building, and training deep learning models. This will enable developers to easily leverage its various model-building APIs including Sequential, Functional, and Subclassing. Explaining the motivation behind this change, the TensorFlow team wrote, “By establishing Keras as the high-level API for TensorFlow, we are making it easier for developers new to machine learning to get started with TensorFlow. A single high-level API reduces confusion and enables us to focus on providing advanced capabilities for researchers.” Eager execution enabled by default In TensorFlow 1.x, developers were required to define an abstract data structure named Graph and to run this graph they needed an encapsulation called Session. TensorFlow 2.0 has eager execution enabled by default to “eagerly” run code, similar to normal Python code. Eager execution enables fast iteration and intuitive debugging without building a graph. It also makes creating and experimenting with models using TensorFlow much easier. It can be especially useful when using the tf.keras model subclassing API. Also Read: Keras 2.3.0, the first release of multi-backend Keras with TensorFlow 2.0 support is now out Distribution Strategy API The Distribution Strategy API in TensorFlow 2.0 allows machine learning researchers to distribute training across a wide variety of compute configurations. This will allow them to “attain great out-of-the-box performance” with minimal code changes. This release also allows distributed training with Keras’ model.fit and custom training loops. Performance improvements on GPUs TensorFlow 2.0 includes multi-GPU support and experimental support for multi worker and Cloud TPUs. This release also has a number of performance improvements on GPUs. It promises three times faster training performance when using mixed precision on NVIDIA’s Volta and Turing GPUs. It includes tight integration with NVIDIA TensorRT, a platform for high-performance deep learning inference. The standardized SavedModel file format The SavedModel API allows you to save your trained ML model into a language-neutral format. With TensorFlow 2.0, all TensorFlow ecosystem projects including TensorFlow Lite, TensorFlow JS, TensorFlow Serving, and TensorFlow Hub, support SavedModels. Standardizing the SavedModel file format will enable developers to run their models on a variety of runtimes including the cloud, web, browser, Node.js, mobile, and embedded systems. “This allows you to run your models with TensorFlow, deploy them with TensorFlow Serving, use them on mobile and embedded systems with TensorFlow Lite, and train and run in the browser or Node.js with TensorFlow.js,” the team writes. API simplification TensorFlow 2.0 includes a number of API updates. Many API symbols are removed or renamed for better consistency and clarity. Also, the tf.app, tf.flags, and tf.logging API are removed in favor of abseil-py. Because of the huge number of API changes, developers in a discussion on Hacker News expressed that transitioning from TensorFlow 1.X to TensorFlow 2.0 is quite complicated. Some also mentioned switching to PyTorch instead. A user commented, “As someone who uses TensorFlow a lot, I predict an enormous clusterfuck of a transition. Tensorflow has turned into a multiheaded monster, supporting many things and approaches but none of them very well...In my opinion, there are some architectural problems with TF, which have not been addressed in this update...If you need to transition from TF1 to TF2, consider doing the TF1 to PyTorch transition instead.” While some others were happy with the recommended Keras API and eager execution. “I don't know if I'm the only one, but I actually love the changes they've made since v1. Eager execution and tf.function are fantastic, and the built-in Keras is even better than the standalone version. A big improvement compared to TF from last year,” a user commented on Reddit. Another user added, “The most important change in terms of usability, IMO, is the use of tf.keras as the recommended interface to TensorFlow. There hasn't been a case yet where I've needed to dip outside of Keras into raw TensorFlow, but the option is there and is easy to do. That said, TF 2.0 changes a lot. Many repos might break, so expect to see lots of tensorflow==1.14 in requirement.txt files from now on.” These were some of the updates in TensorFlow 2.0. Check out the official announcement and release notes to know more in detail. Transformers 2.0: NLP library with deep interoperability between TensorFlow 2.0 and PyTorch, and 32+ pretrained models in 100+ languages TensorFlow 2.0 to be released soon with eager execution, removal of redundant APIs, tf function and more Introducing TensorFlow Graphics packed with TensorBoard 3D, object transformations, and much more Train a convolutional neural network in Keras and improve it with data augmentation [Tutorial] Train a convolutional neural network in Keras and improve it with data augmentation [Tutorial]

0
0
32074

article-image-transformers-2-0-nlp-library-with-deep-interoperability-between-tensorflow-2-0-and-pytorch

Fatema Patrawala

30 Sep 2019

3 min read

Transformers 2.0: NLP library with deep interoperability between TensorFlow 2.0 and PyTorch, and 32+ pretrained models in 100+ languages

Fatema Patrawala

30 Sep 2019

3 min read

Last week, Hugging Face, a startup specializing in natural language processing, released a landmark update to their popular Transformers library, offering unprecedented compatibility between two major deep learning frameworks, PyTorch and TensorFlow 2.0. Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between TensorFlow 2.0 and PyTorch. Transformers 2.0 embraces the ‘best of both worlds’, combining PyTorch’s ease of use with TensorFlow’s production-grade ecosystem. The new library makes it easier for scientists and practitioners to select different frameworks for the training, evaluation and production phases of developing the same language model. “This is a lot deeper than what people usually think when they talk about compatibility,” said Thomas Wolf, who leads Hugging Face’s data science team. “It’s not only about being able to use the library separately in PyTorch and TensorFlow. We’re talking about being able to seamlessly move from one framework to the other dynamically during the life of the model.” https://twitter.com/Thom_Wolf/status/1177193003678601216 “It’s the number one feature that companies asked for since the launch of the library last year,” said Clement Delangue, CEO of Hugging Face. Notable features in Transformers 2.0 8 architectures with over 30 pretrained models, in more than 100 languages Load a model and pre-process a dataset in less than 10 lines of code Train a state-of-the-art language model in a single line with the tf.keras fit function Share pretrained models, reducing compute costs and carbon footprint Deep interoperability between TensorFlow 2.0 and PyTorch models Move a single model between TF2.0/PyTorch frameworks at will Seamlessly pick the right framework for training, evaluation, production As powerful and concise as Keras About Hugging Face Transformers With half a million installs since January 2019, Transformers is the most popular open-source NLP library. More than 1,000 companies including Bing, Apple or Stitchfix are using it in production for text classification, question-answering, intent detection, text generation or conversational. Hugging Face, the creators of Transformers, have raised US$5M so far from investors in companies like Betaworks, Salesforce, Amazon and Apple. On Hacker News, users are appreciating the company and how Transformers has become the most important library in NLP. Other interesting news in data Baidu open sources ERNIE 2.0, a continual pre-training NLP model that outperforms BERT and XLNet on 16 NLP tasks Dr Joshua Eckroth on performing Sentiment Analysis on social media platforms using CoreNLP Facebook open-sources PyText, a PyTorch based NLP modeling framework

0
0
29544

article-image-can-a-modified-mit-hippocratic-license-to-restrict-misuse-of-open-source-software-prompt-a-wave-of-ethical-innovation-in-tech

Savia Lobo

24 Sep 2019

5 min read

Can a modified MIT ‘Hippocratic License’ to restrict misuse of open source software prompt a wave of ethical innovation in tech?

Savia Lobo

24 Sep 2019

5 min read

Open source licenses allow software to be freely distributed, modified, and used. These licenses give developers an additional advantage of allowing others to use their software as per their own rules and conditions. Recently, software developer and open-source advocate Coraline Ada Ehmke has caused a stir in the software engineering community with ‘The Hippocratic License.’ Ehmke was also the original author of Contributor Covenant, a “code of conduct" for open source projects that encourages participants to use inclusive language and to refrain from personal attacks and harassment. In a tweet posted in September last year, following the code of conduct, she mentioned, “40,000 open source projects, including Linux, Rails, Golang, and everything OSS produced by Google, Microsoft, and Apple have adopted my code of conduct.” [box type="shadow" align="" class="" width=""]The term ‘Hippocratic’ is derived from the Hippocratic Oath, the most widely known of Greek medical texts. The Hippocratic Oath in literal terms requires a new physician to swear upon a number of healing gods that he will uphold a number of professional ethical standards.[/box] Ehmke explained the license in more detail in a post published on Sunday. In it, she highlights how the idea that writing software with the goals of clarity, conciseness, readability, performance, and elegance are limiting, and potentially dangerous.“All of these technologies are inherently political,” she writes. “There is no neutral political position in technology. You can’t build systems that can be weaponized against marginalized people and take no responsibility for them.”The concept of the Hippocratic license is relatively simple. In a tweet, Ehmke said that it “specifically prohibits the use of open-source software to harm others.” Open source software and the associated harm Out of the many privileges that open source software allows such as free redistribution of the software as well as the source code, the OSI also defines there is no discrimination against who uses it or where it will be put to use. A few days ago, a software engineer, Seth Vargo pulled his open-source software, Chef-Sugar, offline after finding out that Chef (a popular open source DevOps company using the software) had recently signed a contract selling $95,000-worth of licenses to the US Immigrations and Customs Enforcement (ICE), which has faced widespread condemnation for separating children from their parents at the U.S. border and other abuses. Vargo took down the Chef Sugar library from both GitHub and RubyGems, the main Ruby package repository, as a sign of protest. In May, this year, Mijente, an advocacy organization released documents stating that Palantir was responsible for the 2017 ICE operation that targeted and arrested family members of children crossing the border alone. Also, in May 2018, Amazon employees, in a letter to Jeff Bezos, protested against the sale of its facial recognition tech to Palantir where they “refuse to contribute to tools that violate human rights”, citing the mistreatment of refugees and immigrants by ICE. Also, in July, the WYNC revealed that Palantir’s mobile app FALCON was being used by ICE to carry out raids on immigrant communities as well as enable workplace raids in New York City in 2017. Founder of OSI responds to Ehmke’s Hippocratic License Bruce Perens, one of the founders of the Open Source movement in software, responded to Ehmke in a post titled “Sorry, Ms. Ehmke, The “Hippocratic License” Can’t Work” . “The software may not be used by individuals, corporations, governments, or other groups for systems or activities that actively and knowingly endanger harm, or otherwise threaten the physical, mental, economic, or general well-being of underprivileged individuals or groups,” he highlights in his post. “The terms are simply far more than could be enforced in a copyright license,” he further adds. “Nobody could enforce Ms. Ehmke’s license without harming someone, or at least threatening to do so. And it would be easy to make a case for that person being underprivileged,” he continued. He concluded saying that, though the terms mentioned in Ehmke’s license were unagreeable, he will “happily support Ms. Ehmke in pursuit of legal reforms meant to achieve the protection of underprivileged people.” Many have welcomed Ehmke's idea of an open source license with an ethical clause. However, the license is not OSI approved yet and chances are slim after Perens’ response. There are many users who do not agree with the license. Reaching a consensus will be hard. https://twitter.com/seannalexander/status/1175853429325008896 https://twitter.com/AdamFrisby/status/1175867432411336704 https://twitter.com/rishmishra/status/1175862512509685760 Even though developers host their source code on open source repositories, a license may bring certain level of restrictions on who is allowed to use the code. However, as Perens mentions, many of the terms in Ehmke’s license hard to implement. Irrespective of the outcome of this license’s approval process, Coraline Ehmke has widely opened up the topic of the need for long overdue FOSS licensing reforms in the open source community. It would be interesting to see if such a license would boost ethical reformation by giving more authority to the developers in imbibing their values and preventing the misuse of their software. Read the Hippocratic license to know more in detail. Other interesting news Tech ImageNet Roulette: New viral app trained using ImageNet exposes racial biases in artificial intelligent system Machine learning ethics: what you need to know and what you can do Facebook suspends tens of thousands of apps amid an ongoing investigation into how apps use personal data

0
0
25773

article-image-machine-learning-ethics-what-you-need-to-know-and-what-you-can-do

Richard Gall

23 Sep 2019

10 min read

Machine learning ethics: what you need to know and what you can do

Richard Gall

23 Sep 2019

10 min read

0
0
36118

article-image-bad-metadata-can-get-you-in-legal-hot-water

Guest Contributor

21 Sep 2019

6 min read

Bad Metadata can get you in legal hot water

Guest Contributor

21 Sep 2019

6 min read

Metadata isn't just something that concerns business intelligence and IT teams; but lawyers are extremely interested in it as well. Metadata, it turns out, can win or lose lawsuits, send politicians to jail, and even decide medical malpractice cases. It's not uncommon for attorneys who conduct discovery of electronic records in organizations to find that the claims of plaintiffs or defendants are contradicted by metadata, like time and date, type of data, etc. If a discovery process is initiated against them, an organization had better be sure that its metadata is in order. All it would take for an organization to lose a case would be for an attorney to discover a discrepancy in different databases – a different time stamp on some communication, a different job title for a principal in the case. Such discrepancies could lead to accusations of data tampering, fraud, or worse – and would most definitely put the organization in a very tough position versus a judge or jury. Metadata errors are difficult to spot The problem with that, of course, is that catching metadata errors is extremely difficult. In large organizations, data is stored in repositories that are spread throughout the organization, maybe even the world – in different departments. Each department is responsible for maintaining its own database, and the metadata in it; and on different cloud storage repositories, which may have their own system of classifying data. An enterprising attorney could have a field day with the different categories and tags data is stored under, making claims that the organization is trying to “hide something.” The organization's only defense: We're poor administrators. That may not be enough to impress the court. Types of Metadata Metadata is “data about data,” and comes in three flavors: System Metadata, which is data that is automatically generated from the computer and includes specific labeled criteria, like the date and time of creation and date a document was modified, etc. Substantive Metadata reflects changes to a document, like tracked changes. Embedded metadata is data entered into a document or file but not normally visible, such as formulas in cells in an Excel spreadsheet. All of these have increasingly become targets for attorneys in recent years. Metadata has been used in thousands of cases – medical, financial, patent and trademark law, product liability, civil rights, and many more. Metadata is both discoverable and admissible as evidence. According to one New York court, “General information about the creation of a document, including who authored a document and when it was created, is pedigree information often important for purposes of determining admissibility at trial.” According to legal experts, “from a legal standpoint metadata is evidence… that describes the characteristics, origins, usage, and validity of other electronic evidence.” The biggest metadata-linked payout until now - $10.8 million – occurred in 2017, when a jury awarded a plaintiff $8 million (eventually this was increased to nearly $11 million) after claiming he was fired from a biotechnology company after telling authorities about potential bribery in China. The key piece of evidence was the metadata timestamp on a performance review that was written after the plaintiff was fired; with that evidence, the court increased the defendant's payout for violating laws against firing whistleblowers. In that case, records claiming that the employee was fired for cause were belied by the metadata in the performance review. That, of course, was a case in which there was clear wrongdoing by an organization. But the same metadata errors could have cropped up in any number of scenarios, even if no laws were broken. The precedent in this case, and others like it, might be enough to convince the court to penalize an organization based on claims of a plaintiff. How can organizations defend themselves from this legal bind of metadata The answer would seem obvious; get control of your metadata and make sure it corresponds to the data it represents. With that kind of control over data, organizations would discover for themselves if something was amiss that could cost them in a settlement later. But execution of that obvious answer is a different story. With reams of data to pore through, it would take an organization's business intelligence team months, or even years, to manually sift through the databases. And because to err is human, there would be no guarantee they hadn't missed something. Clearly Business Intelligence and Data Analysis teams need some help in doing this. One solution would be to hire more staff, expanding teams at least temporarily to make sense of the data and metadata that could prove problematic. There are services that will lend their staff to an organization to do just that, and for companies that prefer the “human touch,” adding that temporary staff may be the best solution. Another idea is to automate the process, with advanced tools that will do a full examination of data, both across systems and within repositories themselves. Such automated tools would examine the data in the various repositories and find where the metadata for the same information is different – pointing BI teams in the right direction and cutting down on the time needed to determine what needs to be fixed. Using automated metadata management tools, companies can ensure that they remain secure. If a company is being sued and discovery has commenced, it will be too late for the organization to fix anything. Honest mistakes or disorganized file keeping can no longer be corrected, and the fate of the organization will be in the hands of a jury or a judge. Automated metadata management tools can help Business Intelligence and Data Analysis teams figure out which metadata entries are not consistent across the repositories, ensuring that things are fixed before discovery takes place. There are a variety of tools on the market, with various strengths and weaknesses. Companies will need to decide whether a data dictionary, a business glossary, or a more all-encompassing product best answers their needs. They’ll also need to make sure the enterprise software they currently use is supported by the metadata management solution they are after. As the market develops, AI will be a huge distinguishing factor between metadata solutions, as machine learning will reduce the cost and manpower investment of solution onboarding significantly. With the success of recent metadata-based lawsuits, you can be sure more attorneys will be using metadata in their discovery processes. Organizations that want to defend themselves need to get their data in order, and ensure that they won't end up losing lots of money because of their own errors. Author Bio Amnon Drori is the Co-Founder and CEO of Octopai and has over 20 years of leadership experience in technology companies. Before co-founding Octopai he led sales efforts at companies like Panaya (Acquired by Infosys), Zend Technologies (Acquired by Rogue Wave Software), ModusNovo and Alvarion. Other interesting news in Tech Media manipulation by Deepfakes and cheap fakes require both AI and social fixes, finds a Data and Society report. Open AI researchers advance multi-agent competition by training AI agents in a hide and seek environment. France and Germany reaffirm blocking Facebook’s Libra cryptocurrency

0
0
27245

article-image-how-to-handle-categorical-data-for-machine-learning-algorithms

Packt Editorial Staff

20 Sep 2019

9 min read

How to handle categorical data for machine learning algorithms

Packt Editorial Staff

20 Sep 2019

9 min read

The quality of data and the amount of useful information are key factors that determine how well a machine learning algorithm can learn. Therefore, it is absolutely critical that we make sure to encode categorical variables correctly, before we feed data into a machine learning algorithm. In this article, with simple yet effective examples we will explain how to deal with categorical data in computing machine learning algorithms and how we to map ordinal and nominal feature values to integer representations. The article is an excerpt from the book Python Machine Learning - Third Edition by Sebastian Raschka and Vahid Mirjalili. This book is a comprehensive guide to machine learning and deep learning with Python. It acts as both a clear step-by-step tutorial, and a reference you’ll keep coming back to as you build your machine learning systems. It is not uncommon that real-world datasets contain one or more categorical feature columns. When we are talking about categorical data, we have to further distinguish between nominal and ordinal features. Ordinal features can be understood as categorical values that can be sorted or ordered. For example, t-shirt size would be an ordinal feature, because we can define an order XL > L > M. In contrast, nominal features don't imply any order and, to continue with the previous example, we could think of t-shirt color as a nominal feature since it typically doesn't make sense to say that, for example, red is larger than blue. Categorical data encoding with pandas Before we explore different techniques to handle such categorical data, let's create a new DataFrame to illustrate the problem: >>> import pandas as pd >>> df = pd.DataFrame([ ... ['green', 'M', 10.1, 'class1'], ... ['red', 'L', 13.5, 'class2'], ... ['blue', 'XL', 15.3, 'class1']]) >>> df.columns = ['color', 'size', 'price', 'classlabel'] >>> df color size price classlabel 0 green M 10.1 class1 1 red L 13.5 class2 2 blue XL 15.3 class1 As we can see in the preceding output, the newly created DataFrame contains a nominal feature (color), an ordinal feature (size), and a numerical feature (price) column. The class labels (assuming that we created a dataset for a supervised learning task) are stored in the last column. Mapping ordinal features To make sure that the learning algorithm interprets the ordinal features correctly, we need to convert the categorical string values into integers. Unfortunately, there is no convenient function that can automatically derive the correct order of the labels of our size feature, so we have to define the mapping manually. In the following simple example, let's assume that we know the numerical difference between features, for example, XL = L + 1 = M + 2: >>> size_mapping = { ... 'XL': 3, ... 'L': 2, ... 'M': 1} >>> df['size'] = df['size'].map(size_mapping) >>> df color size price classlabel 0 green 1 10.1 class1 1 red 2 13.5 class2 2 blue 3 15.3 class1 If we want to transform the integer values back to the original string representation at a later stage, we can simply define a reverse-mapping dictionary inv_size_mapping = {v: k for k, v in size_mapping.items()} that can then be used via the pandas map method on the transformed feature column, similar to the size_mapping dictionary that we used previously. We can use it as follows: >>> inv_size_mapping = {v: k for k, v in size_mapping.items()} >>> df['size'].map(inv_size_mapping) 0 M 1 L 2 XL Name: size, dtype: object Encoding class labels Many machine learning libraries require that class labels are encoded as integer values. Although most estimators for classification in scikit-learn convert class labels to integers internally, it is considered good practice to provide class labels as integer arrays to avoid technical glitches. To encode the class labels, we can use an approach similar to the mapping of ordinal features discussed previously. We need to remember that class labels are not ordinal, and it doesn't matter which integer number we assign to a particular string label. Thus, we can simply enumerate the class labels, starting at 0: >>> import numpy as np >>> class_mapping = {label:idx for idx,label in ... enumerate(np.unique(df['classlabel']))} >>> class_mapping {'class1': 0, 'class2': 1} Next, we can use the mapping dictionary to transform the class labels into integers: >>> df['classlabel'] = df['classlabel'].map(class_mapping) >>> df color size price classlabel 0 green 1 10.1 0 1 red 2 13.5 1 2 blue 3 15.3 0 We can reverse the key-value pairs in the mapping dictionary as follows to map the converted class labels back to the original string representation: >>> inv_class_mapping = {v: k for k, v in class_mapping.items()} >>> df['classlabel'] = df['classlabel'].map(inv_class_mapping) >>> df color size price classlabel 0 green 1 10.1 class1 1 red 2 13.5 class2 2 blue 3 15.3 class1 Alternatively, there is a convenient LabelEncoder class directly implemented in scikit-learn to achieve this: >>> from sklearn.preprocessing import LabelEncoder >>> class_le = LabelEncoder() >>> y = class_le.fit_transform(df['classlabel'].values) >>> y array([0, 1, 0]) Note that the fit_transform method is just a shortcut for calling fit and transform separately, and we can use the inverse_transform method to transform the integer class labels back into their original string representation: >>> class_le.inverse_transform(y) array(['class1', 'class2', 'class1'], dtype=object) Performing a technique ‘one-hot encoding’ on nominal features In the Mapping ordinal features section, we used a simple dictionary-mapping approach to convert the ordinal size feature into integers. Since scikit-learn's estimators for classification treat class labels as categorical data that does not imply any order (nominal), we used the convenient LabelEncoder to encode the string labels into integers. It may appear that we could use a similar approach to transform the nominal color column of our dataset, as follows: >>> X = df[['color', 'size', 'price']].values >>> color_le = LabelEncoder() >>> X[:, 0] = color_le.fit_transform(X[:, 0]) >>> X array([[1, 1, 10.1], [2, 2, 13.5], [0, 3, 15.3]], dtype=object) After executing the preceding code, the first column of the NumPy array X now holds the new color values, which are encoded as follows: blue = 0 green = 1 red = 2 If we stop at this point and feed the array to our classifier, we will make one of the most common mistakes in dealing with categorical data. Can you spot the problem? Although the color values don't come in any particular order, a learning algorithm will now assume that green is larger than blue, and red is larger than green. Although this assumption is incorrect, the algorithm could still produce useful results. However, those results would not be optimal. A common workaround for this problem is to use a technique called one-hot encoding. The idea behind this approach is to create a new dummy feature for each unique value in the nominal feature column. Here, we would convert the color feature into three new features: blue, green, and red. Binary values can then be used to indicate the particular color of an example; for example, a blue example can be encoded as blue=1, green=0, red=0. To perform this transformation, we can use the OneHotEncoder that is implemented in scikit-learn's preprocessing module: >>> from sklearn.preprocessing import OneHotEncoder >>> X = df[['color', 'size', 'price']].values >>> color_ohe = OneHotEncoder() >>> color_ohe.fit_transform(X[:, 0].reshape(-1, 1)).toarray() array([[0., 1., 0.], [0., 0., 1.], [1., 0., 0.]]) Note that we applied the OneHotEncoder to a single column (X[:, 0].reshape(-1, 1))) only, to avoid modifying the other two columns in the array as well. If we want to selectively transform columns in a multi-feature array, we can use the ColumnTransformer that accepts a list of (name, transformer, column(s)) tuples as follows: >>> from sklearn.compose import ColumnTransformer >>> X = df[['color', 'size', 'price']].values >>> c_transf = ColumnTransformer([ ... ('onehot', OneHotEncoder(), [0]), ... ('nothing', 'passthrough', [1, 2]) ... ]) >>> c_transf.fit_transform(X) .astype(float) array([[0.0, 1.0, 0.0, 1, 10.1], [0.0, 0.0, 1.0, 2, 13.5], [1.0, 0.0, 0.0, 3, 15.3]]) In the preceding code example, we specified that we only want to modify the first column and leave the other two columns untouched via the 'passthrough' argument. An even more convenient way to create those dummy features via one-hot encoding is to use the get_dummies method implemented in pandas. Applied to a DataFrame, the get_dummies method will only convert string columns and leave all other columns unchanged: >>> pd.get_dummies(df[['price', 'color', 'size']]) price size color_blue color_green color_red 0 10.1 1 0 1 0 1 13.5 2 0 0 1 2 15.3 3 1 0 0 When we are using one-hot encoding datasets, we have to keep in mind that it introduces multicollinearity, which can be an issue for certain methods (for instance, methods that require matrix inversion). If features are highly correlated, matrices are computationally difficult to invert, which can lead to numerically unstable estimates. To reduce the correlation among variables, we can simply remove one feature column from the one-hot encoded array. Note that we do not lose any important information by removing a feature column, though; for example, if we remove the column color_blue, the feature information is still preserved since if we observe color_green=0 and color_red=0, it implies that the observation must be blue. If we use the get_dummies function, we can drop the first column by passing a True argument to the drop_first parameter, as shown in the following code example: >>> pd.get_dummies(df[['price', 'color', 'size']], ... drop_first=True) price size color_green color_red 0 10.1 1 1 0 1 13.5 2 0 1 2 15.3 3 0 0 In order to drop a redundant column via the OneHotEncoder , we need to set drop='first' and set categories='auto' as follows: >>> color_ohe = OneHotEncoder(categories='auto', drop='first') >>> c_transf = ColumnTransformer([ ... ('onehot', color_ohe, [0]), ... ('nothing', 'passthrough', [1, 2]) ... ]) >>> c_transf.fit_transform(X).astype(float) array([[ 1. , 0. , 1. , 10.1], [ 0. , 1. , 2. , 13.5], [ 0. , 0. , 3. , 15.3]]) In this article, we have gone through some of the methods to deal with categorical data in datasets. We distinguished between nominal and ordinal features, and with examples we explained how they can be handled. To harness the power of the latest Python open source libraries in machine learning check out this book Python Machine Learning - Third Edition, written by Sebastian Raschka and Vahid Mirjalili. Other interesting read in data! The best business intelligence tools 2019: when to use them and how much they cost Introducing Microsoft’s AirSim, an open-source simulator for autonomous vehicles built on Unreal Engine Media manipulation by Deepfakes and cheap fakes require both AI and social fixes, finds a Data & Society report

0
0
48131

article-image-github-acquires-semmle-to-secure-open-source-supply-chain-attains-cve-numbering-authority-status

Savia Lobo

19 Sep 2019

5 min read

GitHub acquires Semmle to secure open-source supply chain; attains CVE Numbering Authority status

Savia Lobo

19 Sep 2019

5 min read

Yesterday, GitHub announced that it has acquired Semmle, a code analysis platform provider and also that it is now a Common Vulnerabilities and Exposures (CVE) Numbering Authority. https://twitter.com/github/status/1174371016497405953 The Semmle acquisition is a part of the plan to securing the open-source supply chain, Nat Friedman explains in his blog post. Semmle provides a code analysis engine, named QL, which allows developers to write queries that identify code patterns in large codebases and search for vulnerabilities and their variants. Security researchers use Semmle to quickly find vulnerabilities in code with simple declarative queries. “Semmle is trusted by security teams at Uber, NASA, Microsoft, Google, and has helped find thousands of vulnerabilities in some of the largest codebases in the world, as well as over 100 CVEs in open source projects to date,” Friedman writes. Also Read: GitHub now supports two-factor authentication with security keys using the WebAuthn API Semmle originally spun out of research at Oxford in 2006 announced a $21 million Series B investment led by Accel Partners, last year. “In total, the company raised $31 million before this acquisition,” Techcrunch reports. Shanku Niyogi, Senior Vice President of Product at GitHub, in his blog post writes, “An important measure of the success of Semmle’s approach is the number of vulnerabilities that have been identified and disclosed through their technology. Today, over 100 CVEs in open source projects have been found using Semmle, including high-profile projects like Apache Struts, Apple’s XNU, the Linux Kernel, Memcached, U-Boot, and VLC. No other code analysis tool has a similar success rate.” GitHub also announced that it has been approved as a CVE Numbering Authority for open source projects. Now, GitHub will be able to issue CVEs for security advisories opened on GitHub, allowing for even broader awareness across the industry. With Semmle integration, every CVE-ID can be associated with a Semmle QL query, which can then be shared and tracked by the broader developer community. The CVE approval will make it easier for project maintainers to report security flaws directly from their repositories. Also, GitHub can assign CVE identifiers directly and post them to the CVE List and the National Vulnerability Database (NVD). Earlier this year, GitHub acquired Dependabot, to provide automatic security fixes natively within GitHub. With automatic security fixes, developers no longer need to manually patch their dependencies. When a vulnerability is found in a dependency, GitHub will automatically issue a pull request on downstream repositories with the information needed to accept the patch. In August, GitHub was in the limelight for being a part of the Capital One data breach that affected 106 million users in the US and Canada. The law firm Tycko & Zavareei LLP filed a lawsuit in California’s federal district court on behalf of their plaintiffs Seth Zielicke and Aimee Aballo. Also Read: GitHub acquires Spectrum, a community-centric conversational platform Both plaintiffs claimed Capital One and GitHub were unable to protect user’s personal data. The complaint highlighted that Paige A. Thompson, the alleged hacker stole the data in March, posted about the theft on GitHub in April. According to the lawsuit, “As a result of GitHub’s failure to monitor, remove, or otherwise recognize and act upon obviously-hacked data that was displayed, disclosed, and used on or by GitHub and its website, the Personal Information sat on GitHub.com for nearly three months.” The Semmle acquisition may be GitHub’s move to improve security for users in the future. It would be interesting to know how GitHub will mold security for users with additional CVE approval. A user on Reddit writes, “I took part in a tutorial session Semmle held at a university CS society event, where we were shown how to use their system to write semantic analysis passes to look for things like use-after-free and null pointer dereferences. It was only an hour and a bit long, but I found the query language powerful & intuitive and the platform pretty effective. At the time, you could set up your codebase to run Semmle passes on pre-commit hooks or CI deployments etc. and get back some pretty smart reporting if you had introduced a bug.” The user further writes, “The session focused on Java, but a few other languages were supported as first-class, iirc. It felt kinda like writing an SQL query, but over AST rather than tuples in a table, and using modal logic to choose the selections. It took a little while to first get over the 'wut' phase (like 'how do I even express this'), but I imagine that a skilled team, once familiar with the system, could get a lot of value out of Semmle's QL/semantic analysis, especially for large/enterprise-scale codebases.” https://twitter.com/kurtseifried/status/1174395660960796672 https://twitter.com/timneutkens/status/1174598659310313472 To know more about this announcement in detail, read GitHub’s official blog post. Other news in Data Keras 2.3.0, the first release of multi-backend Keras with TensorFlow 2.0 support is now out Introducing Microsoft’s AirSim, an open-source simulator for autonomous vehicles built on Unreal Engine GQL (Graph Query Language) joins SQL as a Global Standards Project and is now the international standard declarative query language for graphs

0
0
45768

article-image-the-best-business-intelligence-tools-2019-when-to-use-them-and-how-much-they-cost

Richard Gall

19 Sep 2019

11 min read

The best business intelligence tools 2019: when to use them and how much they cost

Richard Gall

19 Sep 2019

11 min read

0
0
31335

article-image-gql-graph-query-language-joins-sql-as-a-global-standards-project-and-is-now-the-international-standard-declarative-query-language-for-graphs

Amrata Joshi

19 Sep 2019

6 min read

GQL (Graph Query Language) joins SQL as a Global Standards Project and will be the international standard declarative query language for graphs

Amrata Joshi

19 Sep 2019

6 min read

On Tuesday, the team at Neo4j, the graph database management system announced that the international committees behind the development of the SQL standard have voted to initiate GQL (Graph Query Language) as the new database query language. GQL is now going to be the international standard declarative query language for property graphs and it is also a Global Standards Project. GQL is developed and maintained by the same international group that maintains the SQL standard. How did the proposal for GQL pass? Last year in May, the initiative for GQL was first time processed in the GQL Manifesto. This year in June, the national standards bodies across the world from the ISO/IEC’s Joint Technical Committee 1 (responsible for IT standards) started voting on the GQL project proposal. The ballot closed earlier this week and the proposal was passed wherein ten countries including Germany, Korea, United States, UK, and China voted in favor. And seven countries agreed to put forward their experts to work on this project. Japan was the only country to vote against in the ballot because according to Japan, existing languages already do the job, and SQL/Property Graph Query extensions along with the rest of the SQL standard can do the same job. According to the Neo4j team, the GQL project will initiate development of next-generation technology standards for accessing data. Its charter mandates building on core foundations that are established by SQL and ongoing collaboration in order to ensure SQL and GQL interoperability and compatibility. GQL would reflect rapid growth in the graph database market by increasing adoption of the Cypher language. Stefan Plantikow, GQL project lead and editor of the planned GQL specification, said, “I believe now is the perfect time for the industry to come together and define the next generation graph query language standard.” Plantikow further added, “It’s great to receive formal recognition of the need for a standard language. Building upon a decade of experience with property graph querying, GQL will support native graph data types and structures, its own graph schema, a pattern-based approach to data querying, insertion and manipulation, and the ability to create new graphs, and graph views, as well as generate tabular and nested data. Our intent is to respect, evolve, and integrate key concepts from several existing languages including graph extensions to SQL.” Keith Hare, who has served as the chair of the international SQL standards committee for database languages since 2005, charted the progress toward GQL, said, “We have reached a balance of initiating GQL, the database query language of the future whilst preserving the value and ubiquity of SQL.” Hare further added, “Our committee has been heartened to see strong international community participation to usher in the GQL project. Such support is the mark of an emerging de jure and de facto standard .” The need for a graph-specific query language Researchers and vendors needed a graph-specific query language because of the following limitations: SQL/PGQ language is restricted to read-only queries SQL/PGQ cannot project new graphs The SQL/PGQ language can only access those graphs that are based on taking a graph view over SQL tables. Researchers and vendors needed a language like Cypher that would cover insertion and maintenance of data and not just data querying. But SQL wasn’t the apt model for a graph-centric language that takes graphs as query inputs and outputs a graph as a result. But GQL, on the other hand, builds in openCypher, a project that brings Cypher to Apache Spark and gives users a composable graph query language. SQL and GQL can work together According to most of the companies and national standards bodies that are supporting the GQL initiative, GQL and SQL are not competitors. Instead, these languages can complement each other via interoperation and shared foundations. Alastair Green, Query Languages Standards & Research Lead at Neo4j writes, “A SQL/PGQ query is in fact a SQL sub-query wrapped around a chunk of proto-GQL.” SQL is a language that is built around tables whereas GQL is built around graphs. Users can use GQL to find and project a graph from a graph. Green further writes, “I think that the SQL standards community has made the right decision here: allow SQL, a language built around tables, to quote GQL when the SQL user wants to find and project a table from a graph, but use GQL when the user wants to find and project a graph from a graph. Which means that we can produce and catalog graphs which are not just views over tables, but discrete complex data objects.” It is still not clear when will the first implementation version of GQL will be out. The official page reads, “The work of the GQL project starts in earnest at the next meeting of the SQL/GQL standards committee, ISO/IEC JTC 1 SC 32/WG3, in Arusha, Tanzania, later this month. It is impossible at this stage to say when the first implementable version of GQL will become available, but it is highly likely that some reasonably complete draft will have been created by the second half of 2020.” Developer community welcomes the new addition Users are excited to see how GQL will incorporate Cypher, a user commented on HackerNews, “It's been years since I've worked with the product and while I don't miss Neo4j, I do miss the query language. It's a little unclear to me how GQL will incorporate Cypher but I hope the initiative is successful if for no other reason than a selfish one: I'd love Cypher to be around if I ever wind up using a GraphDB again.” Few others mistook GQL to be Facebook’s GraphQL and are sceptical about the name. A comment on HackerNews reads, “Also, the name is of course justified, but it will be a mess to search for due to (Facebook) GraphQL.” A user commented, “I read the entire article and came away mistakenly thinking this was the same thing as GraphQL.” Another user commented, “That's quiet an unfortunate name clash with the existing GraphQL language in a similar domain.” Other interesting news in Data Media manipulation by Deepfakes and cheap fakes refquire both AI and social fixes, finds a Data & Society report Percona announces Percona Distribution for PostgreSQL to support open source databases Keras 2.3.0, the first release of multi-backend Keras with TensorFlow 2.0 support is now out

0
0
29596

article-image-4-important-business-intelligence-considerations-for-the-rest-of-2019

Richard Gall

16 Sep 2019

7 min read

4 important business intelligence considerations for the rest of 2019

Richard Gall

16 Sep 2019

7 min read

0
0
37845

article-image-how-artificial-intelligence-and-machine-learning-can-help-us-tackle-the-climate-change-emergency

Vincy Davis

16 Sep 2019

14 min read

How artificial intelligence and machine learning can help us tackle the climate change emergency

Vincy Davis

16 Sep 2019

14 min read

“I don’t want you to be hopeful. I want you to panic. I want you to feel the fear I feel every day. And then I want you to act on changing the climate”- Greta Thunberg Greta Thunberg is a 16-year-old Swedish schoolgirl, who is famously called as a climate change warrior. She has started an international youth movement against climate change and has been nominated as a candidate for the Nobel Peace Prize 2019 for climate activism. According to a recent report by the Intergovernmental Panel (IPCC), climate change is seen as the top global threat by many countries. The effects of climate change is going to make 1 million species go extinct, warns a UN report. The Earth’s rising temperatures are fueling longer and hotter heat waves, more frequent droughts, heavier rainfall, and more powerful hurricanes. Antarctica is breaking. Indonesia, the world's 4th most populous country, just shifted its capital from Jakarta because it's sinking. Singapore's worried investments are moving away. Last year, Europe experienced an 'extreme year' for unusual weather events. After a couple of months of extremely cold weather, heat and drought plagued spring and summer with temperatures well above average in most of the northern and western areas. The UK Parliament has declared ‘climate change emergency’ after a series of intense protests earlier this month. More than 1,200 people were killed across South Asia due to heavy monsoon rains and intense flooding (in some places it was the worst in nearly 30 years). The CampFire, in November 2018, was the deadliest and most destructive in California’s history, causing the death of at least 85 people and destroying about 14,000 homes. Australia’s most populous state New South Wales suffered from an intense drought in 2018. According to a report released by the UN last year, there are “Only 11 Years Left to Prevent Irreversible Damage from Climate Change”. Addressing climate change: How ARTIFICIAL INTELLIGENCE (AI) can help? As seen above, environmental impacts due to climate changes are clear, the list is vast and depressing. It is important to address climate change issues as they play a key role in the workings of a natural ecosystem like change in the nature of global rainfall, diminishing ice-sheets, and other factors on which the human economy and the civilization depends on. With the help of Artificial Intelligence (AI), we can increase our probability of becoming efficient, or at least slow down the damage caused by climate change. In the recently held ICLR 2019 (International Conference on Learning Representations), Emily Shuckburgh, a Climate scientist and deputy head of the Polar Oceans team at the British Antarctic Survey highlighted the need of actionable information on climate risk. It elaborated on how we can monitor, treat and find a solution to the climate changes using machine learning. Also mentioned is, how AI can synthesize and interpolate different datasets within a framework that will allow easy interrogation by users and near-real time ingestion of new data. According to MIT tech review on climate changes, there are three approaches to address climate change: mitigation, navigation and suffering. Technologies generally concentrate on mitigation, but it’s high time that we give more focus to the other two approaches. In a catastrophically altered world, it would be necessary to concentrate on adaptation and suffering. This review states that, the mitigation steps have had almost no help in preserving fossil fuels. Thus it is important for us to learn to adapt to these changes. Building predictive models by relying on masses of data will also help in providing a better idea of how bad the effect of a disaster can be and help us to visualize the suffering. By implementing Artificial Intelligence in these approaches, it will help not only to reduce the causes but also to adapt to these climate changes. Using AI, we can predict the accurate status of climate change, which will help create better futuristic climate models. These predictions can be used to identify our biggest vulnerabilities and risk zones. This will help us to respond in a better way to the impact of climate change such as hurricanes, rising sea levels, and higher temperatures. Let’s see how Artificial Intelligence is being used in all the three approaches - Mitigation: Reducing the severity of climate change Looking at the extreme climatic changes, many researchers have started exploring how AI can step-in to reduce the effects of climate change. These include ways to reduce greenhouse gas emissions or enhance the removal of these gases from the atmosphere. In view of consuming less energy, there has been an active increase in technologies to use energy smartly. One such startup is the ‘Verv’. It is an intelligent IoT hub which uses patented AI technology to give users the authority to take control of their energy usage. This home energy system provides you with information about your home appliances and other electricity data directly from the mains, which helps to reduce your electricity bills and lower your carbon footprint. ‘Igloo Energy’ is another system which helps customers use energy efficiently and save money. It uses smart meters to analyse behavioural, property occupancy and surrounding environmental data inputs to lower the energy consumption of users. ‘Nnergix’ is a weather analytics startup focused in the renewable energy industry. It collects weather and energy data from multiple sources from the industry in order to feed machine learning based algorithms to run several analytic solutions with the main goal to help any system become more efficient during operations and reduce costs. Recently, Google announced that by using Artificial Intelligence, it’s wind energy has boosted up to 20 percent. A neural network is trained on the widely available weather forecasts and historical turbine data. The DeepMind system is configured to predict the wind power output 36 hours ahead of actual generation. The model then recommends to make hourly delivery commitments to the power grid a full day in advance, based on the predictions. Large industrial systems are the cause of 54% of global energy consumption. This high-level of energy consumption is the primary contributor to greenhouse gas emissions. In 2016, Google’s ‘DeepMind’ was able to reduce the energy required to cool Google Data Centers by 30%. Initially, the team made a general purpose learning algorithm which was developed into a full-fledged AI system with features including continuous monitoring and human override. Just last year, Google has put an AI system in charge of keeping its data centers cool. Every five minutes, AI pulls a snapshot of the data center cooling system from thousands of sensors. This data is fed into deep neural networks, which predicts how different choices will affect future energy consumption. The neural networks are trained to maintain the future PUE (Power Usage Effectiveness) and to predict the future temperature and pressure of the data centre over the next hour, to ensure that any tweaks did not take the data center beyond its operating limits. Google has found that the machine learning systems were able to consistently achieve a 30 percent reduction in the amount of energy used for cooling, the equivalent of a 15 percent reduction in overall PUE. As seen, there are many companies trying to reduce the severity of climate change. Navigation: Adapting to current conditions Though there have been brave initiatives to reduce the causes of climate change, they have failed to show any major results. This could be due to the increasing demand for energy resources, which is expected to grow immensely globally. It is now necessary to concentrate more on adapting to climate change, as we are in a state where it is almost impossible to undo its effects. Thus, it is better to learn and navigate through this climate change. A startup in Berlin, called ‘GreenAdapt’ has created a software using AI, which can tackle local impacts induced both by gradual changes and changes of extreme weather events such as storms. It identifies effects of climatic changes and proposes adequate adaptation measures. Another startup called ‘Zuli’ has a smartplug that reduces energy use. It contains sensors that can estimate energy usage, wirelessly communicate with your smartphone, and accurately sense your location. A firm called ‘Gridcure’ provides real-time analytics and insights for energy and utilities. It helps power companies recover losses and boost revenue by operating more efficiently. It also helps them provide better delivery to consumers, big reductions in energy waste, and increased adoption of clean technologies. With mitigation and navigation being pursued enough, let’s see how firms are working on futuristic goals. Visualization: Predicting the future It is also equally important to visualize accurate climate models, which will help humans to cope up with the aftereffects of climate change. Climate models are mathematical representations of the Earth's climate system, which takes into account humidity, temperature, air pressure, wind speed and direction, as well as cloud cover and predict future weather conditions. This can help in tackling disasters. It’s also imperative to fervently increase our information on global climate changes which will help to create more accurate models. A startup modeling firm called ‘Jupiter’ is trying to better the accuracy of predictions regarding climate changes. It makes physics-based and Artificial Intelligence-powered decisions using data from millions of ground-based and orbital sensors. Another firm, ‘BioCarbon Engineering’ plans to use drones which will fly over potentially suitable areas and compile 3D maps. Then, it will scatter small containers over the best areas containing fertilized seeds as well as nutrients and moisture gel. In this way, 36,000 trees can be planted every day in a way that is cheaper than other methods. After planting, drones will continue to monitor the germinating seeds and deliver further nutrients when necessary to ensure their healthy growth. This could help to absorb carbon dioxide from the atmosphere. Another initiative is by a ETH doctoral student at the Functional Materials Laboratory, who has developed a cooling curtain made of a porous triple-layer membrane as an alternative to electrically powered air conditioning. In 2017, Microsoft came up with ‘AI for Earth’ initiative, which primarily focuses on climate conservation, biodiversity, etc. AI for Earth awards grants to projects that use artificial intelligence to address critical areas that are vital for building a sustainable future. Microsoft is also using its cloud computing service Azure, to give computing resources to scientists working on environmental sustainability programs. Intel has deployed Artificial Intelligence-equipped Drones in Costa Rica to construct models of the forest terrain and calculate the amount of carbon being stored based on tree height, health, biomass, and other factors. The collected data about carbon capture can enhance management and conservation efforts, support scientific research projects on forest health and sustainability, and enable many other kinds of applications. The ‘Green Horizon Project from IBM’ analyzes environmental data and predicts pollution as well as tests scenarios that involve pollution-reducing tactics. IBM's Deep Thunder’ group works with research centers in Brazil and India to accurately predict flooding and potential mudslides due to the severe storms. As seen above, there are many organizations and companies ranging from startups to big tech who have understood the adverse effects of climate change and are taking steps to address them. However, there are certain challenges/limitations acting as a barrier for these systems to be successful. What do big tech firms and startups lack? Though many big tech and influential companies boast of immense contribution to fighting climate change, there have been instances where these firms get into lucrative deals with oil companies. Just last year, Amazon, Google and Microsoft struck deals with oil companies to provide cloud, automation, and AI services to them. These deals were published openly by Gizmodo and yet didn’t attract much criticism. This trend of powerful companies venturing into oil businesses even after knowing the effects of dangerous climate changes is depressing. Last year, Amazon quietly launched the ‘Amazon Sustainability Data Initiative’.It helps researchers store many weather observations and forecasts, satellite images and metrics about oceans, air quality so that they can be used for modeling and analysis. This encourages organizations to use the data to make decisions which will encourage sustainable development. This year, Amazon has expanded its vision by announcing ‘Shipment Zero’ to make all Amazon shipments with 50% net zero by 2030, with a wider aim to make it 100% in the future. However, Shipment Zero only commits to net carbon reductions. Recently, Amazon ordered 20,000 diesel vans whose emissions will need to be offset with carbon credits. Offsets can entail forest management policies that displace indigenous communities, and they do nothing to reduce diesel pollution which disproportionately harms communities of color. Some in the industry expressed disappointment that Amazon’s order is for 20,000 diesel vans — not a single electric vehicle. In April, Over 4,520 Amazon employees organized against Amazon’s continued profiting from climate devastation. They signed an open letter addressed to Jeff Bezos and Amazon board of directors asking for a company-wide action plan to address climate change and an end to the company’s reliance on dirty energy resources. Recently, Microsoft doubled its internal carbon fee to $15 per metric ton on all carbon emissions. The funds from this higher fee will maintain Microsoft’s carbon neutrality and help meet their sustainability goals. On the other hand, Microsoft is also two years into a seven-year deal—rumored to be worth over a billion dollars—to help Chevron, one of the world’s largest oil companies, better extract and distribute oil. Microsoft Azure has also partnered with Equinor, a multinational energy company to provide data services in a deal worth hundreds of millions of dollars. Instead of gaining profit from these deals, Microsoft could have taken a stand by ending partnerships with these fossil fuel companies which accelerate oil and gas exploration and extraction. With respect to smaller firms, often it is difficult for a climate-focused conservative startup to survive due to the dearth of finance. Many such organizations are small and relatively weak as they struggle to rise in a sector with little apathy and lack of steady financing. Also startups being non-famous, it is difficult for them to market their ideas and convince people to try their systems. They always need a commercial boost to find more takers. Pitfalls of using Artificial Intelligence for climate preservation Though AI has enormous potential to help us create a sustainable future, it is only part of a bigger set of tools and pathways needed to reach the goal. It also comes with its own limitations and side effects. An inability to control malicious AI can cause unexpected outcomes. Hackers can use AI to develop smart malware that interfere with early warnings, enable bad actors to control energy, transportation or other critical systems and could also get them access to sensitive data. This could result in unexpected outcomes at crucial output points for AI systems. AI bias, is another dangerous phenomena, that can give an irrational result to a working system. Bias in an AI system mainly occurs in the data or in the system’s algorithmic model which may produce incorrect results in its functions and security. [dropcap]M[/dropcap]ore importantly, we should not rely on Artificial Intelligence alone to fight the effects of climate change. Our focus should be to work on the causes of climate change and try to minimize it, from an individual level. Even governments in every country must contribute, by initiating “climate policies” which will help its citizens in the long run. One vital task would be to implement quick responses in case of climate emergencies. Like the recent case of Odisha storms, the pinpoint accuracy by the Indian weather association helped to move millions of people to safe spaces, resulting in minimum casualties. Next up in Climate Amazon employees plan to walkout for climate change during the Sept 20th Global Climate Strike Machine learning experts on how we can use machine learning to mitigate and adapt to the changing climate Now there’s a CycleGAN to visualize the effects of climate change. But is this enough to mobilize action?

0
0
46238

article-image-the-cap-theorem-in-practice-the-consistency-vs-availability-trade-off-in-distributed-databases

Richard Gall

12 Sep 2019

7 min read

The CAP Theorem in practice: The consistency vs. availability trade-off in distributed databases

Richard Gall

12 Sep 2019

7 min read

When you choose a database you are making a design decision. One of the best frameworks for understanding what this means in practice is the CAP Theorem. What is the CAP Theorem? The CAP Theorem, developed by computer scientist Eric Brewer in the late nineties, states that databases can only ever fulfil two out of three elements: Consistency - that reads are always up to date, which means any client making a request to the database will get the same view of data. Availability - database requests always receive a response (when valid). Partition tolerance - that a network fault doesn’t prevent messaging between nodes. In the context of distributed (NoSQL) databases, this means there is always going to be a trade-off between consistency and availability. This is because distributed systems are always necessarily partition tolerant (ie. it simply wouldn’t be a distributed database if it wasn’t partition tolerant.) Read next: Different types of NoSQL databases and when to use them How do you use the CAP Theorem when making database decisions? Although the CAP Theorem can feel quite abstract, it has practical, real-world consequences. From both a technical and business perspective the trade-offs will lead you to some very important questions. There are no right answers. Ultimately it will be all about the context in which your database is operating, the needs of the business, and the expectations and needs of users. You will have to consider things like: Is it important to avoid throwing up errors in the client? Or are we willing to sacrifice the visible user experience to ensure consistency? Is consistency an actual important part of the user’s experience Or can we actually do what we want with a relational database and avoid the need for partition tolerance altogether? As you can see, these are ultimately user experience questions. To properly understand those, you need to be sensitive to the overall goals of the project, and, as said above, the context in which your database solution is operating. (Eg. Is it powering an internal analytics dashboard? Or is it supporting a widely used external-facing website or application?) And, as the final bullet point highlights, it’s always worth considering whether the consistency v availability trade-off should matter at all. Avoid the temptation to think a complex database solution will always be better when a simple, more traditional solution will do the job. Of course, it’s important to note that systems that aren’t partition tolerant are a single point of failure in a system. That introduces the potential for unreliability. Prioritizing consistency in a distributed database It’s possible to get into a lot of technical detail when talking about consistency and availability, but at a really fundamental level the principle is straightforward: you need consistency (or what is called a CP database) if the data in the database must always be up to date and aligned, even in the instance of a network failure (eg. the partitioned nodes are unable to communicate with one another for whatever reason). Particular use cases where you would prioritize consistency is when you need multiple clients to have the same view of the data. For example, where you’re dealing with financial information, personal information, using a database that gives you consistency and confidence that data you are looking at is up to date in a situation where the network is unreliable or fails. Examples of CP databases MongoDB Learning MongoDB 4 [Video] MongoDB 4 Quick Start Guide MongoDB, Express, Angular, and Node.js Fundamentals Redis Build Complex Express Sites with Redis and Socket.io [Video] Learning Redis HBase Learn by Example : HBase - The Hadoop Database [Video] HBase Design Patterns Prioritizing availability in a distributed database Availability is essential when data accumulation is a priority. Think here of things like behavioral data or user preferences. In scenarios like these, you will want to capture as much information as possible about what a user or customer is doing, but it isn’t critical that the database is constantly up to date. It simply just needs to be accessible and available even when network connections aren’t working. The growing demand for offline application use is also one reason why you might use a NoSQL database that prioritizes availability over consistency. Examples of AP databases Cassandra Learn Apache Cassandra in Just 2 Hours [Video] Mastering Apache Cassandra 3.x - Third Edition DynamoDB Managed NoSQL Database In The Cloud - Amazon AWS DynamoDB [Video] Hands-On Amazon DynamoDB for Developers [Video] Limitations and criticisms of CAP Theorem It’s worth noting that the CAP Theorem can pose problems. As with most things, in truth, things are a little more complicated. Even Eric Brewer is circumspect about the theorem, especially as what we expect from distributed databases. Back in 2012, twelve years after he first put his theorem into the world, he wrote that: “Although designers still need to choose between consistency and availability when partitions are present, there is an incredible range of flexibility for handling partitions and recovering from them. The modern CAP goal should be to maximize combinations of consistency and availability that make sense for the specific application. Such an approach incorporates plans for operation during a partition and for recovery afterward, thus helping designers think about CAP beyond its historically perceived limitations.” So, this means we must think about the trade-off between consistency and availability as a balancing act, rather than a binary design decision. Elsewhere, there have been more robust criticisms of CAP Theorem. Software engineer Martin Kleppmann, for example, pleaded Please stop calling databases CP or AP in 2015. In a blog post he argues that CAP Theorem only works if you adhere to specific definitions of consistency, availability, and partition tolerance. “If your use of words matches the precise definitions of the proof, then the CAP theorem applies to you," he writes. “But if you’re using some other notion of consistency or availability, you can’t expect the CAP theorem to still apply.” The consequences of this are much like those described in Brewer’s piece from 2012. You need to take a nuanced approach to database trade-offs in which you think them through on your own terms and up against your own needs. The PACELC Theorem One of the developments of this line of argument is an extension to the CAP Theorem: the PACELC Theorem. This moves beyond thinking about consistency and availability and instead places an emphasis on the trade-off between consistency and latency. The PACELC Theorem builds on the CAP Theorem (the ‘PAC’) and adds an else (the ‘E’). What this means is that while you need to choose between availability and consistency if communication between partitions has failed in a distributed system, even if things are running properly and there are no network issues, there is still going to be a trade-off between consistency and latency (the ‘LC’). Conclusion: Learn to align context with technical specs Although the CAP Theorem might seem somewhat outdated, it is valuable in providing a way to think about database architecture design. It not only forces engineers and architects to ask questions about what they want from the technologies they use, but it also forces them to think carefully about the requirements of a given project. What are the business goals? What are user expectations? The PACELC Theorem builds on CAP in an effective way. However, the most important thing about these frameworks is how they help you to think about your problems. Of course the CAP Theorem has limitations. Because it abstracts a problem it is necessarily going to lack nuance. There are going to be things it simplifies. It’s important, as Kleppmann reminds us - to be mindful of these nuances. But at the same time, we shouldn’t let an obsession with nuance and detail allow us to miss the bigger picture.

0
0
60879

article-image-different-types-of-nosql-databases-and-when-to-use-them

Richard Gall

10 Sep 2019

8 min read

Different types of NoSQL databases and when to use them

Richard Gall

10 Sep 2019

8 min read

Why NoSQL databases? The popularity of NoSQL databases over the last decade or so has been driven by an explosion of data. Before what’s commonly described as ‘the big data revolution’, relational databases were the norm - these are databases that contain structured data. Structured data can only be structured if it is based on an existing schema that defines the relationships (hence relational) between the data inside the database. However, with the vast quantities of data that are now available to just about every business with an internet connection, relational databases simply aren’t equipped to handle the complexity and scale of large datasets. Why not SQL databases? This is for a couple of reasons. The defined schemas that are a necessary component of every relational database will not only undermine the richness and integrity of the data you’re working with, relational databases are also hard to scale. Relational databases can only scale vertically, not horizontally. That’s fine to a certain extent, but when you start getting high volumes of data - such as when millions of people use a web application, for example - things get really slow and you need more processing power. You can do this by upgrading your hardware, but that isn’t really sustainable. By scaling out, as you can with NoSQL databases, you can use a distributed network of computers to handle data. That gives you more speed and more flexibility. This isn’t to say that relational and SQL databases have had their day. They still fulfil many use cases. The only difference is that NoSQL can offers a level of far greater power and control for data intensive use cases. Indeed, using a NoSQL database when SQL will do is only going to add more complexity to something that just doesn’t need it. Seven NoSQL Databases in a Week Different types of NoSQL databases and when to use them So, now we’ve looked at why NoSQL databases have grown in popularity in recent years, lets dig into some of the different options available. There are a huge number of NoSQL databases out there - some of them open source, some premium products - many of them built for very different purposes. Broadly speaking there are 4 different models of NoSQL databases: Key-Value pair-based databases Column-based databases Document-oriented databases Graph databases Let’s take a look at these four models, how they’re different from one another, and some examples of the product options in each. Key Value pair-based NoSQL database management systems Key/Value pair based NoSQL databases store data in, as you might expect, pairs of keys and values. Data is stored with a matching key - keys have no relation or structure (so, keys could be height, age, hair color, for example). When should you use a key/value pair-based NoSQL DBMS? Key/value pair based NoSQL databases are the most basic type of NoSQL database. They’re useful for storing fairly basic information, like details about a customer. Which key/value pair-based DBMS should you use? There are a number of different key/value pair databases. The most popular is Redis. Redis is incredibly fast and very flexible in terms of the languages and tools it can be used with. It can be used for a wide variety of purposes - one of the reasons high-profile organizations use it, including Verizon, Atlassian, and Samsung. It’s also open source with enterprise options available for users with significant requirements. Redis 4.x Cookbook Other than Redis, other options include Memcached and Ehcache. As well as those, there are a number of other multi-model options (which will crop up later, no doubt) such as Amazon DynamoDB, Microsoft’s Cosmos DB, and OrientDB. Hands-On Amazon DynamoDB for Developers [Video] RDS PostgreSQL and DynamoDB CRUD: AWS with Python and Boto3 [Video] Column-based NoSQL database management systems Column-based databases separate data into discrete columns. Instead of using rows - whereby the row ID is the main key - column-based database systems flip things around to make the data the main key. By using columns you can gain much greater speed when querying data. Although it’s true that querying a whole row of data would take longer in a column-based DBMS, the use cases for column based databases mean you probably won’t be doing this. Instead you’ll be querying a specific part of the data rather than the whole row. When should you use a column-based NoSQL DBMS? Column-based systems are most appropriate for big data and instances where data is relatively simple and consistent (they don’t particularly handle volatility that well). Which column-based NoSQL DBMS should you use? The most popular column-based DBMS is Cassandra. The software prizes itself on its performance, boasting 100% availability thanks to lacking a single point of failure, and offering impressive scalability at a good price. Cassandra’s popularity speaks for itself - Cassandra is used by 40% of the Fortune 100. Mastering Apache Cassandra 3.x - Third Edition Learn Apache Cassandra in Just 2 Hours [Video] There are other options available, such as HBase and Cosmos DB. HBase High Performance Cookbook Document-oriented NoSQL database management systems Document-oriented NoSQL systems are very similar to key/value pair database management systems. The only difference is that the value that is paired with a key is stored as a document. Each document is self-contained, which means no schema is required - giving a significant degree of flexibility over the data you have. For software developers, this is essential - it’s for this reason that document-oriented databases such as MongoDB and CouchDB are useful components of the full-stack development tool chain. Some search platforms such as ElasticSearch use mechanisms similar to standard document-oriented systems - so they could be considered part of the same family of database management systems. When should you use a document-oriented DBMS? Document-oriented databases can help power many different types of websites and applications - from stores to content systems. However, the flexibility of document-oriented systems means they are not built for complex queries. Which document-oriented DBMS should you use? The leader in this space is, MongoDB. With an amazing 40 million downloads (and apparently 30,000 more every single day), it’s clear that MongoDB is a cornerstone of the NoSQL database revolution. MongoDB 4 Quick Start Guide MongoDB Administrator's Guide MongoDB Cookbook - Second Edition There are other options as well as MongoDB - these include CouchDB, CouchBase, DynamoDB and Cosmos DB. Learning Azure Cosmos DB Guide to NoSQL with Azure Cosmos DB Graph-based NoSQL database management systems The final type of NoSQL database is graph-based. The notable distinction about graph-based NoSQL databases is that they contain the relationships between different data. Subsequently, graph databases look quite different to any of the other databases above - they store data as nodes, with the ‘edges’ of the nodes describing their relationship to other nodes. Graph databases, compared to relational databases, are multidimensional in nature. They display not just basic relationships between tables and data, but more complex and multifaceted ones. When should you use a graph database? Because graph databases contain the relationships between a set of data (customers, products, price etc.) they can be used to build and model networks. This makes graph databases extremely useful for applications ranging from fraud detection to smart homes to search. Which graph database should you use? The world’s most popular graph database is Neo4j. It’s purpose built for data sets that contain strong relationships and connections. Widely used in the industry in companies such as eBay and Walmart, it has established its reputation as one of the world’s best NoSQL database products. Back in 2015 Packt’s Data Scientist demonstrated how he used Neo4j to build a graph application. Read more. Learning Neo4j 3.x [Video] Exploring Graph Algorithms with Neo4j [Video] NoSQL databases are the future - but know when to use the right one for the job Although NoSQL databases will remain a fixture in the engineering world, SQL databases will always be around. This is an important point - when it comes to databases, using the right tool for the job is essential. It’s a valuable exercise to explore a range of options and get to know how they work - sometimes the difference might just be a personal preference about usability. And that’s fine - you need to be productive after all. But what’s ultimately most essential is having a clear sense of what you’re trying to accomplish, and choosing the database based on your fundamental needs.

0
0
63498

article-image-what-can-you-expect-at-neurips-2019

Sugandha Lahoti

06 Sep 2019

5 min read

What can you expect at NeurIPS 2019?

Sugandha Lahoti

06 Sep 2019

5 min read

0
0
26494

article-image-google-is-circumventing-gdpr-reveals-braves-investigation-for-the-authorized-buyers-ad-business-case

Bhagyashree R

06 Sep 2019

6 min read

Google is circumventing GDPR, reveals Brave's investigation for the Authorized Buyers ad business case

Bhagyashree R

06 Sep 2019

6 min read

Last year, Dr. Johnny Ryan, the Chief Policy & Industry Relations Officer at Brave, filed a complaint against Google’s DoubleClick/Authorized Buyers ad business with the Irish Data Protection Commission (DPC). New evidence produced by Brave reveals that Google is circumventing GDPR and also undermining its own data protection measures. Brave calls Google’s Push Pages a GDPR workaround Brave’s new evidence rebuts some of Google’s claims regarding its DoubleClick/Authorized Buyers system, the world’s largest real-time advertising auction house. Google says that it prohibits companies that use its real-time bidding (RTB) ad system “from joining data they receive from the Cookie Matching Service.” In September last year, Google announced that it has removed encrypted cookie IDs and list names from bid requests with buyers in its Authorized Buyers marketplace. Brave’s research, however, found otherwise, “Brave’s new evidence reveals that Google allowed not only one additional party, but many, to match with Google identifiers. The evidence further reveals that Google allowed multiple parties to match their identifiers for the data subject with each other.” When you visit a website that has Google ads embedded on its web pages, Google will run a real-time bidding ad auction to determine which advertiser will get to display its ads. For this, it uses Push Pages, which is the mechanism in question here. Brave hired Zach Edwards, the co-founder of digital analytics startup Victory Medium, and MetaX, a company that audits data supply chains, to investigate and analyze a log of Dr. Ryan’s web browsing. The research revealed that Google's Push Pages can essentially be used as a workaround for user IDs. Google shares a ‘google_push’ identifier with the participating companies to identify a user. Brave says that the problem here is that the identifier that was shared was common to multiple companies. This means that these companies could have cross-referenced what they learned about the user from Google with each other. Used by more than 8.4 million websites, Google's DoubleClick/Authorized Buyers broadcasts personal data of users to 2000+ companies. This data includes the category of what a user is reading, which can reveal their political views, sexual orientation, religious beliefs, as well as their locations. There are also unique ID codes that are specific to a user that can let companies uniquely identify a user. All this information can give these companies a way to keep tabs on what users are “reading, watching, and listening to online.” Brave calls Google’s RTB data protection policies “weak” as they ask these companies to self-regulate. Google does not have much control over what these companies do with the data once broadcast. “Its policy requires only that the thousands of companies that Google shares peoples’ sensitive data with monitor their own compliance, and judge for themselves what they should do,” Brave wrote. A Google spokesperson, as a response to this news, told Forbes, “We do not serve personalised ads or send bid requests to bidders without user consent. The Irish DPC — as Google's lead DPA — and the UK ICO are already looking into real-time bidding in order to assess its compliance with GDPR. We welcome that work and are co-operating in full." Users recommend starting an “information campaign” instead of a penalty that will hardly affect the big tech This news triggered a discussion on Hacker News where users talked about the implications of RTB and what strict actions the EU can take to protect user privacy. A user explained, "So, let's say you're an online retailer, and you have Google IDs for your customers. You probably have some useful and sensitive customer information, like names, emails, addresses, and purchase histories. In order to better target your ads, you could participate in one of these exchanges, so that you can use the information you receive to suggest products that are as relevant as possible to each customer. To participate, you send all this sensitive information, along with a Google ID, and receive similar information from other retailers, online services, video games, banks, credit card providers, insurers, mortgage brokers, service providers, and more! And now you know what sort of vehicles your customers drive, how much they make, whether they're married, how many kids they have, which websites they browse, etc. So useful! And not only do you get all these juicy private details, but you've also shared your customers sensitive purchase history with anyone else who is connected to the exchange." Others said that a penalty is not going to deter Google. "The whole penalty system is quite silly. The fines destroy small companies who are the ones struggling to comply, and do little more than offer extremely gentle pokes on the wrist for megacorps that have relatively unlimited resources available for complete compliance, if they actually wanted to comply." Users suggested that the EU should instead start an information campaign. "EU should ignore the fines this time and start an "information campaign" regarding behavior of Google and others. I bet that hurts Google 10 times more." Some also said that not just Google but the RTB participants should also be held responsible. "Because what Google is doing is not dissimilar to how any other RTB participant is acting, saying this is a Google workaround seems disingenuous." With this case, Brave has launched a full-fledged campaign that aims to “reform the multi-billion dollar RTB industry spans sixteen EU countries.” To achieve this goal it has collaborated with several privacy NGOs and academics including the Open Rights Group, Dr. Michael Veale of the Turing Institute, among others. In other news, a Bloomberg report reveals that Google and other internet companies have recently asked for an amendment to the California Consumer Privacy Act, which will be enacted in 2020. The law currently limits how digital advertising companies collect and make money from user data. The amendments proposed include approval for collecting user data for targeted advertising, using the collected data from websites for their own analysis, and many others. Read the Bloomberg report to know more in detail. Other news in Data Facebook content moderators work in filthy, stressful conditions and experience emotional trauma daily, reports The Verge GDPR complaint in EU claim billions of personal data leaked via online advertising bids European Union fined Google 1.49 billion euros for antitrust violations in online advertising

0
0
44217

How-To Tutorials - Data

TensorFlow 2.0 released with tighter Keras integration, eager execution enabled by default, and more!

Transformers 2.0: NLP library with deep interoperability between TensorFlow 2.0 and PyTorch, and 32+ pretrained models in 100+ languages

Can a modified MIT ‘Hippocratic License’ to restrict misuse of open source software prompt a wave of ethical innovation in tech?

Machine learning ethics: what you need to know and what you can do

Bad Metadata can get you in legal hot water

How to handle categorical data for machine learning algorithms

GitHub acquires Semmle to secure open-source supply chain; attains CVE Numbering Authority status

The best business intelligence tools 2019: when to use them and how much they cost

GQL (Graph Query Language) joins SQL as a Global Standards Project and will be the international standard declarative query language for graphs

4 important business intelligence considerations for the rest of 2019

Trending Topics

How artificial intelligence and machine learning can help us tackle the climate change emergency

The CAP Theorem in practice: The consistency vs. availability trade-off in distributed databases

Different types of NoSQL databases and when to use them

What can you expect at NeurIPS 2019?

Google is circumventing GDPR, reveals Brave's investigation for the Authorized Buyers ad business case

Create a Free Account To Continue Reading

Sign in to activate your 7-day free access