Data | Tech News, Tutorials & Expert Insights

article-image-building-recommendation-system-azure

19 Feb 2016

7 min read

Building A Recommendation System with Azure

19 Feb 2016

Recommender systems are common these days. You may not have noticed, but you might already be a user or receiver of such a system somewhere. Most of the well-performing e-commerce platforms use recommendation systems to recommend items to their users. When you see on the Amazon website that a book is recommended to you based on your earlier preferences, purchases, and browse history, Amazon is actually using such a recommendation system. Similarly, Netflix uses its recommendation system to suggest movies for you. (For more resources related to this topic, see here.) A recommender or recommendation system is used to recommend a product or information often based on user characteristics, preferences, history, and so on. So, a recommendation is always personalized. Until recently, it was not so easy or straightforward to build a recommender, but Azure ML makes it really easy to build one as long as you have your data ready. This article introduces you to the concept of recommendation systems and also the model available in ML Studio for you to build your own recommender system. It then walks you through the process of building a recommendation system with a simple example. The Matchbox recommender Microsoft has developed a large-scale recommender system based on a probabilistic model (Bayesian) called Matchbox. This model can learn about a user's preferences through observations made on how they rate items, such as movies, content, or other products. Based on those observations, it recommends new items to the users when requested. Matchbox uses the available data for each user in the most efficient way possible. The learning algorithm it uses is designed specifically for big data. However, its main feature is that Matchbox takes advantage of metadata available for both users and items. This means that the things it learns about one user or item can be transferred across to other users or items. You can find more information about the Matchbox model at the Microsoft Research project link. Kinds of recommendations The Matchbox recommender supports the building of four kinds of recommenders, which will include most of the scenarios. Let's take a look at the following list: Rating Prediction: This predicts ratings for a given user and item, for example, if a new movie is released, the system will predict what will be your rating for that movie out of 1-5. Item Recommendation: This recommends items to a given user, for example, Amazon suggests you books or YouTube suggests you videos to watch on its home page (especially when you are logged in). Related Users: This finds users that are related to a given user, for example, LinkedIn suggests people that you can get connected to or Facebook suggests friends to you. Related Items: This finds the items related to a given item, for example, a blog site suggests you related posts when you are reading a blog post. Understanding the recommender modules The Matchbox recommender comes with three components; as you might have guessed, a module each to train, score, and evaluate the data. The modules are described as follows. The train Matchbox recommender This module contains the algorithm and generates the trained algorithm, as shown in the following screenshot: This module takes the values for the following two parameters. The number of traits This value decides how many implicit features (traits) the algorithm will learn about that are related to every user and item. The higher this value, the precise it would be as it would lead to better prediction. Typically, it takes a value in the range of 2 to 20. The number of recommendation algorithm iterations It is the number of times the algorithm iterates over the data. The higher this value, the better would the predictions be. Typically, it takes a value in the range of 1 to 10. The score matchbox recommender This module lets you specify the kind of recommendation and corresponding parameters you want: Rating Prediction Item Prediction Related Users Related Items Let's take a look at the following screenshot: The ML Studio help page for the module provides details of all the corresponding parameters. The evaluate recommender This module takes a test and a scored dataset and generates evaluation metrics, as shown in the following screenshot: It also lets you specify the kind of recommendation, such as the score module and corresponding parameters. Building a recommendation system Now, it would be worthwhile that you learn to build one by yourself. We will build a simple recommender system to recommend restaurants to a given user. ML Studio includes three sample datasets, described as follows: Restaurant customer data: This is a set of metadata about customers, including demographics and preferences, for example, latitude, longitude, interest, and personality. Restaurant feature data: This is a set of metadata about restaurants and their features, such as food type, dining style, and location, for example, placeID, latitude, longitude, price. Restaurant ratings: This contains the ratings given by users to restaurants on a scale of 0 to 2. It contains the columns: userID, placeID, and rating. Now, we will build a recommender that will recommend a given number of restaurants to a user (userID). To build a recommender perform the following steps: Create a new experiment. In the Search box in the modules palette, type Restaurant. The preceding three datasets get listed. Drag them all to the canvas one after another. Drag a Split module and connect it to the output port of the Restaurant ratings module. On the properties section to the right, choose Splitting mode as Recommender Split. Leave the other parameters at their default values. Drag a Project Columns module to the canvas and select the columns: userID, latitude, longitude, interest, and personality. Similarly, drag another Project Columns module and connect it to the Restaurant feature data module and select the columns: placeID, latitude, longitude, price, the_geom_meter, and address, zip. Drag a Train Matchbox Recommender module to the canvas and make connections to the three input ports, as shown in the following screenshot: Drag a Score Matchbox Recommender module to the canvas and make connections to the three input ports and set the property's values, as shown in the following screenshot: Run the experiment and when it gets completed, right-click on the output of the Score Matchbox Recommender module and click on Visualize to explore the scored data. You can note the different restaurants (IDs) recommended as items for a user from the test dataset. The next step is to evaluate the scored prediction. Drag the Evaluate Recommender module to the canvas and connect the second output of the Split module to its first input port and connect the output of the Score Matchbox Recommender module to its second input. Leave the module at its default properties. Run the experiment again and when finished, right-click on the output port of the Evaluate Recommender module and click on Visualize to find the evaluation metric. The evaluation metric Normalized Discounted Cumulative Gain (NDCG) is estimated from the ground truth ratings given in the test set. Its value ranges from 0.0 to 1.0, where 1.0 represents the most ideal ranking of the entities. Summary You started with gaining the basic knowledge about a recommender system. You then understood the Matchbox recommender that comes with ML Studio along with its components. You also explored different kinds of recommendations that you can make with it. Finally, you ended up building a simple recommendation system to recommend restaurants to a given user. For more information on Azure, take a look at the following books also by Packt Publishing: Learning Microsoft Azure (https://www.packtpub.com/networking-and-servers/learning-microsoft-azure) Microsoft Windows Azure Development Cookbook (https://www.packtpub.com/application-development/microsoft-windows-azure-development-cookbook) Resources for Article: Further resources on this subject: Introduction to Microsoft Azure Cloud Services[article] Microsoft Azure – Developing Web API for Mobile Apps[article] Security in Microsoft Azure[article]

0
0
19211

article-image-ai-now-institute-publishes-a-report-on-the-diversity-crisis-in-ai-and-offers-12-solutions-to-fix-it

Bhagyashree R

22 Apr 2019

7 min read

AI Now Institute publishes a report on the diversity crisis in AI and offers 12 solutions to fix it

Bhagyashree R

22 Apr 2019

7 min read

Earlier this month, the AI Now Institute published a report, authored by Sarah Myers West, Meredith Whittaker, and Kate Crawford, highlighting the link between the diversity issue in the current AI industry and the discriminating behavior of AI systems. The report further recommends some solutions to these problems that companies and the researchers behind these systems need to adopt to address these issues. Sarah Myers West is a postdoc researcher at the AI Now Institute and an affiliate researcher at the Berkman-Klein Center for Internet and Society. Meredith Whittaker is the co-founder of the AI Now Institute and leads Google's Open Research Group and the Google Measurement Lab. Kate Crawford is a Principal Researcher at Microsoft Research and the co-founder and Director of Research at the AI Now Institute. Kate Crawford tweeted about this study. https://twitter.com/katecrawford/status/1118509988392112128 The AI industry lacks diversity, gender neutrality, and bias-free systems In recent years, we have come across several cases of “discriminating systems”. Facial recognition systems miscategorize black people and sometimes fails to work for trans drivers. When trained in online discourse, chatbots easily learn racist and misogynistic language. This type of behavior by machines is actually a reflection of society. “In most cases, such bias mirrors and replicates existing structures of inequality in the society,” says the report. The study also sheds light on gender bias in the current workforce. According to the report, only 18% of authors at some of the biggest AI conferences are women. On the other side of the spectrum are men who cover 80%. The tech giants, Facebook and Google, have a meager 15% and 10% women as their AI research staff. The situation for black workers in the AI industry looks even worse. While Facebook and Microsoft have 4% of their current workforce as black workers, Google stands at just 2.5%. Also, vast majority of AI studies assume gender is binary, and commonly assigns people as ‘male’ or ‘female’ based on physical appearance and stereotypical assumptions, erasing all other forms of gender identity. The report further reveals that, though there have been various “pipeline studies” to check the flow of diverse job candidates, they have failed to show substantial progress in bringing diversity in the AI industry. “The focus on the pipeline has not addressed deeper issues with workplace cultures, power asymmetries, harassment, exclusionary hiring practices, unfair compensation, and tokenization that are causing people to leave or avoid working in the AI sector altogether,” the report reads. What steps can industries take to address bias and discrimination in AI Systems The report lists 12 recommendations that AI researchers and companies should employ to improve workplace diversity and address bias and discrimination in AI systems. Publish compensation levels, including bonuses and equity, across all roles and job categories, broken down by race and gender. End pay and opportunity inequality, and set pay and benefit equity goals that include contract workers, temps, and vendors. Publish harassment and discrimination transparency reports, including the number of claims over time, the types of claims submitted, and actions taken. Change hiring practices to maximize diversity: include targeted recruitment beyond elite universities, ensure more equitable focus on under-represented groups, and create more pathways for contractors, temps, and vendors to become full-time employees. Commit to transparency around hiring practices, especially regarding how candidates are leveled, compensated, and promoted. Increase the number of people of color, women and other under-represented groups at senior leadership levels of AI companies across all departments. Ensure executive incentive structures are tied to increases in hiring and retention of underrepresented groups. For academic workplaces, ensure greater diversity in all spaces where AI research is conducted, including AI-related departments and conference committees. Remedying bias in AI systems is almost impossible when these systems are opaque. Transparency is essential, and begins with tracking and publicizing where AI systems are used, and for what purpose. Rigorous testing should be required across the lifecycle of AI systems in sensitive domains. Pre-release trials, independent auditing, and ongoing monitoring are necessary to test for bias, discrimination, and other harms. The field of research on bias and fairness needs to go beyond technical debiasing to include a wider social analysis of how AI is used in context. This necessitates including a wider range of disciplinary expertise. The methods for addressing bias and discrimination in AI need to expand to include assessments of whether certain systems should be designed at all, based on a thorough risk assessment. AI-related departments and conference committees. Credits: AI Now Institute Bringing diversity in the AI workforce In order to address the diversity issue in the AI industry, companies need to make changes in the current hiring practices. They should have a more equitable focus on under-represented groups. People of color, women, and other under-represented groups should get fair chance to get into senior leadership level of AI companies across all departments. Further opportunities should be created for contractors, temps, and vendors to become full-time employees. To bridge the gender pay gap in the AI industry, it is important that companies maintain transparency regarding the compensation levels, including bonuses and equity, regardless of gender or race. In the past few years, several cases of sexual misconducts involving some of the biggest companies like Google, Microsoft, have come into light because of movements like #MeToo, Google Walkout, and more. These movements gave the victims and other supporting employees the courage to speak against employees at higher positions who were taking undue advantage of their power. There are cases were the sexual harassment complaints were not taken seriously by the HRs and victims were told to just “get over it”. This is why, companies should publish harassment and discrimination transparency reports that include information like the number and types of claims made and the actions taken by the company. Academic workplaces should ensure diversity in all AI-related departments and conference committees. In the past, some of the biggest AI conferences like Neural Information Processing Systems conference has failed to provide a welcoming and safer environment for women. In a survey conducted last year, many respondents shared that they have experienced sexual harassment. Women reported persistent advances from men at the conference. The organizers of such conferences should ensure an inclusive and welcoming environment for everyone. Addressing bias and discrimination in AI systems To address bias and discrimination in AI systems, the report recommends to do rigorous testing across the lifecycle of these systems. These systems should have pre-release trials, independent auditing, and monitoring to check bias, discrimination, and other harms. Looking at the social implications of AI systems, just addressing the algorithmic bias is not enough. “The field of research on bias and fairness needs to go beyond technical debiasing to include a wider social analysis of how AI is used in context. This necessitates including a wider range of disciplinary expertise,” says the report. While assessing a AI system, researchers and developers should also check whether designing a certain system is required at all, considering the risks it poses. The study calls for re-evaluating the current AI systems used for classifying, detecting, and predicting the race and gender. The idea of identifying a race or gender just by appearance is flawed and can be easily abused. Especially, systems that use physical appearance to find interior states, for instance, those that claim to detect sexuality from headshots. These systems are urgently in need to be checked. To know more in detail, read the full report: Discriminating Systems. Microsoft’s #MeToo reckoning: female employees speak out against workplace harassment and discrimination Desmond U. Patton, Director of SAFElab shares why AI systems should be a product of interdisciplinary research and diverse teams Google’s Chief Diversity Officer, Danielle Brown resigns to join HR tech firm Gusto

0
0
19011

article-image-connecting-data-mongodb-using-pymongo-php

Amey Varangaonkar

16 Mar 2018

6 min read

Connecting your data to MongoDB using PyMongo and PHP

Amey Varangaonkar

16 Mar 2018

6 min read

[box type="note" align="" class="" width=""]The following book excerpt is taken from the title Mastering MongoDB 3.x written by Alex Giamas. This book covers the fundamental as well as advanced tasks related to working with MongoDB.[/box] There are two ways to connect your data to MongoDB. The first is using the driver for your programming language. The second is by using an ODM (Object Document Mapper) layer to map your model objects to MongoDB in a transparent way. In this post, we will cover both methods using two of the most popular languages for web application development: Python, using the official MongoDB low level driver, PyMongo, and PHP, using the official PHP driver for MongoDB Connect using Python Installing PyMongo can be done using pip or easy_install: python -m pip install pymongo python -m easy_install pymongo Then in our class we can connect to a database: >>> from pymongo import MongoClient >>> client = MongoClient() Connecting to a replica set needs a set of seed servers for the client to find out what the primary, secondary, or arbiter nodes in the set are: client = pymongo.MongoClient('mongodb://user:passwd@node1:p1,node2:p2/?replicaSet=rs name') Using the connection string URL we can pass a username/password and replicaSet name all in a single string. Some of the most interesting options for the connection string URL are presented later. Connecting to a shard requires the server host and IP for the mongo router, which is the mongos process. PyMODM ODM Similar to Ruby's Mongoid, PyMODM is an ODM for Python that follows closely on Django's built-in ORM. Installing it can be done via pip: pip install pymodm Then we need to edit settings.py and replace the database engine with a dummy database: DATABASES = { 'default': { 'ENGINE': 'django.db.backends.dummy' } } And add our connection string anywhere in settings.py: from pymodm import connect connect("mongodb://localhost:27017/myDatabase", alias="MyApplication") Here we have to use a connection string that has the following structure: mongodb://[username:password@]host1[:port1][,host2[:port2],...[,hostN[:port N]]][/[database][?options]] Options have to be pairs of name=value with an & between each pair. Some interesting pairs are: Model classes need to inherit from MongoModel. A sample class will look like this: from pymodm import MongoModel, fields class User(MongoModel): email = fields.EmailField(primary_key=True) first_name = fields.CharField() last_name = fields.CharField() This has a User class with first_name, last_name, and email fields where email is the primary field. Inheritance with PyMODM models Handling one-one and one-many relationships in MongoDB can be done using references or embedding. This example shows both ways: references for the model user and embedding for the comment model: from pymodm import EmbeddedMongoModel, MongoModel, fields class Comment(EmbeddedMongoModel): author = fields.ReferenceField(User) content = fields.CharField() class Post(MongoModel): title = fields.CharField() author = fields.ReferenceField(User) revised_on = fields.DateTimeField() content = fields.CharField() comments = fields.EmbeddedDocumentListField(Comment) Connecting with PHP The MongoDB PHP driver was rewritten from scratch two years ago to support the PHP 5, PHP 7, and HHVM architectures. The current architecture is shown in the following diagram: Currently we have official drivers for all three architectures with full support for the underlying functionality. Installation is a two-step process. First we need to install the MongoDB extension. This extension is dependent on the version of PHP (or HHVM) that we have installed and can be done using brew in Mac. For example with PHP 7.0: brew install php70-mongodb Then, using composer (a widely used dependency manager for PHP): composer require mongodb/mongodb Connecting to the database can then be done by using the connection string URI or by passing an array of options. Using the connection string URI we have: $client = new MongoDBClient($uri = 'mongodb://127.0.0.1/', array $uriOptions = [], array $driverOptions = []) For example, to connect to a replica set using SSL authentication: $client = new MongoDBClient('mongodb://myUsername:myPassword@rs1.example.com,rs2.example.com/?ssl=true&replicaSet=myReplicaSet&authSource=admin'); Or we can use the $uriOptions parameter to pass in parameters without using the connection string URL, like this: $client = new MongoDBClient( 'mongodb://rs1.example.com,rs2.example.com/' [ 'username' => 'myUsername', 'password' => 'myPassword', 'ssl' => true, 'replicaSet' => 'myReplicaSet', 'authSource' => 'admin', ], ); The set of $uriOptions and the connection string URL options available are analogous to the ones used for Ruby and Python. Doctrine ODM Laravel is one of the most widely used MVC frameworks for PHP, similar in architecture to Django and Rails from the Python and Ruby worlds respectively. We will follow through configuring our models using a stack of Laravel, Doctrine, and MongoDB. This section assumes that Doctrine is installed and working with Laravel 5.x. Doctrine entities are POPO (Plain Old PHP Objects) that, unlike Eloquent, Laravel's default ORM doesn't need to inherit from the Model class. Doctrine uses the Data Mapper pattern, whereas Eloquent uses Active Record. Skipping the get() set() methods, a simple class would look like: use DoctrineORMMapping AS ORM; use DoctrineCommonCollectionsArrayCollection; /** * @ORMEntity * @ORMTable(name="scientist") */ class Scientist { /** * @ORMId * @ORMGeneratedValue * @ORMColumn(type="integer") */ protected $id; /** * @ORMColumn(type="string") */ protected $firstname; /** * @ORMColumn(type="string") */ protected $lastname; /** * @ORMOneToMany(targetEntity="Theory", mappedBy="scientist", cascade={"persist"}) * @var ArrayCollection|Theory[] */ protected $theories; /** * @param $firstname * @param $lastname */ public function __construct($firstname, $lastname) { $this->firstname = $firstname; $this->lastname = $lastname; $this->theories = new ArrayCollection; } … public function addTheory(Theory $theory) { if(!$this->theories->contains($theory)) { $theory->setScientist($this); $this->theories->add($theory); } } This POPO-based model used annotations to define field types that need to be persisted in MongoDB. For example, @ORMColumn(type="string") defines a field in MongoDB with the string type firstname and lastname as the attribute names, in the respective lines. There is a whole set of annotations available here http://doctrine-orm.readthedocs.io/en/latest/reference/annotations- reference.html . If we want to separate the POPO structure from annotations, we can also define them using YAML or XML instead of inlining them with annotations in our POPO model classes. Inheritance with Doctrine Modeling one-one and one-many relationships can be done via annotations, YAML, or XML. Using annotations, we can define multiple embedded subdocuments within our document: /** @Document */ class User { // … /** @EmbedMany(targetDocument="Phonenumber") */ private $phonenumbers = array(); // … } /** @EmbeddedDocument */ class Phonenumber { // … } Here a User document embeds many PhoneNumbers. @EmbedOne() will embed one subdocument to be used for modeling one-one relationships. Referencing is similar to embedding: /** @Document */ class User { // … /** * @ReferenceMany(targetDocument="Account") */ private $accounts = array(); // … } /** @Document */ class Account { // … } @ReferenceMany() and @ReferenceOne() are used to model one-many and one-one relationships via referencing into a separate collection. We saw that the process of connecting data to MongoDB using Python and PHP is quite similar. We can accordingly define relationships as being embedded or referenced, depending on our design decision. If you found this post useful, check out our book Mastering MongoDB 3.x for more tips and techniques on working with MongoDB efficiently.

0
0
18911

article-image-oracle-12c-sql-plsql-new-features

Oli Huggins

16 Aug 2015

30 min read

Oracle 12c SQL and PL/SQL New Features

Oli Huggins

16 Aug 2015

30 min read

0
0
18800

article-image-how-facebook-is-advancing-artificial-intelligence-video

Richard Gall

14 Sep 2018

4 min read

How Facebook is advancing artificial intelligence [Video]

Richard Gall

14 Sep 2018

4 min read

0
0
18689

article-image-nips-2017-deep-bayesian-bayesian-deep-learning-yee-whye-teh

Savia Lobo

15 Dec 2017

8 min read

NIPS 2017 Special: A deep dive into Deep Bayesian and Bayesian Deep Learning with Yee Whye Teh

Savia Lobo

15 Dec 2017

8 min read

Yee Whye Teh is a professor at the department of Statistics of the University of Oxford and also a research scientist at DeepMind. He works on statistical machine learning, focussing on Bayesian nonparametrics, probabilistic learning, and deep learning. The motive of this article aims to bring our readers to Yee’s keynote speech at the NIPS 2017. Yee’s keynote ponders deeply on the interface between two perspectives on machine learning: Bayesian learning and Deep learning by exploring questions like: How can probabilistic thinking help us understand deep learning methods or lead us to interesting new methods? Conversely, how can deep learning technologies help us develop advanced probabilistic methods? For a more comprehensive and in-depth understanding of this novel approach, be sure to watch the complete keynote address by Yee Whye Teh on NIPS facebook page. All images in this article come from Yee’s presentation slides and do not belong to us. The history of machine learning has shown a growth in both model complexity and in model flexibility. The theory led models have started to lose their shine. This is because machine learning is at the forefront of a revolution that could be called as data led models or the data revolution. As opposed to theory led models, data-led models try not to impose too many assumptions on the processes that have to be modeled and are rather superflexible non-parametric models that can capture the complexities but they require large amount of data to operate. On the model flexibility side, we have various approaches that have been explored over the years. We have kernel methods, Gaussian processes, Bayesian nonparametrics and now we have deep learning as well. The community has also developed evermore complex frameworks both graphical and programmatic to compose large complex models from simpler building blocks. In the 90’s we had graphical models, later we had probabilistic programming systems, followed by deep learning systems like TensorFlow, Theano, and Torch. A recent addition is probabilistic Torch, which brings together ideas from both the probabilistic Bayesian learning and deep learning. On one hand we have Bayesian learning, which deals with learning as inference in some probabilistic models. On the other hand we have deep learning models, which view learning as optimization functions parametrized by neural networks. In recent years there has been an explosion of exciting research at this interface of these two popular approaches resulting in increasingly complex and exciting models. What is Bayesian theory of learning Bayesian learning describes an ideal learner as one who interacts with the world in order to know its state, which is given by θ. He/she makes some observations about the world by deducing a model in Bayesian context. This model is a joint distribution of both the unknown state of the world θ and the observation about the world x. The model consists of prior distribution and marginal distribution, combining which gives a reverse conditional distribution also known as posterior, which describes the totality of the agent's knowledge about the world after he/she sees x. This posterior can also be used for predicting future observations and act accordingly. Issues associated with Bayesian learning Rigidity Learning can be wrong if model is wrong Not all prior knowledge can be encoded as joint distribution Simple analytic forms are limiting for conditional distributions 2. Scalability: Intractable to compute this posterior and approximations have to be made, which then introduces trade offs between efficiency and accuracy. As a result, it is often assumed that Bayesian techniques are not scalable. To address these issues, the speaker highlights some of his recent projects which showcase scenarios where deep learning ideas are applied to Bayesian models (Deep Bayesian learning) or in the reverse applying Bayesian ideas to Neural Networks ( i.e. Bayesian Deep learning) Deep Bayesian learning: Deep learning assists Bayesian learning Deep learning can improve Bayesian learning in the following ways: Improve the modeling flexibility by using neural networks in the construction of Bayesian models Improve the inference and scalability of these methods by parameterizing the posterior way of using neural networks Empathizing inference over multiple runs These can be seen in the following projects showcased by Yee: Concrete VAEs(Variational Autoencoders) FIVO: Filtered Variational Objectives Concrete VAEs What are VAEs? All the qualities mentioned above, i.e. improving modeling flexibility, improving inference and scalability, and empathizing inference over multiple runs by using neural networks can be seen in a class of deep generative models known as VAE (Variational Autoencoders). Fig: Variational Autoencoders VAEs include latent variables that describe the contents of a scene i.e objects, pose. The relationship between these latent variables and the pixels have to be highly complex and nonlinear. So, in short, VAEs are used to parameterize generative and variable posterior distribution that allows for greater scope flexible modeling. The key that makes VAEs work is the reparameterization trick Fig: Adding reparameterization to VAEs The reparameterization trick is crucial to the continuous latent variables in the VAEs. But many models naturally include discrete latent variables. Yee suggests application of the reparameterization on the discrete latent variables as a work around. This brings us to the concept of Concrete VAEs.. CONtinuous relaxation of disCRETE distributions.Also, the density can be further calculated: This concrete distribution is the reparameterization trick for discrete variables which helps in calculating the KL divergence that is needed for variational inference. FIVO: Filtered Variational Objectives FIVO extends VAEs towards models for sequential and time series data. It is built upon another extension of VAEs known as Importance Weighted Autoencoder, a generative model with a similar as that of the VAE, but which uses a strictly tighter log-likelihood lower bound. Variational lower bound: Rederivation from importance sampling: Better to use multiple samples: Using Importance Weighted Autoencoders we can use multiple sampling, with which we can get a tighter lower bound and optimizing this lower bound should lead to better learning. Let’s have a look at the FIVO objectives: We can use any unbiased estimator p(X) of marginal probabilityTightness of bound related to variance of estimatorFor sequential models, we can use particle filters which produce unbiased estimator of marginal probability. They can also have much lower variance than importance samplers. Bayesian Deep learning: Bayesian approach for deep learning gives us counterintuitive and surprising ways to make deep learning scalable. In order to explore the potential of Bayesian learning with deep neural networks, Yee introduced a project named, The posterior server. The Posterior server The posterior server is a distributed server for deep learning. It makes use of the Bayesian approach in order to make neural networks highly scalable. This project focuses on Distributed learning, where both the data and the computations can be spread across the network. The figure above shows that there are a bunch of workers and each communicates with the parameter server, which effectively maintains the authoritative copy of the parameters of the network. At each iteration, each worker obtains the latest copy of the parameter from the server, computes the gradient update based on its data and sends it back to the server which then updates it to the authoritative copy. So, communications on the network tend to be slower than the computations that can be done on the network. Hence, one might consider multiple gradient steps on each iteration before it sends the accumulated update back to the parameter server. The problem is that the parameter and the worker quickly get out of sync with the authoritative copy on the parameter server. As a result, this leads to stale updates which allow noise into the system and we often need frequent synchronizations across the network for the algorithm to learn in a stable fashion. The main idea here in Bayesian context is that we don't just want a single parameter, we want a whole distribution over them. This will then relax the need for frequent synchronizations across the network and hopefully lead to algorithms that are robust to last frequent communication. Each worker is simply going to construct its own tractable approximation to his own likelihood function and send this information to the posterior server which then combines these approximations together to form the full posterior or an approximation of it. Further, the approximations that are constructed would be based on the statistics of some sampling algorithms that happens locally on that worker. The actual algorithm includes a combination of the variational algorithms, Stochastic Gradient EP and the Markov chain Monte Carlo on the workers themselves. So the variational part in the algorithm handles the communication part in the network whereas the MCMC part handles the sampling part that is posterior to construct the statistics that the variational part needs. For scalability, a stochastic gradient Langevin algorithm which is a simple generalization of the SGT, which includes additional injected noise, to sample from posterior noise. To experiment with this server, it was trained densely connected neural networks with 500 reLU units on MNIST dataset. You can have a detailed understanding of these examples in the keynote video. This interface between Bayesian learning and deep learning is a very exciting frontier. Researchers have brought management of uncertainties within deep learning. Also, flexibility and scalability in Bayesian modeling. Yee concludes with two questions for the audience to think about. Does being Bayesian in the space of functions makes more sense than being Bayesian in the sense of parameters? How to deal with uncertainties under model misspecification?

0
0
18628

article-image-thomas-munro-from-enterprisedb-on-parallelism-in-postgresql

Bhagyashree R

17 Dec 2019

7 min read

Thomas Munro from EnterpriseDB on parallelism in PostgreSQL

Bhagyashree R

17 Dec 2019

7 min read

PostgreSQL is a powerful, open-source object-relational database system. Since its introduction, it has been well-received by developers for its reliability, feature robustness, data-integrity, better licensing, and much more. However, one of its limitations has been the lack of support for parallelism, which changed in the subsequent releases. At PostgresOpen 2018, Thomas Munro, a programmer at EnterpriseDB and PostgreSQL contributor talked about how parallelism has evolved in PostgreSQL over the years. In this article, we will see some of the key parallelism-specific features that Munro discussed in his talk. [box type="shadow" align="" class="" width=""] Further Learning This article gives you a glimpse of query parallelism in PostgreSQL. If you want to explore it further along with other concepts like data replication, and database performance, check out our book Mastering PostgreSQL 11 - Second Edition by Hans-Jürgen Schönig. This second edition of Mastering PostgreSQL 11 helps you build dynamic database solutions for enterprise applications using PostgreSQL, which enables database analysts to design both the physical and technical aspects of the system architecture with ease. [/box] Evolution of parallelism in PostgreSQL PostgreSQL uses a process-based architecture instead of a thread-based one. On startup, it launches a “postmaster” process and after that creates a new process for every database session. Previously, it did not support parallelism in a single connection and each query used to run serially. The absence of “intra-query parallelism” in PostgreSQL was a huge limitation for answering the queries faster. Parallelism here means allowing a single process to have multiple threads to query the system and utilize the increasing CPU core counts. The foundation for parallelism in PostgreSQL was laid out in the 9.4 and 9.5 releases. These came with infrastructure updates like dynamic shared memory segments, shared memory queues, and background workers. PostgreSQL 9.6 was actually the first release that came with user-visible features for parallel query execution. It supported executor nodes: gather, parallel sequential scan, partial aggregate, and finalize aggregate. However, this was not enabled by default. Then in 2017, PostgreSQL 10 was released, which had parallelism enabled by default. It had a few more executor nodes including gather merge, parallel index scan, and parallel bitmap heap scan. Last year, PostgreSQL 11 came out with a couple of more executor nodes including parallel append and parallel hash join. It also introduced partition-wise joins and parallel CREATE INDEX. Key parallelism-specific features in PostgreSQL Parallel sequential scans Parallel sequential scans was the very first feature for parallel query execution. Introduced in PostgreSQL 9.6, this scan distributes blocks of a table among different processes. This assignment is done one after the other to ensure that the access to the table remains sequential. The processes that run in parallel and scan the tuples of a table are called parallel workers. There is one special worker called leader, which is responsible for coordinating and collecting the output of the scan from each of the worker. The leader may or may not participate in scanning the database depending on its load in dividing and combining processes. Parallel index scan Parallel index scan is based on the same concept as parallel sequential scan, but it involves more communication and waiting. Currently, the parallel index scans are supported only for B-Tree indexes. In a parallel index scan, index pages are scanned in parallel. Each process will scan a single index block and return all tuples referenced by that block. Meanwhile, other processes will also scan different index blocks and return the tuples. The results of a parallel B-Tree scan are then returned in sorted order. Parallel bitmap heap scan Again, this also has the same concept as the parallel sequential scan. Explaining the difference, Munro said, “You’ve got a big bitmap and you are skipping ahead to the pages that contain interesting tuples.” In parallel bitmap heap scan, one process is chosen as the leader, who performs a scan of one or more indexes and creates bitmap indicating which table blocks need to be visited. These table blocks are then divided among the worker processes as in a parallel sequential scan. Here the heap scan is done in parallel, but the underlying index scan is not. Parallel joins PostgreSQL supports all three join strategies in parallel query plans: nested loop join, hash join, or merge join. However, there is no parallelism supported in the inner loop. The entire loop is scanned as a whole, and the parallelism comes into play when each worker executes the inner loop as a whole. The results of each join are sent to gather node to produce the final results. Nested loop join: The nested loop is the most basic way for PostgreSQL to perform a join. Though it is considered to be slow, it can be efficient if the inner side is an index scan. This is because the outer tuples and hence the loops that loop up values in the index will be divided among worker processes. Merge join: The inner side is executed in full. It can be inefficient when sort needs to be performed because the work and resulting data are duplicated in every cooperating process. Hash join: In this join as well, the inner side is executed in full by every worker process to build identical copies of the hash table. It is inefficient in cases when the hash table is large or the plan is expensive. However, in parallel hash join, the inner side is a parallel hash that divides the work of building a shared hash table over the cooperating processes. This is the only join in which we can have parallelism on both sides. Partition-wise join Partition-wise join is a new feature introduced in PostgreSQL 11. In partition-wise join, the planner knows that both sides of the join have matching partition schemes. Here a join between two similarly partitioned tables are broken down into joins between their matching partitions if there is an equi-join condition between the partition key of joining tables. Munro explains, “It becomes parallelizable with the advent of parallel append, which can then run different branches of that query plan in different processes. But if you do that then granularity of parallelism is partitioned, which is in some ways good and in some ways bad compared to block-based granularity.” He further adds, “It means when the last worker runs out of work to do everyone else has to wait for that before the query is finished. Whereas, if you use block-based parallelism you don’t have the problem but there are some advantages as a result of that as well.” Parallel aggregation in PostgreSQL Calculating aggregates can be very expensive and when evaluated in a single process it could take a considerable amount of time. This problem was solved in PostgreSQL 9.6 with the introduction of parallel aggregation. This is essentially a divide and conquer strategy where multiple workers calculate a part of aggregate before the final value based on these calculations is calculated by the leader. This article walked you through some of the parallelism-specific features in PostgreSQL presented by Munro in his PostgresOpen 2018 talk. If you want to get to grips with other advanced PostgreSQL features and SQL functions, do have a look at our Mastering PostgreSQL 11 - Second Edition book by Hans-Jürgen Schönig. By the end of this book, you will be able to use your database to its utmost capacity by implementing advanced administrative tasks with ease. PostgreSQL committer Stephen Frost shares his vision for PostgreSQL version 12 and beyond Introducing PostgREST, a REST API for any PostgreSQL database written in Haskell Percona announces Percona Distribution for PostgreSQL to support open source databases

0
0
18587

article-image-why-are-experts-worried-about-microsofts-billion-dollar-bet-in-openais-agi-pipe-dream

Sugandha Lahoti

23 Jul 2019

6 min read

Why are experts worried about Microsoft's billion dollar bet in OpenAI's AGI pipe dream?

Sugandha Lahoti

23 Jul 2019

6 min read

0
0
18493

article-image-facebook-research-suggests-chatbots-and-conversational-ai-will-empathize-humans

Fatema Patrawala

06 Aug 2019

6 min read

Facebook research suggests chatbots and conversational AI are on the verge of empathizing with humans

Fatema Patrawala

06 Aug 2019

6 min read

Last week, the Facebook AI research team published a progress report on dialogue research that is fundamentally building more engageable and personalized AI systems. According to the team, “Dialogue research is a crucial component of building the next generation of intelligent agents. While there’s been progress with chatbots in single-domain dialogue, agents today are far from capable of carrying an open-domain conversation across a multitude of topics. Agents that can chat with humans in the way that people talk to each other will be easier and more enjoyable to use in our day-to-day lives — going beyond simple tasks like playing a song or booking an appointment.” In their blog post, they have described new open source data sets, algorithms, and models that improve five common weaknesses of open-domain chatbots today. The weaknesses identified are maintaining consistency, specificity, empathy, knowledgeability, and multimodal understanding. Let us look at each one in detail: Dataset called Dialogue NLI introduced for maintaining consistency Inconsistencies are a common issue for chatbots partly because most models lack explicit long-term memory and semantic understanding. Facebook team in collaboration with their colleagues at NYU, developed a new way of framing consistency of dialogue agents as natural language inference (NLI) and created a new NLI data set called Dialogue NLI, used to improve and evaluate the consistency of dialogue models. The team showcased an example in the Dialogue NLI model, where in they considered two utterances in a dialogue as the premise and hypothesis, respectively. Each pair was labeled to indicate whether the premise entails, contradicts, or is neutral with respect to the hypothesis. Training an NLI model on this data set and using it to rerank the model’s responses to entail previous dialogues — or maintain consistency with them — improved the overall consistency of the dialogue agent. Across these tests they say they saw 3x lesser contradictions in the sentences. Several conversational attributes were studied to balance specificity As per the team, generative dialogue models frequently default to generic, safe responses, like “I don’t know” to some query which needs specific responses. Hence, the Facebook team in collaboration with Stanford’s AI researcher Abigail See, studied how to fix this by controlling several conversational attributes, like the level of specificity. In one experiment, they conditioned a bot on character information and asked “What do you do for a living?” A typical chatbot responds with the generic statement “I’m a construction worker.” With control methods, the chatbots proposed more specific and engaging responses, like “I build antique homes and refurbish houses." In addition to specificity, the team mentioned, “that balancing question-asking and answering and controlling how repetitive our models are make significant differences. The better the overall conversation flow, the more engaging and personable the chatbots and dialogue agents of the future will be.” Chatbot’s ability to display empathy while responding was measured The team worked with researchers from the University of Washington to introduce the first benchmark task of human-written empathetic dialogues centered on specific emotional labels to measure a chatbot’s ability to display empathy. In addition to improving on automatic metrics, the team showed that using this data for both fine-tuning and as retrieval candidates leads to responses that are evaluated by humans as more empathetic, with an average improvement of 0.95 points (on a 1-to-5 scale) across three different retrieval and generative models. The next challenge for the team is that empathy-focused models should perform well in complex dialogue situations, where agents may require balancing empathy with staying on topic or providing information. Wikipedia dataset used to make dialogue models more knowledgeable The research team has improved dialogue models’ capability of demonstrating knowledge by collecting a data set with conversations from Wikipedia, and creating new model architectures that retrieve knowledge, read it, and condition responses on it. This generative model has yielded the most pronounced improvement and it is rated by humans as 26% more engaging than their knowledgeless counterparts. To engage with images, personality based captions were used To engage with humans, agents should not only comprehend dialogue but also understand images. In this research, the team focused on image captioning that is engaging for humans by incorporating personality. They collected a data set of human comments grounded in images, and trained models capable of discussing images with given personalities, which makes the system interesting for humans to talk to. 64% humans preferred these personality-based captions over traditional captions. To build strong models, the team considered both retrieval and generative variants, and leveraged modules from both the vision and language domains. They defined a powerful retrieval architecture, named TransResNet, that works by projecting the image, personality, and caption in the same space using image, personality, and text encoders. The team showed that their system was able to produce captions that are close to matching human performance in terms of engagement and relevance. And annotators preferred their retrieval model’s captions over captions written by people 49.5% of the time. Apart from this, Facebook team has released a new data collection and model evaluation tool, a Messenger-based Chatbot game called Beat the Bot, that allows people to interact directly with bots and other humans in real time, creating rich examples to help train models. To conclude, the Facebook AI team mentions, “Our research has shown that it is possible to train models to improve on some of the most common weaknesses of chatbots today. Over time, we’ll work toward bringing these subtasks together into one unified intelligent agent by narrowing and eventually closing the gap with human performance. In the future, intelligent chatbots will be capable of open-domain dialogue in a way that’s personable, consistent, empathetic, and engaging.” On Hacker News, this research has gained positive and negative reviews. Some of them discuss that if AI will converse like humans, it will do a lot of bad. While other users say that this is an impressive improvement in the field of conversational AI. A user comment reads, “I gotta say, when AI is able to converse like humans, a lot of bad stuff will happen. People are so used to the other conversation partner having self-interest, empathy, being reasonable. When enough bots all have a “swarm” program to move conversations in a particular direction, they will overwhelm any public conversation. Moreover, in individual conversations, you won’t be able to trust anything anyone says or negotiates. Just like playing chess or poker online now. And with deepfakes, you won’t be able to trust audio or video either. The ultimate shock will come when software can render deepfakes in realtime to carry on a conversation, as your friend but not. As a politician who “said crazy stuff” but really didn’t, but it’s in the realm of believability. I would give it about 20 years until it all goes to shit. If you thought fake news was bad, realtime deepfakes and AI conversations with “friends” will be worse. Scroll Snapping and other cool CSS features come to Firefox 68 Google Chrome to simplify URLs by hiding special-case subdomains Lyft releases an autonomous driving dataset “Level 5” and sponsors research competition

0
0
18468

article-image-introduction-clustering-and-unsupervised-learning

Packt

23 Feb 2016

16 min read

Introduction to Clustering and Unsupervised Learning

Packt

23 Feb 2016

16 min read

The act of clustering, or spotting patterns in data, is not much different from spotting patterns in groups of people. In this article, you will learn: The ways clustering tasks differ from the classification tasks How clustering defines a group, and how such groups are identified by k-means, a classic and easy-to-understand clustering algorithm The steps needed to apply clustering to a real-world task of identifying marketing segments among teenage social media users Before jumping into action, we'll begin by taking an in-depth look at exactly what clustering entails. (For more resources related to this topic, see here.) Understanding clustering Clustering is an unsupervised machine learning task that automatically divides the data into clusters, or groups of similar items. It does this without having been told how the groups should look ahead of time. As we may not even know what we're looking for, clustering is used for knowledge discovery rather than prediction. It provides an insight into the natural groupings found within data. Without advance knowledge of what comprises a cluster, how can a computer possibly know where one group ends and another begins? The answer is simple. Clustering is guided by the principle that items inside a cluster should be very similar to each other, but very different from those outside. The definition of similarity might vary across applications, but the basic idea is always the same—group the data so that the related elements are placed together. The resulting clusters can then be used for action. For instance, you might find clustering methods employed in the following applications: Segmenting customers into groups with similar demographics or buying patterns for targeted marketing campaigns Detecting anomalous behavior, such as unauthorized network intrusions, by identifying patterns of use falling outside the known clusters Simplifying extremely large datasets by grouping features with similar values into a smaller number of homogeneous categories Overall, clustering is useful whenever diverse and varied data can be exemplified by a much smaller number of groups. It results in meaningful and actionable data structures that reduce complexity and provide insight into patterns of relationships. Clustering as a machine learning task Clustering is somewhat different from the classification, numeric prediction, and pattern detection tasks we examined so far. In each of these cases, the result is a model that relates features to an outcome or features to other features; conceptually, the model describes the existing patterns within data. In contrast, clustering creates new data. Unlabeled examples are given a cluster label that has been inferred entirely from the relationships within the data. For this reason, you will, sometimes, see the clustering task referred to as unsupervised classification because, in a sense, it classifies unlabeled examples. The catch is that the class labels obtained from an unsupervised classifier are without intrinsic meaning. Clustering will tell you which groups of examples are closely related—for instance, it might return the groups A, B, and C—but it's up to you to apply an actionable and meaningful label. To see how this impacts the clustering task, let's consider a hypothetical example. Suppose you were organizing a conference on the topic of data science. To facilitate professional networking and collaboration, you planned to seat people in groups according to one of three research specialties: computer and/or database science, math and statistics, and machine learning. Unfortunately, after sending out the conference invitations, you realize that you had forgotten to include a survey asking which discipline the attendee would prefer to be seated with. In a stroke of brilliance, you realize that you might be able to infer each scholar's research specialty by examining his or her publication history. To this end, you begin collecting data on the number of articles each attendee published in computer science-related journals and the number of articles published in math or statistics-related journals. Using the data collected for several scholars, you create a scatterplot: As expected, there seems to be a pattern. We might guess that the upper-left corner, which represents people with many computer science publications but few articles on math, could be a cluster of computer scientists. Following this logic, the lower-right corner might be a group of mathematicians. Similarly, the upper-right corner, those with both math and computer science experience, may be machine learning experts. Our groupings were formed visually; we simply identified clusters as closely grouped data points. Yet in spite of the seemingly obvious groupings, we unfortunately have no way to know whether they are truly homogeneous without personally asking each scholar about his/her academic specialty. The labels we applied required us to make qualitative, presumptive judgments about the types of people that would fall into the group. For this reason, you might imagine the cluster labels in uncertain terms, as follows: Rather than defining the group boundaries subjectively, it would be nice to use machine learning to define them objectively. This might provide us with a rule in the form if a scholar has few math publications, then he/she is a computer science expert. Unfortunately, there's a problem with this plan. As we do not have data on the true class value for each point, a supervised learning algorithm would have no ability to learn such a pattern, as it would have no way of knowing what splits would result in homogenous groups. On the other hand, clustering algorithms use a process very similar to what we did by visually inspecting the scatterplot. Using a measure of how closely the examples are related, homogeneous groups can be identified. In the next section, we'll start looking at how clustering algorithms are implemented. This example highlights an interesting application of clustering. If you begin with unlabeled data, you can use clustering to create class labels. From there, you could apply a supervised learner such as decision trees to find the most important predictors of these classes. This is called semi-supervised learning. The k-means clustering algorithm The k-means algorithm is perhaps the most commonly used clustering method. Having been studied for several decades, it serves as the foundation for many more sophisticated clustering techniques. If you understand the simple principles it uses, you will have the knowledge needed to understand nearly any clustering algorithm in use today. Many such methods are listed on the following site, the CRAN Task View for clustering at http://cran.r-project.org/web/views/Cluster.html. As k-means has evolved over time, there are many implementations of the algorithm. One popular approach is described in : Hartigan JA, Wong MA. A k-means clustering algorithm. Applied Statistics. 1979; 28:100-108. Even though clustering methods have advanced since the inception of k-means, this is not to imply that k-means is obsolete. In fact, the method may be more popular now than ever. The following table lists some reasons why k-means is still used widely: Strengths Weaknesses Uses simple principles that can be explained in non-statistical terms Highly flexible, and can be adapted with simple adjustments to address nearly all of its shortcomings Performs well enough under many real-world use cases Not as sophisticated as more modern clustering algorithms Because it uses an element of random chance, it is not guaranteed to find the optimal set of clusters Requires a reasonable guess as to how many clusters naturally exist in the data Not ideal for non-spherical clusters or clusters of widely varying density The k-means algorithm assigns each of the n examples to one of the k clusters, where k is a number that has been determined ahead of time. The goal is to minimize the differences within each cluster and maximize the differences between the clusters. Unless k and n are extremely small, it is not feasible to compute the optimal clusters across all the possible combinations of examples. Instead, the algorithm uses a heuristic process that finds locally optimal solutions. Put simply, this means that it starts with an initial guess for the cluster assignments, and then modifies the assignments slightly to see whether the changes improve the homogeneity within the clusters. We will cover the process in depth shortly, but the algorithm essentially involves two phases. First, it assigns examples to an initial set of k clusters. Then, it updates the assignments by adjusting the cluster boundaries according to the examples that currently fall into the cluster. The process of updating and assigning occurs several times until changes no longer improve the cluster fit. At this point, the process stops and the clusters are finalized. Due to the heuristic nature of k-means, you may end up with somewhat different final results by making only slight changes to the starting conditions. If the results vary dramatically, this could indicate a problem. For instance, the data may not have natural groupings or the value of k has been poorly chosen. With this in mind, it's a good idea to try a cluster analysis more than once to test the robustness of your findings. To see how the process of assigning and updating works in practice, let's revisit the case of the hypothetical data science conference. Though this is a simple example, it will illustrate the basics of how k-means operates under the hood. Using distance to assign and update clusters As with k-NN, k-means treats feature values as coordinates in a multidimensional feature space. For the conference data, there are only two features, so we can represent the feature space as a two-dimensional scatterplot as depicted previously. The k-means algorithm begins by choosing k points in the feature space to serve as the cluster centers. These centers are the catalyst that spurs the remaining examples to fall into place. Often, the points are chosen by selecting k random examples from the training dataset. As we hope to identify three clusters, according to this method, k = 3 points will be selected at random. These points are indicated by the star, triangle, and diamond in the following diagram: It's worth noting that although the three cluster centers in the preceding diagram happen to be widely spaced apart, this is not always necessarily the case. Since they are selected at random, the three centers could have just as easily been three adjacent points. As the k-means algorithm is highly sensitive to the starting position of the cluster centers, this means that random chance may have a substantial impact on the final set of clusters. To address this problem, k-means can be modified to use different methods for choosing the initial centers. For example, one variant chooses random values occurring anywhere in the feature space (rather than only selecting among the values observed in the data). Another option is to skip this step altogether; by randomly assigning each example to a cluster, the algorithm can jump ahead immediately to the update phase. Each of these approaches adds a particular bias to the final set of clusters, which you may be able to use to improve your results. In 2007, an algorithm called k-means++ was introduced, which proposes an alternative method for selecting the initial cluster centers. It purports to be an efficient way to get much closer to the optimal clustering solution while reducing the impact of random chance. For more information, refer to Arthur D, Vassilvitskii S. k-means++: The advantages of careful seeding. Proceedings of the eighteenth annual ACM-SIAM symposium on discrete algorithms. 2007:1027–1035. After choosing the initial cluster centers, the other examples are assigned to the cluster center that is nearest according to the distance function. You will remember that we studied distance functions while learning about k-Nearest Neighbors. Traditionally, k-means uses Euclidean distance, but Manhattan distance or Minkowski distance are also sometimes used. Recall that if n indicates the number of features, the formula for Euclidean distance between example x and example y is: For instance, if we are comparing a guest with five computer science publications and one math publication to a guest with zero computer science papers and two math papers, we could compute this in R as follows: > sqrt((5 - 0)^2 + (1 - 2)^2) [1] 5.09902 Using this distance function, we find the distance between each example and each cluster center. The example is then assigned to the nearest cluster center. Keep in mind that as we are using distance calculations, all the features need to be numeric, and the values should be normalized to a standard range ahead of time. As shown in the following diagram, the three cluster centers partition the examples into three segments labeled Cluster A, Cluster B, and Cluster C. The dashed lines indicate the boundaries for the Voronoi diagram created by the cluster centers. The Voronoi diagram indicates the areas that are closer to one cluster center than any other; the vertex where all the three boundaries meet is the maximal distance from all three cluster centers. Using these boundaries, we can easily see the regions claimed by each of the initial k-means seeds: Now that the initial assignment phase has been completed, the k-means algorithm proceeds to the update phase. The first step of updating the clusters involves shifting the initial centers to a new location, known as the centroid, which is calculated as the average position of the points currently assigned to that cluster. The following diagram illustrates how as the cluster centers shift to the new centroids, the boundaries in the Voronoi diagram also shift and a point that was once in Cluster B (indicated by an arrow) is added to Cluster A: As a result of this reassignment, the k-means algorithm will continue through another update phase. After shifting the cluster centroids, updating the cluster boundaries, and reassigning points into new clusters (as indicated by arrows), the figure looks like this: Because two more points were reassigned, another update must occur, which moves the centroids and updates the cluster boundaries. However, because these changes result in no reassignments, the k-means algorithm stops. The cluster assignments are now final: The final clusters can be reported in one of the two ways. First, you might simply report the cluster assignments such as A, B, or C for each example. Alternatively, you could report the coordinates of the cluster centroids after the final update. Given either reporting method, you are able to define the cluster boundaries by calculating the centroids or assigning each example to its nearest cluster. Choosing the appropriate number of clusters In the introduction to k-means, we learned that the algorithm is sensitive to the randomly-chosen cluster centers. Indeed, if we had selected a different combination of three starting points in the previous example, we may have found clusters that split the data differently from what we had expected. Similarly, k-means is sensitive to the number of clusters; the choice requires a delicate balance. Setting k to be very large will improve the homogeneity of the clusters, and at the same time, it risks overfitting the data. Ideally, you will have a priori knowledge (a prior belief) about the true groupings and you can apply this information to choosing the number of clusters. For instance, if you were clustering movies, you might begin by setting k equal to the number of genres considered for the Academy Awards. In the data science conference seating problem that we worked through previously, k might reflect the number of academic fields of study that were invited. Sometimes the number of clusters is dictated by business requirements or the motivation for the analysis. For example, the number of tables in the meeting hall could dictate how many groups of people should be created from the data science attendee list. Extending this idea to another business case, if the marketing department only has resources to create three distinct advertising campaigns, it might make sense to set k = 3 to assign all the potential customers to one of the three appeals. Without any prior knowledge, one rule of thumb suggests setting k equal to the square root of (n / 2), where n is the number of examples in the dataset. However, this rule of thumb is likely to result in an unwieldy number of clusters for large datasets. Luckily, there are other statistical methods that can assist in finding a suitable k-means cluster set. A technique known as the elbow method attempts to gauge how the homogeneity or heterogeneity within the clusters changes for various values of k. As illustrated in the following diagrams, the homogeneity within clusters is expected to increase as additional clusters are added; similarly, heterogeneity will also continue to decrease with more clusters. As you could continue to see improvements until each example is in its own cluster, the goal is not to maximize homogeneity or minimize heterogeneity, but rather to find k so that there are diminishing returns beyond that point. This value of k is known as the elbow point because it looks like an elbow. There are numerous statistics to measure homogeneity and heterogeneity within the clusters that can be used with the elbow method (the following information box provides a citation for more detail). Still, in practice, it is not always feasible to iteratively test a large number of k values. This is in part because clustering large datasets can be fairly time consuming; clustering the data repeatedly is even worse. Regardless, applications requiring the exact optimal set of clusters are fairly rare. In most clustering applications, it suffices to choose a k value based on convenience rather than strict performance requirements. For a very thorough review of the vast assortment of cluster performance measures, refer to: Halkidi M, Batistakis Y, Vazirgiannis M. On clustering validation techniques. Journal of Intelligent Information Systems. 2001; 17:107-145. The process of setting k itself can sometimes lead to interesting insights. By observing how the characteristics of the clusters change as k is varied, one might infer where the data have naturally defined boundaries. Groups that are more tightly clustered will change a little, while less homogeneous groups will form and disband over time. In general, it may be wise to spend little time worrying about getting k exactly right. The next example will demonstrate how even a tiny bit of subject-matter knowledge borrowed from a Hollywood film can be used to set k such that actionable and interesting clusters are found. As clustering is unsupervised, the task is really about what you make of it; the value is in the insights you take away from the algorithm's findings. Summary This article covered only the fundamentals of clustering. As a very mature machine learning method, there are many variants of the k-means algorithm as well as many other clustering algorithms that bring unique biases and heuristics to the task. Based on the foundation in this article, you will be able to understand and apply other clustering methods to new problems. To learn more about different machine learning techniques, the following books published by Packt Publishing (https://www.packtpub.com/) are recommended: Learning Data Mining with R (https://www.packtpub.com/big-data-and-business-intelligence/learning-data-mining-r) Mastering Scientific Computing with R (https://www.packtpub.com/application-development/mastering-scientific-computing-r) R for Data Science (https://www.packtpub.com/big-data-and-business-intelligence/r-data-science) Resources for Article: Further resources on this subject: Displaying SQL Server Data using a Linq Data Source [article] Probability of R? [article] Working with Commands and Plugins [article]

0
0
18351

article-image-getting-started-with-the-confluent-platform-apache-kafka-for-enterprise

Amarabha Banerjee

27 Feb 2018

9 min read

Getting started with the Confluent Platform: Apache Kafka for enterprise

Amarabha Banerjee

27 Feb 2018

9 min read

0
0
18216

article-image-how-to-prevent-errors-while-using-utilities-for-loading-data-in-teradata

Pravin Dhandre

11 Jun 2018

9 min read

How to prevent errors while using utilities for loading data in Teradata

Pravin Dhandre

11 Jun 2018

9 min read

In today’s tutorial we will assist you to overcome the errors that arise while loading, deleting or updating large volumes of data using Teradata Utilities. [box type="note" align="" class="" width=""]This article is an excerpt from Teradata Cookbook co-authored by Abhinav Khandelwal and Rajsekhar Bhamidipati. This book provides recipes to simplify the daily tasks performed by database administrators (DBA) along with providing efficient data warehousing solutions in Teradata database system.[/box] Resolving FastLoad error 2652 When data is being loaded via FastLoad, a table lock is placed on the target table. This means that the table is unavailable for any other operation. A lock on a table is only released when FastLoad encounters the END LOADING command, which terminates phase 2, the so-called application phase. FastLoad may get terminated in phase 1 due to any of the following reasons: Load script results in failure (error code 8 or 12) Load script is aborted by admin or some other session FastLoad fails due to bad record or file Forgetting to add end loading statement in script If so, it keeps a lock on the table, which needs to be released manually. In this recipe, we will see the steps to release FastLoad locks. Getting ready Identify the table on which FastLoad is been ended prematurely and tables are in locked state. You need to have valid credentials for the Teradata Database. Execute the dummy FastLoad script from the same user or the user which has write access to the lock table. A user requires the following privileges/rights in order to execute the FastLoad: SELECT and INSERT (CREATE and DROP or DELETE) access to the target or loading table CREATE and DROP TABLE on error tables SELECT, INSERT, UPDATE, and DELETE are required privileges for the user PUBLIC on the restart log table (SYSADMIN.FASTLOG). There will be a row in the FASTLOG table for each FastLoad job that has not completed in the system. How to do it... Open a notepad and create the following script: .LOGON 127.0.0.1/dbc, dbc; /* Vaild system name and credentials to your system */ .DATABASE Database_Name; /* database under which locked table is */ erorfiles errortable_name, uv_tablename /* same error table name as in script */ begin loading locked_table; /* table which is getting 2652 error */ .END LOADING; /* to end pahse 2 and release the lock */ .LOGOFF; Save it as dummy_fl.txt. Open the windows Command Prompt and execute this using the FastLoad command, as shown in the following screenshot: This dummy script with no insert statement should release the lock on the target Table. Execute Select on the locked table to see if the lock is released on the table. How it works... As FastLoad is designed to work only on empty tables, it becomes necessary that the loading of the table finishes in one go. If the load script is errored out prematurely in phase 2, without encountering the END loading command, it leaves a lock on loading the table. Fastload locks can't be released via the HUT utility, as there are no technical lock on the table. To execute FastLoad, the following are some requirements: Log table: FastLoad puts its progress information in the fastlog table. EMPTY TABLE: FastLoad needs the table to be empty before inserting rows into that table. TWO ERROR TABLES: FastLoad requires two error tables to be created; you just need to name them, and no ddl is required. The first error table records any translation or constraint violation error, whereas the second error table captures errors related to the duplication of values for Unique Primary Indexes (UPI). After the completion of FastLoad, you can analyze these error tables as to why the records got rejected. There's more... If this does not fix the issue, you need to drop the target table and error tables associated with it. Before proceeding with dropping tables, check with the administrator to abort any FastLoad sessions associated with this table. Resolving MLOAD error 2571 MLOAD works in five phases, unlike FastLoad, which only works in two phases. MLOAD can fail in either phase three or four. Figure shows 5 stages of MLOAD. Preliminary: Basic setup. Syntax checking, establishing session with the Teradata Database, creation of error tables (two error tables per target table), and the creation of work tables and log tables are done in this phase. DML Transaction phase: Request is parse through PE and a step plan is generated. Steps and DML are then sent to AMP and stored in appropriate work tables for each target table. Input data sent will be stored in these work tables, which will be applied to the target table later on. Acquisition phase: Unsorted data is sent to AMP in blocks of 64K. Rows are hashed by PI and sent to appropriate AMPs. Utility places locks on target tables in preparation for the application phase to apply rows in target tables. Application phase: Changes are applied to target tables and NUSI subtables. Lock on table is held in this phase. Cleanup phase: If the error code of all the steps is 0, MLOAD successfully completes and releases all the locks on the specified table. This being the case, all empty error tables, worktables, and the log table are dropped. Getting ready Identify the table which is getting affected by error 2571. Make sure no host utility is running on this table and the load job is in a failed state for this table. How to do it... Check on viewpoint for any active utility job for this table. If you find any active job, let it complete. If there is a reason that you need to release the lock, first abort all the sessions of the host utility from viewpoint. Ask your administrator to do it. Execute the following command: RELEASE MLOAD <databasename.tablename>; > If you get a Not able to release MLOAD Lock error, execute the following Command: /* Release lock in application phase */ RELEASE MLOAD <databasename.tablename> in apply; Once the locks are released you need to drop all the associated error tables, the log table, and work tables with it. Re-execute MLOAD after correcting the error. How it works... The Mload utility places a lock in table headers to alert other utilities that a MultiLoad is in session for this table. They include: Acquisition lock: DML allows all DDL allows DROP only Application lock: DML allows SELECT with ACCESS only DDL allows DROP only There's more... If the release lock statement still gives an error and does not release the lock on the table, you need to use SELECT with the ACCESS lock to copy the content of the locked table to a new one and drop the locked tables. If you start receiving the error 7446 Mload table %ID cannot be released because NUSI exists, you need to drop all the NUSI on the table and use ALTER Table to nonfallback to accomplish the task. Resolving failure 7547 This error is associated with the UPDATE statement, which could be SQL based or could be in MLOAD. Various times, while updating the set of rows in a table, the update fails on Failure 7547 Target row updated by multiple source rows. This error will happen when you update the target with multiple rows from the source. This means there are duplicated values present in the source tables. Getting ready Let's create sample volatile tables and insert values into them. After that, we will execute the UPDATE command, which will fail to result in 7547: Create a TARGET TABLE with the following DDL and insert values into it: ** TARGET TABLE** create volatile table accounts ( CUST_ID, CUST_NAME, Sal )with data primary index(cust_id) insert values (1,'will',2000); insert values (2,'bekky',2800); insert values (3,'himesh',4000); Create a SOURCE TABLE with the following DDL and insert values into it: ** SOURCE TABLE** create volatile table Hr_payhike ( CUST_ID, CUST_NAME, Sal_hike ) with data primary index(cust_id) insert values (1,'will',2030); insert values (1,'bekky',3800); insert values (3,'himesh',7000); Execute the MLOAD script. Following the snippet from the MLOAD script, only update part (which will fail): /* Snippet from MLOAD update */ UPDATE ACC FROM ACCOUNTS ACC , Hr_payhike SUPD SET Sal= TUPD.Sal_hike WHERE Acc.CUST_ID = SUPD.CUST_ID; Failure: Target row updated by multiple source rows How to do it... Check for duplicate values in the source table using the following: /*Check for duplicate values in source table*/ SELECT cust_id,count(*) from Hr_payhike group by 1 order by 2 desc The output will be generated with CUST_ID =1 and has two values which are causing errors. The reason for this is that while updating the TARGET table, the optimizer won't be able to understand from which row it should update the TARGET row. Who's salary will be updated Will or Bekky? To resolve the error, execute the following update query: /* Update part of MLOAD */ UPDATE ACC FROM ACCOUNTS ACC , ( SELECT CUST_ID, CUST_NAME, SAL_HIKE FROM Hr_payhike QUALIFY ROW_NUMBER() OVER (PARTITION BY CUST_ID ORDER BY CUST_NAME,SAL_HIKE DESC)=1) SUPD SET Sal= SUPD.Sal_hike WHERE Acc.CUST_ID = SUPD.CUST_ID; Now, the update will run without error. How it works... Failure will happen when you update the target with multiple rows from the source. If you defined a primary index column for your target, and if those columns are in an update query condition, this error will occur. To further resolve this, you can delete the duplicate from the source table itself and execute the original update without any modification. But if the source data can't be changed, then you need to change the update statement. To summarize, we have successfully learned how to overcome or prevent errors while using utilities for loading data into database. You could also check out the Teradata Cookbook for more than 100 recipes on enterprise data warehousing solutions. 2018 is the year of graph databases. Here’s why. 6 reasons to choose MySQL 8 for designing database solutions Amazon Neptune, AWS’ cloud graph database, is now generally available

0
0
18203

article-image-accountability-and-algorithmic-bias-why-diversity-and-inclusion-matters-neurips-invited-talk

Sugandha Lahoti

08 Dec 2018

4 min read

Accountability and algorithmic bias: Why diversity and inclusion matters [NeurIPS Invited Talk]

Sugandha Lahoti

08 Dec 2018

4 min read

One of the most awaited machine learning conference, NeurIPS 2018 is happening throughout this week in Montreal, Canada. It will feature a series of tutorials, invited talks, product releases, demonstrations, presentations, and announcements related to machine learning research. For the first time, NeurIPS invited a diversity and inclusion (D&I) speaker Laura Gomez to talk about the lack of diversity in the tech industry, which leads to biased algorithms, faulty products, and unethical tech. Laura Gomez is the CEO of Atipica that helps tech companies find and hire diverse candidates. Being a Latina woman herself, she had to face oppression when seeking capital and funds for her startup trying to establish herself in Silicon Valley. This experience led to her realization that there is a strong need to talk about why diversity and inclusion matters. Her efforts were not in vain and recently, she raised $2M in seed funding led by True Ventures. “At Atipica, we think of Inclusive AI in terms of data science, algorithms, and their ethical implications. This way you can rest assure our models are not replicating the biases of humans that hinder diversity while getting patent-pending aggregate demographic insights of your talent pool,” reads the website. She talks about her journey as a Latina woman in the tech industry. She reminisced on how she was the only one like her who got an internship with Hewlett Packard and the fact that she hated it. Nevertheless, she still decided to stay, determined not to let the industry turn her into a victim. She believes she made the right choice going forward with tech; now, years later, diversity is dominating the conversation in the industry. After HP, she also worked at Twitter and YouTube, helping them translate and localize their applications for a global audience. She is also a founding advisor of Project Include, which is a non-profit organization run by women, that uses data and advocacy to accelerate diversity and inclusion solutions in the tech industry. She opened her talk by agreeing to a quote from Safiya Noble, who wrote Algorithms of Oppression. “Artificial Intelligence will become a major human rights issue in the twenty-first century.” She believes we need to talk about difficult questions such as where AI is heading? And where should we hold ourselves and each other accountable.” She urges people to evaluate their role in AI, bias, and inclusion, to find the empathy and value in difficult conversations, and to go beyond your immediate surroundings to consider the broader consequences. It is important to build accountable AI in a way that allows humanity to triumph. She touched upon discriminatory moves by tech giants like Amazon and Google. Amazon recently killed off its AI recruitment tool because it couldn’t stop discriminating against women. She also criticized upon Facebook’s Myanmar operation where Facebook data scientists were building algorithms for hate speech. They didn’t understand the importance of localization or language or actually internationalize their own algorithms to be inclusive towards all the countries. She also talked about algorithmic bias in library discovery systems, as well as how even ‘black robots’ are being impacted by racism. She also condemned Palmer Luckey's work who is helping U.S. immigration agents on the border wall identify Latin refugees. Finally, she urged people to take three major steps to progress towards being inclusive: Be an ally Think of inclusion as an approach, not a feature Work towards an Ethical AI Head over to NeurIPS facebook page for the entire talk and other sessions happening at the conference this week. NeurIPS 2018: Deep learning experts discuss how to build adversarially robust machine learning models NeurIPS 2018 paper: DeepMind researchers explore autoregressive discrete autoencoders (ADAs) to model music in raw audio at scale NeurIPS 2018: A quick look at data visualization for Machine learning by Google PAIR researchers [Tutorial]

0
0
18166

article-image-experts-present-most-pressing-issues-facing-global-lawmakers-on-citizens-privacy-democracy-and-rights-to-freedom-of-speech

Sugandha Lahoti

31 May 2019

17 min read

Experts present most pressing issues facing global lawmakers on citizens’ privacy, democracy and rights to freedom of speech

Sugandha Lahoti

31 May 2019

17 min read

The Canadian Parliament's Standing Committee on Access to Information, Privacy, and Ethics are hosting a hearing on Big Data, Privacy and Democracy from Monday, May 27 to Wednesday, May 29 as a series of discussions with experts, and tech execs over the three days. The committee invited expert witnesses to testify before representatives from 12 countries ( Canada, United Kingdom, Singapore, Ireland, Germany, Chile, Estonia, Mexico, Morocco, Ecuador, St. Lucia, and Costa Rica) on how governments can protect democracy and citizen rights in the age of big data. The committee opened with a round table discussion where expert witnesses spoke about what they believe to be the most pressing issues facing lawmakers when it comes to protecting the rights of citizens in the digital age. Expert witnesses that took part were: Professor Heidi Tworek, University of British Columbia Jason Kint, CEO of Digital Content Next Taylor Owen, McGill University Ben Scott, The Center for Internet and Society, Stanford Law School Roger McNamee, Author of Zucked: Waking up to the Facebook Catastrophe Shoshana Zuboff, Author of The Age of Surveillance Capitalism Maria Ressa, Chief Executive Officer and Executive Editor of Rappler Inc. Jim Balsillie, Chair, Centre for International Governance Innovation The session was led by Bob Zimmer, M.P. and Chair of the Standing Committee on Access to Information, Privacy and Ethics. Other members included Nathaniel Erskine-Smith, and Charlie Angus, M.P. and Vice-Chair of the Standing Committee on Access to Information, Privacy and Ethics. Also present was Damian Collins, M.P. and Chair of the UK Digital, Culture, Media and Sport Committee. Testimonies from the witnesses “Personal data matters more than context”, Jason Kint, CEO of Digital Content Next The presentation started with Mr. Jason Kint, CEO of Digital Content Next, a US based Trade association, who thanked the committee and appreciated the opportunity to speak on behalf of 80 high-quality digital publishers globally. He begins by saying how DCN has prioritized shining a light on issues that erode trust in the digital marketplace, including a troubling data ecosystem that has developed with very few legitimate constraints on the collection and use of data about consumers. As a result personal data is now valued more highly than context, consumer expectations, copyright, and even facts themselves. He believes it is vital that policymakers begin to connect the dots between the three topics of the committee's inquiry, data privacy, platform dominance and, societal impact. He says that today personal data is frequently collected by unknown third parties without consumer knowledge or control. This data is then used to target consumers across the web as cheaply as possible. This dynamic creates incentives for bad actors, particularly on unmanaged platforms, like social media, which rely on user-generated content mostly with no liability. Here the site owners are paid on the click whether it is from an actual person or a bot on trusted information or on disinformation. He says that he is optimistic about regulations like the GDPR in the EU which contain narrow purpose limitations to ensure companies do not use data for secondary uses. He recommends exploring whether large tech platforms that are able to collect data across millions of devices, websites, and apps should even be allowed to use this data for secondary purposes. He also applauds the decision of the German cartel office to limit Facebook's ability to collect and use data across its apps and the web. He further says that issues such as bot fraud, malware, ad blockers, clickbait, privacy violations and now disinformation are just symptoms. The root cause is unbridled data collection at the most personal level. Four years ago DC ended the original financial analysis labeling Google and Facebook the duopoly of digital advertising. In a 150+ billion dollar digital ad market across the North America and the EU, 85 to 90 percent of the incremental growth is going to just these two companies. DNC dug deeper and connected the revenue concentration to the ability of these two companies to collect data in a way that no one else can. This means both companies know much of your browsing history and your location history. The emergence of this duopoly has created a misalignment between those who create the content and those who profit from it. The scandal involving Facebook and Cambridge analytic underscores the current dysfunctional dynamic. With the power Facebook has over our information ecosystem our lives and our democratic systems it is vital to know whether we can trust the company. He also points out that although, there's been a well documented and exhausting trail of apologies, there's been little or no change in the leadership or governance of Facebook. In fact the company has repeatedly refused to have its CEO offer evidence to pressing international government. He believes there should be a deeper probe as there's still much to learn about what happened and how much Facebook knew about the Cambridge Analytica scandal before it became public. Facebook should be required to have an independent audit of its user account practices and its decisions to preserve or purge real and fake accounts over the past decade. He ends his testimony saying that it is critical to shed light on these issues to understand what steps must be taken to improve data protection. This includes providing consumers with greater transparency and choice over their personal data when using practices that go outside of the normal expectations of consumers. Policy makers globally must hold digital platforms accountable for helping to build a healthy marketplace and for restoring consumer trust and restoring competition. “We need a World Trade Organization 2.0 “, Jim Balsillie, Chair, Centre for International Governance Innovation; Retired Chairman and co-CEO of BlackBerry Jim begins by saying that Data governance is the most important public policy issue of our time. It is cross-cutting with economic, social, and security dimension. It requires both national policy frameworks and international coordination. A specific recommendation he brought forward in this hearing was to create a new institution for like-minded nations to address digital cooperation and stability. “The data driven economies effects cannot be contained within national borders”, he said, “we need new or reformed rules of the road for digitally mediated global commerce, a World Trade Organization 2.0”. He gives the example of Financial Stability Board which was created in the aftermath of the 2008 financial crisis to foster global financial cooperation and stability. He recommends forming a similar global institution, for example, digital stability board, to deal with the challenges posed by digital transformation. The nine countries on this committee plus the five other countries attending, totaling 14 could constitute founding members of this board which would undoubtedly grow over time. “Check business models of Silicon Valley giants”, Roger McNamee, Author of Zucked: Waking up to the Facebook Catastrophe Roger begins by saying that it is imperative that this committee and that nations around the world engage in a new thought process relative to the ways of controlling companies in Silicon Valley, especially to look at their business models. By nature these companies invade privacy and undermine democracy. He assures that there is no way to stop that without ending the business practices as they exist. He then commends Sri Lanka who chose to shut down the platforms in response to a terrorist act. He believes that that is the only way governments are going to gain enough leverage in order to have reasonable conversations. He explains more on this in his formal presentation, which took place yesterday. “Stop outsourcing policies to the private sector”, Taylor Owen, McGill University He begins by making five observations about the policy space that we’re in right now. First, self-regulation and even many of the forms of co-regulation that are being discussed have and will continue to prove insufficient for this problem. The financial incentives are simply powerfully aligned against meaningful reform. These are publicly traded largely unregulated companies whose shareholders and directors expect growth by maximizing a revenue model that it is self part of the problem. This growth may or may not be aligned with the public interest. Second, disinformation, hate speech, election interference, privacy breaches, mental health issues and anti-competitive behavior must be treated as symptoms of the problem not its cause. Public policy should therefore focus on the design and the incentives embedded in the design of the platforms themselves. If democratic governments determine that structure and design is leading to negative social and economic outcomes, then it is their responsibility to govern. Third, governments that are taking this problem seriously are converging on a markedly similar platform governance agenda. This agenda recognizes that there are no silver bullets to this broad set of problems and that instead, policies must be domestically implemented and internationally coordinated across three categories: Content policies which seek to address a wide range of both supply and demand issues about the nature amplification and legality of content in our digital public sphere. Data policies which ensure that public data is used for the public good and that citizens have far greater rights over the use, mobility, and monetization of their data. Competition policies which promote free and competitive markets in the digital economy. Fourth, the propensity when discussing this agenda to overcomplicate solutions serves the interests of the status quo. He then recommends sensible policies that could and should be implemented immediately: The online ad micro targeting market could be made radically more transparent and in many cases suspended entirely. Data privacy regimes could be updated to provide far greater rights to individuals and greater oversight and regulatory power to punish abuses. Tax policy can be modernized to better reflect the consumption of digital goods and to crack down on tax base erosion and profit sharing. Modernized competition policy can be used to restrict and rollback acquisitions and a separate platform ownership from application and product development. Civic media can be supported as a public good. Large-scale and long term civic literacy and critical thinking efforts can be funded at scale by national governments, not by private organizations. He then asks difficult policy questions for which there are neither easy solutions, meaningful consensus nor appropriate existing international institutions. How we regulate harmful speech in the digital public sphere? He says, that at the moment we've largely outsourced the application of national laws as well as the interpretation of difficult trade-offs between free speech and personal and public harms to the platforms themselves. Companies who seek solutions rightly in their perspective that can be implemented at scale globally. In this case, he argues that what is possible technically and financially for the companies might be insufficient for the goals of the public good or the public policy goals. What is liable for content online? He says that we’ve clearly moved beyond the notion of platform neutrality and absolute safe harbor but what legal mechanisms are best suited to holding platforms, their design, and those that run them accountable. Also, he asks how are we going to bring opaque artificial intelligence systems into our laws and norms and regulations? He concludes saying that these difficult conversation should not be outsourced to the private sector. They need to be led by democratically accountable governments and their citizens. “Make commitments to public service journalism”, Ben Scott, The Center for Internet and Society, Stanford Law School Ben states that technology doesn't cause the problem of data misinformation, and irregulation. It infact accelerates it. This calls for policies to be made to limit the exploitation of these technology tools by malignant actors and by companies that place profits over the public interest. He says, “we have to view our technology problem through the lens of the social problems that we're experiencing.” This is why the problem of political fragmentation or hate speech tribalism and digital media looks different in each countries. It looks different because it feeds on the social unrest, the cultural conflict, and the illiberalism that is native to each society. He says we need to look at problems holistically and understand that social media companies are a part of a system and they don't stand alone as the super villains. The entire media market has bent itself to the performance metrics of Google and Facebook. Television, radio, and print have tortured their content production and distribution strategies to get likes shares and and to appear higher in the Google News search results. And so, he says, we need a comprehensive public policy agenda and put red lines around the illegal content. To limit data collection and exploitation we need to modernize competition policy to reduce the power of monopolies. He also says, that we need to publicly educate people on how to help themselves and how to stop being exploited. We need to make commitments to public service journalism to provide alternatives for people, alternatives to the mindless stream of clickbait to which we have become accustomed. “Pay attention to the physical infrastructure”, Professor Heidi Tworek, University of British Columbia Taking inspiration from Germany's vibrant interwar media democracy as it descended into an authoritarian Nazi regime, Heidi lists five brief lessons that she thinks can guide policy discussions in the future. These can enable governments to build robust solutions that can make democracies stronger. Disinformation is also an international relations problem Information warfare has been a feature not a bug of the international system for at least a century. So the question is not if information warfare exists but why and when states engage in it. This happens often when a state feels encircled, weak or aspires to become a greater power than it already is. So if many of the causes of disinformation are geopolitical, we need to remember that many of the solutions will be geopolitical and diplomatic as well, she adds. Pay attention to the physical infrastructure Information warfare disinformation is also enabled by physical infrastructure whether it is the submarine cables a century ago or fiber optic cables today. 95 to 99 percent of international data flows through undersea fiber-optic cables. Google partly owns 8.5 percent of those submarine cables. Content providers also own physical infrastructure She says, Russia and China, for example are surveying European and North American cables. China we know as of investing in 5G but combining that with investments in international news networks. Business models matter more than individual pieces of content Individual harmful content pieces go viral because of the few companies that control the bottleneck of information. Only 29% of Americans or Brits understand that their Facebook newsfeed is algorithmically organized. The most aware are the Finns and there are only 39% of them that understand that. That invisibility can provide social media platforms an enormous amount of power that is not neutral. At a very minimum, she says, we need far more transparency about how algorithms work and whether they are discriminatory. Carefully design robust regulatory institutions She urges governments and the committee to democracy-proof whatever solutions, come up with. She says, “we need to make sure that we embed civil society or whatever institutions we create.” She suggests an idea of forming social media councils that could meet regularly to actually deal with many such problems. The exact format and the geographical scope are still up for debate but it's an idea supported by many including the UN Special Rapporteur on freedom of expression and opinion, she adds. Address the societal divisions exploited by social media Heidi says, that the seeds of authoritarianism need fertile soil to grow and if we do not attend to the underlying economic and social discontents, better communications cannot obscure those problems forever. “Misinformation is effect of one shared cause, Surveillance Capitalism”, Shoshana Zuboff, Author of The Age of Surveillance Capitalism Shoshana also agrees with the committee about how the themes of platform accountability, data security and privacy, fake news and misinformation are all effects of one shared cause. She identifies this underlying cause as surveillance capitalism and defines surveillance capitalism as a comprehensive systematic economic logic that is unprecedented. She clarifies that surveillance capitalism is not technology. It is also not a corporation or a group of corporations. This is infact a virus that has infected every economic sector from insurance, retail, publishing, finance all the way through to product and service manufacturing and administration all of these sectors. According to her, Surveillance capitalism cannot also be reduced to a person or a group of persons. Infact surveillance capitalism follows the history of market capitalism in the following way - it takes something that exists outside the marketplace and it brings it into the market dynamic for production and sale. It claims private human experience for the market dynamic. Private human experience is repurposed as free raw material which are rendered as behavioral data. Some of these behavioral data are certainly fed back into product and service improvement but the rest are declared of behavioral surplus identified for their rich predictive value. These behavioral surplus flows are then channeled into the new means of production what we call machine intelligence or artificial intelligence. From these come out prediction products. Surveillance capitalists own and control not one text but two. First is the public facing text which is derived from the data that we have provided to these entities. What comes out of these, the prediction products, is the proprietary text, a shadow text from which these companies have amassed high market capitalization and revenue in a very short period of time. These prediction products are then sold into a new kind of marketplace that trades exclusively in human futures. The first name of this marketplace was called online targeted advertising and the human predictions that were sold in those markets were called click-through rates. By now that these markets are no more confined to that kind of marketplace. This new logic of surveillance capitalism is being applied to anything and everything. She promises to discuss on more of this in further sessions. “If you have no facts then you have no truth. If you have no truth you have no trust”, Maria Ressa, Chief Executive Officer and Executive Editor of Rappler Inc. Maria believes that in the end it comes down to the battle for truth and journalists are on the front line of this along with activists. Information is power and if you can make people believe lies, then you can control them. Information can be used for commercial benefits as well as a means to gain geopolitical power. She says, If you have no facts then you have no truth. If you have no truth you have no trust. She then goes on to introduce a bit about her formal presentation tomorrow saying that she will show exactly how quickly a nation, a democracy can crumble because of information operations. She says she will provide data that shows it is systematic and that it is an erosion of truth and trust. She thanks the committee saying that what is so interesting about these types of discussions is that the countries that are most affected are democracies that are most vulnerable. Bob Zimmer concluded the meeting saying that the agenda today was to get the conversation going and more of how to make our data world a better place will be continued in further sessions. He said, “as we prepare for the next two days of testimony, it was important for us to have this discussion with those who have been studying these issues for years and have seen firsthand the effect digital platforms can have on our everyday lives. The knowledge we have gained tonight will no doubt help guide our committee as we seek solutions and answers to the questions we have on behalf of those we represent. My biggest concerns are for our citizens’ privacy, our democracy and that our rights to freedom of speech are maintained according to our Constitution.” Although, we have covered most of the important conversations, you can watch the full hearing here. Time for data privacy: DuckDuckGo CEO Gabe Weinberg in an interview with Kara Swisher ‘Facial Recognition technology is faulty, racist, biased, abusive to civil rights; act now to restrict misuse’ say experts to House Oversight and Reform Committee. A brief list of drafts bills in US legislation for protecting consumer data privacy

0
0
18054

article-image-tinkering-with-ticks-in-matplotlib-2-0

Sugandha Lahoti

13 Dec 2017

6 min read

Tinkering with ticks in Matplotlib 2.0

Sugandha Lahoti

13 Dec 2017

6 min read

[box type="note" align="" class="" width=""]This is an excerpt from the book titled Matplotlib 2.x By Example written by Allen Chi Shing Yu, Claire Yik Lok Chung, and Aldrin Kay Yuen Yim,. The book covers basic know-how on how to create and customize plots by Matplotlib. It will help you learn to visualize geographical data on maps and implement interactive charts. [/box] The article talks about how you can manipulate ticks in Matplotlib 2.0. It includes steps to adjust tick spacing, customizing tick formats, trying out the ticker locator and formatter, and rotating tick labels. What are Ticks Ticks are dividers on an axis that help readers locate the coordinates. Tick labels allow estimation of values or, sometimes, labeling of a data series as in bar charts and box plots. Adjusting tick spacing Tick spacing can be adjusted by calling the locator methods: ax.xaxis.set_major_locator(xmajorLocator) ax.xaxis.set_minor_locator(xminorLocator) ax.yaxis.set_major_locator(ymajorLocator) ax.yaxis.set_minor_locator(yminorLocator) Here, ax refers to axes in a Matplotlib figure. Since set_major_locator() or set_minor_locator() cannot be called from the pyplot interface but requires an axis, we call pyplot.gca() to get the current axes. We can also store a figure and axes as variables at initiation, which is especially useful when we want multiple axes. Removing ticks NullLocator: No ticks Drawing ticks in multiples Spacing ticks in multiples of a given number is the most intuitive way. This can be done by using MultipleLocator space ticks in multiples of a given value. Automatic tick settings MaxNLocator: This finds the maximum number of ticks that will display nicely AutoLocator: MaxNLocator with simple defaults AutoMinorLocator: Adds minor ticks uniformly when the axis is linear Setting ticks by the number of data points IndexLocator: Sets ticks by index (x = range(len(y)) Set scaling of ticks by mathematical functions LinearLocator: Linear scale LogLocator: Log scale SymmetricalLogLocator: Symmetrical log scale, log with a range of linearity LogitLocator: Logit scaling Locating ticks by datetime There is a series of locators dedicated to displaying date and time: MinuteLocator: Locate minutes HourLocator: Locate hours DayLocator: Locate days of the month WeekdayLocator: Locate days of the week MonthLocator: Locate months, for example, 8 for August YearLocator: Locate years that in multiples RRuleLocator: Locate using matplotlib.dates.rrulewrapper The rrulewrapper is a simple wrapper around a dateutil.rrule (dateutil) that allows almost arbitrary date tick specifications AutoDateLocator: On autoscale, this class picks the best MultipleDateLocator to set the view limits and the tick locations Customizing tick formats Tick formatters control the style of tick labels. They can be called to set the major and minor tick formats on the x and y axes as follows: ax.xaxis.set_major_formatter( xmajorFormatter ) ax.xaxis.set_minor_formatter( xminorFormatter ) ax.yaxis.set_major_formatter( ymajorFormatter ) ax.yaxis.set_minor_formatter( yminorFormatter ) Removing tick labels NullFormatter: No tick labels Fixing labels FixedFormatter: Labels are set manually Setting labels with strings IndexFormatter: Take labels from a list of strings StrMethodFormatter: Use the string format method Setting labels with user-deﬁned functions FuncFormatter: Labels are set by a user-defined function Formatting axes by numerical values ScalarFormatter: The format string is automatically selected for scalars by default The following formatters set values for log axes: LogFormatter: Basic log axis LogFormatterExponent: Log axis using exponent = log_base(value) LogFormatterMathtext: Log axis using exponent = log_base(value) using Math text LogFormatterSciNotation: Log axis with scientific notation LogitFormatter: Probability formatter Trying out the ticker locator and formatter To demonstrate the ticker locator and formatter, here we use Netflix subscriber data as an example. Business performance is often measured seasonally. Television shows are even more "seasonal". Can we better show it in the timeline? import numpy as np import matplotlib.pyplot as plt import matplotlib.ticker as ticker """ Number for Netflix streaming subscribers from 2012-2017 Data were obtained from Statista on https://www.statista.com/statistics/250934/quarterly-number-of-netflix-stre aming-subscribers-worldwide/ on May 10, 2017. The data were originally published by Netflix in April 2017. """ # Prepare the data set x = range(2011,2018) y = [26.48,27.56,29.41,33.27,36.32,37.55,40.28,44.35, 48.36,50.05,53.06,57.39,62.27,65.55,69.17,74.76,81.5, 83.18,86.74,93.8,98.75] # quarterly subscriber count in millions # Plot lines with different line styles plt.plot(y,'^',label = 'Netflix subscribers',ls='-') # get current axes and store it to ax ax = plt.gca() # set ticks in multiples for both labels ax.xaxis.set_major_locator(ticker.MultipleLocator(4)) # set major marks # every 4 quarters, ie once a year ax.xaxis.set_minor_locator(ticker.MultipleLocator(1)) # set minor marks # for each quarter ax.yaxis.set_major_locator(ticker.MultipleLocator(10)) # ax.yaxis.set_minor_locator(ticker.MultipleLocator(2)) # label the start of each year by FixedFormatter ax.get_xaxis().set_major_formatter(ticker.FixedFormatter(x)) plt.legend() plt.show() From this plot, we see that Netflix has a pretty linear growth of subscribers from the year 2012 to 2017. We can tell the seasonal growth better after formatting the x axis in a quarterly manner. In 2016, Netflix was doing better in the latter half of the year. Any TV shows you watched in each season? Rotating tick labels A figure can get too crowded or some tick labels may get skipped when we have too many tick labels or if the label strings are too long. We can solve this by rotating the ticks, for example, by pyplot.xticks(rotation=60): import matplotlib.pyplot as plt import numpy as np import matplotlib as mpl mpl.style.use('seaborn') techs = ['Google Adsense','DoubleClick.Net','Facebook Custom Audiences','Google Publisher Tag', 'App Nexus'] y_pos = np.arange(len(techs)) # Number of websites using the advertising technologies # Data were quoted from builtwith.com on May 8th 2017 websites = [14409195,1821385,948344,176310,283766] plt.bar(y_pos, websites, align='center', alpha=0.5) # set x-axis tick rotation plt.xticks(y_pos, techs, rotation=25) plt.ylabel('Live site count') plt.title('Online advertising technologies usage') plt.show() Use pyplot.tight_layout() to avoid image clipping. Using rotated labels can sometimes result in image clipping, as follows, if you save the figure by pyplot.savefig(). You can call pyplot.tight_layout() before pyplot.savefig() to ensure a complete image output. We saw how ticks can be adjusted, customized, rotated and formatted in Matplotlib 2.0 for easy readability, labelling and estimation of values. To become well-versed with Matplotlib for your day to day work, check out this book Matplotlib 2.x By Example.

0
0
18047

How-To Tutorials - Data

Building A Recommendation System with Azure

AI Now Institute publishes a report on the diversity crisis in AI and offers 12 solutions to fix it

Connecting your data to MongoDB using PyMongo and PHP

Oracle 12c SQL and PL/SQL New Features

How Facebook is advancing artificial intelligence [Video]

NIPS 2017 Special: A deep dive into Deep Bayesian and Bayesian Deep Learning with Yee Whye Teh

Thomas Munro from EnterpriseDB on parallelism in PostgreSQL

Why are experts worried about Microsoft's billion dollar bet in OpenAI's AGI pipe dream?

Facebook research suggests chatbots and conversational AI are on the verge of empathizing with humans

Introduction to Clustering and Unsupervised Learning

Trending Topics

Getting started with the Confluent Platform: Apache Kafka for enterprise

How to prevent errors while using utilities for loading data in Teradata

Accountability and algorithmic bias: Why diversity and inclusion matters [NeurIPS Invited Talk]

Experts present most pressing issues facing global lawmakers on citizens’ privacy, democracy and rights to freedom of speech

Tinkering with ticks in Matplotlib 2.0

Create a Free Account To Continue Reading

Sign in to activate your 7-day free access