
How-To Tutorials - Data

1210 Articles

NIPS 2017 Special: A deep dive into Deep Bayesian and Bayesian Deep Learning with Yee Whye Teh

Savia Lobo
15 Dec 2017
8 min read
Yee Whye Teh is a professor in the Department of Statistics at the University of Oxford and a research scientist at DeepMind. He works on statistical machine learning, focusing on Bayesian nonparametrics, probabilistic learning, and deep learning. This article walks through Yee's keynote speech at NIPS 2017. The keynote explores the interface between two perspectives on machine learning, Bayesian learning and deep learning, through questions such as: How can probabilistic thinking help us understand deep learning methods or lead us to interesting new methods? Conversely, how can deep learning technologies help us develop advanced probabilistic methods? For a more comprehensive understanding of this approach, be sure to watch the complete keynote address by Yee Whye Teh on the NIPS Facebook page. All images in this article come from Yee's presentation slides and do not belong to us.

The history of machine learning has shown growth in both model complexity and model flexibility. Theory-led models have started to lose their shine, because machine learning is at the forefront of what could be called the data revolution: a shift toward data-led models. As opposed to theory-led models, data-led models try not to impose too many assumptions on the processes being modeled; they are highly flexible, often nonparametric, models that can capture complex structure, but they require large amounts of data to work well.

On the model flexibility side, various approaches have been explored over the years: kernel methods, Gaussian processes, Bayesian nonparametrics, and now deep learning. The community has also developed ever more complex frameworks, both graphical and programmatic, to compose large models from simpler building blocks. In the 90s we had graphical models, later probabilistic programming systems, followed by deep learning systems such as TensorFlow, Theano, and Torch. A recent addition is Probabilistic Torch, which brings together ideas from both Bayesian learning and deep learning. On one hand, Bayesian learning treats learning as inference in a probabilistic model. On the other hand, deep learning views learning as optimization of functions parametrized by neural networks. In recent years there has been an explosion of exciting research at the interface of these two popular approaches, resulting in increasingly complex and capable models.

What is the Bayesian theory of learning?

Bayesian learning describes an ideal learner who interacts with the world in order to learn its state, denoted θ. The learner makes observations x about the world and, in the Bayesian framing, writes down a model: a joint distribution over both the unknown state of the world θ and the observations x. The model combines a prior distribution over θ with a likelihood for x given θ; applying Bayes' rule yields the reverse conditional distribution, known as the posterior, which describes the totality of the agent's knowledge about the world after seeing x. The posterior can also be used to predict future observations and to act accordingly.
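To make the prior-to-posterior update concrete, here is a minimal sketch in Python of Bayesian updating for the simplest possible model, a beta-binomial coin-flip model; the prior parameters and observation counts are arbitrary placeholders, not values from the talk.

```python
# Minimal Bayesian updating sketch: a beta-binomial model.
# Prior Beta(a, b) over the unknown coin bias theta; observe heads/tails.
def posterior(a_prior: float, b_prior: float, heads: int, tails: int):
    """Return the parameters of the Beta posterior after observing the data."""
    return a_prior + heads, b_prior + tails

a_post, b_post = posterior(1.0, 1.0, heads=7, tails=3)   # uniform prior + 10 flips
predictive_heads = a_post / (a_post + b_post)            # P(next flip is heads | data)
print(a_post, b_post, predictive_heads)                  # 8.0 4.0 0.666...
```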
Issues associated with Bayesian learning

1. Rigidity:
   - Learning can be wrong if the model is wrong
   - Not all prior knowledge can be encoded as a joint distribution
   - Simple analytic forms are limiting for conditional distributions
2. Scalability: Computing the posterior exactly is intractable, so approximations have to be made, which introduces trade-offs between efficiency and accuracy. As a result, it is often assumed that Bayesian techniques are not scalable.

To address these issues, the speaker highlights some of his recent projects that showcase scenarios where deep learning ideas are applied to Bayesian models (Deep Bayesian learning) or, in reverse, Bayesian ideas are applied to neural networks (Bayesian deep learning).

Deep Bayesian learning: deep learning assists Bayesian learning

Deep learning can improve Bayesian learning in the following ways:

- Improve modeling flexibility by using neural networks in the construction of Bayesian models
- Improve the inference and scalability of these methods by parameterizing the posterior using neural networks
- Amortize inference over multiple runs

These can be seen in the following projects showcased by Yee: Concrete VAEs (Variational Autoencoders) and FIVO (Filtered Variational Objectives).

Concrete VAEs

What are VAEs? All the qualities mentioned above (improving modeling flexibility, improving inference and scalability, and amortizing inference over multiple runs with neural networks) can be seen in a class of deep generative models known as VAEs (Variational Autoencoders).

Fig: Variational Autoencoders

VAEs include latent variables that describe the contents of a scene, such as objects and pose. The relationship between these latent variables and the pixels has to be highly complex and nonlinear. In short, VAEs use neural networks to parameterize both the generative model and the variational posterior distribution, which allows much more flexible modeling. The key that makes VAEs work is the reparameterization trick.

Fig: Adding reparameterization to VAEs

The reparameterization trick applies to the continuous latent variables in VAEs, but many models naturally include discrete latent variables. As a workaround, Yee suggests applying a reparameterization to the discrete latent variables as well. This brings us to the concept of Concrete VAEs: a CONtinuous relaxation of disCRETE distributions. The density of this concrete distribution can also be calculated, and it provides a reparameterization trick for discrete variables, which helps in calculating the KL divergence needed for variational inference.

FIVO: Filtered Variational Objectives

FIVO extends VAEs toward models for sequential and time-series data. It is built on another extension of VAEs, the Importance Weighted Autoencoder (IWAE), a generative model with an architecture similar to that of the VAE but which optimizes a strictly tighter log-likelihood lower bound. The slides walk through the variational lower bound, its rederivation from importance sampling, and why it is better to use multiple samples: with importance-weighted sampling we get a tighter lower bound, and optimizing this tighter bound should lead to better learning. The FIVO objectives take this further:

- We can use any unbiased estimator of the marginal probability p(X)
- The tightness of the bound is related to the variance of the estimator
- For sequential models, we can use particle filters, which produce an unbiased estimator of the marginal probability and can have much lower variance than importance samplers
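As a rough illustration of what a continuous relaxation of a discrete distribution looks like, the following numpy sketch draws a relaxed one-hot sample using the Gumbel-Softmax construction that underlies Concrete distributions; the class probabilities and temperature are placeholders, and this is a generic sketch rather than the exact parameterization from the keynote. As the temperature approaches zero, the sample approaches a true one-hot (discrete) sample, which is what makes the reparameterization trick applicable.

```python
import numpy as np

def concrete_sample(logits: np.ndarray, temperature: float, rng=np.random.default_rng()):
    """Draw a relaxed one-hot sample from a Concrete (Gumbel-Softmax) distribution."""
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0, 1) noise
    y = (logits + gumbel) / temperature
    return np.exp(y - np.logaddexp.reduce(y))                  # numerically stable softmax

probs = concrete_sample(np.log([0.7, 0.2, 0.1]), temperature=0.5)
print(probs, probs.sum())  # close to one-hot as temperature -> 0; always sums to 1
```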
Bayesian deep learning

A Bayesian approach to deep learning gives us counterintuitive and surprising ways to make deep learning scalable. To explore the potential of Bayesian learning with deep neural networks, Yee introduced a project called the Posterior Server.

The Posterior Server

The Posterior Server is a distributed server for deep learning. It uses the Bayesian approach to make training neural networks highly scalable. The project focuses on distributed learning, where both the data and the computation are spread across a network. In the standard setup, a number of workers each communicate with a parameter server, which maintains the authoritative copy of the network's parameters. At each iteration, a worker obtains the latest copy of the parameters from the server, computes a gradient update based on its data, and sends it back to the server, which applies it to the authoritative copy. Communication over the network tends to be slower than the computation done on each worker, so one might consider taking multiple gradient steps per iteration before sending the accumulated update back to the parameter server. The problem is that the worker's parameters quickly get out of sync with the authoritative copy on the server. This leads to stale updates, which introduce noise into the system, and frequent synchronizations across the network are then needed for the algorithm to learn in a stable fashion.

The main idea, in a Bayesian context, is that we don't want just a single parameter value; we want a whole distribution over parameters. This relaxes the need for frequent synchronization across the network and hopefully leads to algorithms that are robust to less frequent communication. Each worker constructs its own tractable approximation to its local likelihood function and sends this to the posterior server, which combines these approximations to form the full posterior, or an approximation of it. The approximations are built from the statistics of a sampling algorithm that runs locally on each worker. The actual algorithm combines variational methods (stochastic gradient EP) with Markov chain Monte Carlo on the workers themselves: the variational part handles communication across the network, while the MCMC part samples from the local posterior to construct the statistics that the variational part needs. For scalability, the sampler is a stochastic gradient Langevin algorithm, a simple generalization of SGD that injects additional noise so that the iterates sample from the posterior. To experiment with the server, densely connected neural networks with 500 ReLU units were trained on the MNIST dataset. You can get a detailed understanding of these experiments in the keynote video.

The interface between Bayesian learning and deep learning is a very exciting frontier: researchers have brought the management of uncertainty into deep learning, and flexibility and scalability into Bayesian modeling. Yee concludes with two questions for the audience to think about. Does being Bayesian in the space of functions make more sense than being Bayesian in the space of parameters? And how do we deal with uncertainty under model misspecification?
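For a feel of the sampler mentioned above, here is a minimal sketch of a stochastic gradient Langevin dynamics (SGLD) update applied to a toy Gaussian target; the step size, iteration count, and target are placeholders, and a real implementation would plug in minibatch gradients of a neural network's log-posterior.

```python
import numpy as np

def sgld_step(theta, grad_log_post, step_size, rng=np.random.default_rng()):
    """One SGLD update: a gradient step on the (minibatch) log-posterior plus
    Gaussian noise, so the iterates approximately sample from the posterior."""
    noise = rng.normal(scale=np.sqrt(step_size), size=np.shape(theta))
    return theta + 0.5 * step_size * grad_log_post(theta) + noise

# Toy example: sample from a standard normal "posterior", grad log p(theta) = -theta.
theta = np.zeros(2)
samples = []
for _ in range(5000):
    theta = sgld_step(theta, lambda t: -t, step_size=0.01)
    samples.append(theta)
print(np.mean(samples, axis=0), np.std(samples, axis=0))  # roughly [0, 0] and [1, 1]
```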


Thomas Munro from EnterpriseDB on parallelism in PostgreSQL

Bhagyashree R
17 Dec 2019
7 min read
PostgreSQL is a powerful, open source object-relational database system. Since its introduction, it has been well received by developers for its reliability, feature robustness, data integrity, permissive licensing, and much more. However, one of its limitations used to be the lack of support for parallelism, and that has changed over subsequent releases. At PostgresOpen 2018, Thomas Munro, a programmer at EnterpriseDB and a PostgreSQL contributor, talked about how parallelism has evolved in PostgreSQL over the years. In this article, we will look at some of the key parallelism-specific features that Munro discussed in his talk.

Further Learning: This article gives you a glimpse of query parallelism in PostgreSQL. If you want to explore it further, along with other concepts like data replication and database performance, check out our book Mastering PostgreSQL 11 - Second Edition by Hans-Jürgen Schönig. This second edition helps you build dynamic database solutions for enterprise applications using PostgreSQL, enabling database analysts to design both the physical and technical aspects of the system architecture with ease.

Evolution of parallelism in PostgreSQL

PostgreSQL uses a process-based architecture instead of a thread-based one. On startup, it launches a "postmaster" process and then creates a new process for every database session. Previously, it did not support parallelism within a single connection, so each query ran serially. The absence of intra-query parallelism in PostgreSQL was a huge limitation for answering queries faster; intra-query parallelism means letting a single query use multiple workers to take advantage of increasing CPU core counts. The foundation for parallelism in PostgreSQL was laid in the 9.4 and 9.5 releases, which came with infrastructure updates like dynamic shared memory segments, shared memory queues, and background workers. PostgreSQL 9.6 was the first release with user-visible features for parallel query execution. It supported the executor nodes gather, parallel sequential scan, partial aggregate, and finalize aggregate; however, this was not enabled by default. Then, in 2017, PostgreSQL 10 was released with parallelism enabled by default and a few more executor nodes, including gather merge, parallel index scan, and parallel bitmap heap scan. Last year, PostgreSQL 11 came out with a couple more executor nodes, including parallel append and parallel hash join. It also introduced partition-wise joins and parallel CREATE INDEX.

Key parallelism-specific features in PostgreSQL

Parallel sequential scans

Parallel sequential scan was the very first feature for parallel query execution. Introduced in PostgreSQL 9.6, this scan distributes blocks of a table among different processes. Blocks are assigned one after the other to ensure that access to the table remains sequential. The processes that run in parallel and scan the tuples of a table are called parallel workers. One special worker, called the leader, is responsible for coordinating and collecting the output of the scan from each worker. The leader may or may not participate in scanning the table, depending on how busy it is dividing the work and combining the results.
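As a hypothetical illustration of how you might observe a parallel sequential scan from application code, the sketch below sets max_parallel_workers_per_gather and prints the plan of an aggregate query with EXPLAIN; the table name, DSN environment variable, and worker count are placeholders, and this example is not taken from Munro's talk.

```python
# Hypothetical illustration: inspecting a parallel query plan with psycopg2.
# Assumes a PostgreSQL >= 9.6 server, a large table named "measurements",
# and a connection string in the PG_DSN environment variable.
import os
import psycopg2

conn = psycopg2.connect(os.environ["PG_DSN"])
with conn, conn.cursor() as cur:
    cur.execute("SET max_parallel_workers_per_gather = 4;")
    cur.execute("EXPLAIN SELECT count(*) FROM measurements;")
    for (line,) in cur.fetchall():
        print(line)   # expect Gather, Partial Aggregate and Parallel Seq Scan nodes
conn.close()
```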
Parallel index scan

Parallel index scan is based on the same concept as parallel sequential scan, but it involves more communication and waiting. Currently, parallel index scans are supported only for B-Tree indexes. In a parallel index scan, index pages are scanned in parallel: each process scans a single index block and returns all tuples referenced by that block, while other processes scan different index blocks and return their tuples. The results of a parallel B-Tree scan are then returned in sorted order.

Parallel bitmap heap scan

Again, this has the same concept as the parallel sequential scan. Explaining the difference, Munro said, "You've got a big bitmap and you are skipping ahead to the pages that contain interesting tuples." In a parallel bitmap heap scan, one process is chosen as the leader; it performs a scan of one or more indexes and creates a bitmap indicating which table blocks need to be visited. These table blocks are then divided among the worker processes as in a parallel sequential scan. Here the heap scan is done in parallel, but the underlying index scan is not.

Parallel joins

PostgreSQL supports all three join strategies in parallel query plans: nested loop join, hash join, and merge join. However, there is no parallelism within the inner side; the entire inner relation is scanned as a whole, and the parallelism comes into play when each worker executes the inner loop in full. The results of each join are sent to a gather node to produce the final result.

- Nested loop join: The nested loop is the most basic way for PostgreSQL to perform a join. Though it is considered slow, it can be efficient if the inner side is an index scan, because the outer tuples, and hence the loops that look up values in the index, are divided among the worker processes.
- Merge join: The inner side is executed in full. It can be inefficient when a sort needs to be performed, because the work and the resulting data are duplicated in every cooperating process.
- Hash join: Here too, the inner side is executed in full by every worker process to build identical copies of the hash table, which is inefficient when the hash table is large or the plan is expensive. However, in a parallel hash join, the inner side is a parallel hash that divides the work of building a shared hash table over the cooperating processes. This is the only join in which we can have parallelism on both sides.

Partition-wise join

Partition-wise join is a new feature introduced in PostgreSQL 11. In a partition-wise join, the planner knows that both sides of the join have matching partition schemes. A join between two similarly partitioned tables is broken down into joins between their matching partitions if there is an equi-join condition on the partition keys of the joining tables. Munro explains, "It becomes parallelizable with the advent of parallel append, which can then run different branches of that query plan in different processes. But if you do that then granularity of parallelism is partitioned, which is in some ways good and in some ways bad compared to block-based granularity." He further adds, "It means when the last worker runs out of work to do everyone else has to wait for that before the query is finished. Whereas, if you use block-based parallelism you don't have the problem but there are some advantages as a result of that as well."
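Following the same pattern, this hypothetical sketch enables partition-wise join (available since PostgreSQL 11) before inspecting a join between two similarly partitioned tables; the table and column names are placeholders.

```python
# Hypothetical illustration: enabling partition-wise join and checking the plan.
# Assumes PostgreSQL 11+, two compatibly partitioned tables, and PG_DSN set.
import os
import psycopg2

conn = psycopg2.connect(os.environ["PG_DSN"])
with conn, conn.cursor() as cur:
    cur.execute("SET enable_partitionwise_join = on;")
    cur.execute("""
        EXPLAIN
        SELECT *
        FROM orders_partitioned o
        JOIN payments_partitioned p ON p.order_id = o.order_id;
    """)
    for (line,) in cur.fetchall():
        print(line)   # with matching partition schemes, per-partition joins appear,
                      # often under an Append / Parallel Append node
conn.close()
```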
Parallel aggregation in PostgreSQL

Calculating aggregates can be very expensive, and when evaluated in a single process it can take a considerable amount of time. This problem was addressed in PostgreSQL 9.6 with the introduction of parallel aggregation. It is essentially a divide-and-conquer strategy in which multiple workers each calculate part of the aggregate before the leader computes the final value from those partial results.

This article walked you through some of the parallelism-specific features in PostgreSQL presented by Munro in his PostgresOpen 2018 talk. If you want to get to grips with other advanced PostgreSQL features and SQL functions, have a look at our Mastering PostgreSQL 11 - Second Edition book by Hans-Jürgen Schönig. By the end of this book, you will be able to use your database to its utmost capacity by implementing advanced administrative tasks with ease.

- PostgreSQL committer Stephen Frost shares his vision for PostgreSQL version 12 and beyond
- Introducing PostgREST, a REST API for any PostgreSQL database written in Haskell
- Percona announces Percona Distribution for PostgreSQL to support open source databases


Why are experts worried about Microsoft's billion dollar bet in OpenAI's AGI pipe dream?

Sugandha Lahoti
23 Jul 2019
6 min read
Microsoft has invested $1 billion in OpenAI with the goal of building next-generation supercomputers and a platform within Microsoft Azure which will scale to AGI (Artificial General Intelligence). This is a multiyear partnership with Microsoft becoming OpenAI’s preferred partner for commercializing new AI technologies. Open AI will become a big Azure customer, porting its services to run on Microsoft Azure. The $1 billion is a cash investment into OpenAI LP, which is Open AI’s for-profit corporate subsidiary. The investment will follow a standard capital commitment structure which means OpenAI can call for it, as they need it. But the company plans to spend it in less than five years. Per the official press release, “The companies will focus on building a computational platform in Azure for training and running advanced AI models, including hardware technologies that build on Microsoft’s supercomputing technology. These will be implemented in a safe, secure and trustworthy way and is a critical reason the companies chose to partner together.” They intend to license some of their pre-AGI technologies, with Microsoft becoming their preferred partner. “My goal in running OpenAI is to successfully create broadly beneficial A.G.I.,” Sam Altman, who co-founded Open AI with Elon Musk, said in a recent interview. “And this partnership is the most important milestone so far on that path.” Musk left the company in February 2019, to focus on Tesla and because he didn’t agree with some of what OpenAI team wanted to do. What does this partnership mean for Microsoft and Open AI OpenAI may benefit from this deal by keeping their innovations private which may help commercialization, raise more funds and get to AGI faster. For OpenAI this means the availability of resources for AGI, while potentially allowing founders and other investors with the opportunity to either double-down on OpenAI or reallocate resources to other initiatives However, this may also lead to them not disclosing progress, papers with details, and open source code as much as in the past. https://twitter.com/Pinboard/status/1153380118582054912 As for Microsoft, this deal is another attempt in quietly taking over open source. First, with the acquisition of GitHub and the subsequent launch of GitHub Sponsors, and now with becoming OpenAI’s ‘preferred partner’ for commercialization. Last year at an Investor conference, Nadella said, “AI is going to be one of the trends that is going to be the next big shift in technology. It's going to be AI at the edge, AI in the cloud, AI as part of SaaS applications, AI as part of in fact even infrastructure. And to me, to be the leader in it, it's not enough just to sort of have AI capability that we can exercise—you also need the ability to democratize it so that every business can truly benefit from it. That to me is our identity around AI.” Partnership with OpenAI seems to be a part of this plan. This deal can also possibly help Azure catch up with Google and Amazon both in hardware scalability and Artificial Intelligence offerings. A hacker news user comments, “OpenAI will adopt and make Azure their preferred platform. And Microsoft and Azure will jointly "develop new Azure AI supercomputing technologies", which I assume is advancing their FGPA-based deep learning offering. 
Google has a lead with TensorFlow + TPUs and this is a move to "buy their way in", which is a very Microsoft thing to do."
https://twitter.com/soumithchintala/status/1153308199610511360

It is also likely that Microsoft is investing money that will eventually be pumped back into its own company, as OpenAI buys computing power from the tech giant. Under the terms of the contract, Microsoft will eventually become the sole cloud computing provider for OpenAI, and most of that $1 billion will be spent on computing power, Altman says. OpenAI, previously focused on building ethical AI, will now pivot to building cutting-edge AI and moving toward AGI, sometimes even setting aside ethical ramifications in the push to deploy technology as early as possible, which is what Microsoft would be interested in monetizing.
https://twitter.com/CadeMetz/status/1153291410994532352

"I see two primary motivations: For OpenAI—to secure funding and to gain some control over hardware which in turn helps differentiate software. For MSFT—to elevate Azure in the minds of developers for AI training." - James Wang, Analyst at ARKInvest
https://twitter.com/jwangARK/status/1153338174871154689

However, the news of this investment did not go down well with some experts in the field, who saw it as a purely commercial deal and questioned whether OpenAI's switch to for-profit research undermines its claims to be "democratizing" AI.
https://twitter.com/fchollet/status/1153489165595504640

"I can't really parse its conversion into an LP—and Microsoft's huge investment—as anything but a victory for capital" - Robin Sloan, Author
https://twitter.com/robinsloan/status/1153346647339876352

"What is OpenAI? I don't know anymore." - Stephen Merity, Deep learning researcher
https://twitter.com/Smerity/status/1153364705777311745
https://twitter.com/SamNazarius/status/1153290666413383682

People are also speculating whether creating AGI is even possible. In a recent survey, experts estimated that there was a 50 percent chance of creating AGI by the year 2099. Per The New York Times, most experts believe A.G.I. will not arrive for decades or even centuries, and even Altman admits OpenAI may never get there. But the race is on nonetheless. Then why is Microsoft delivering the $1 billion over five years, considering that is neither enough money nor enough time to produce AGI?

Still, OpenAI has certainly impressed the tech community with its AI innovations. In April, OpenAI's algorithm trained to play the complex strategy game Dota 2 beat the world champion e-sports team OG at an event in San Francisco, winning the first two matches of the 'best-of-three' series. The competition pitted a human team of five professional Dota 2 players against an AI team of five OpenAI bots. In February, OpenAI released a new AI model, GPT-2, capable of generating coherent paragraphs of text without needing any task-specific training. However, experts felt the move signalled a turn toward 'closed AI' and propagated the 'fear of AI' because of the model's ability to write convincing fake news from just a few words.

- Github Sponsors: Could corporate strategy eat FOSS culture for dinner?
- Microsoft is seeking membership to Linux-distros mailing list for early access to security vulnerabilities
- OpenAI: Two new versions and the output dataset of GPT-2 out!


Facebook research suggests chatbots and conversational AI are on the verge of empathizing with humans

Fatema Patrawala
06 Aug 2019
6 min read
Last week, the Facebook AI research team published a progress report on dialogue research aimed at building more engaging and personalized AI systems. According to the team, "Dialogue research is a crucial component of building the next generation of intelligent agents. While there’s been progress with chatbots in single-domain dialogue, agents today are far from capable of carrying an open-domain conversation across a multitude of topics. Agents that can chat with humans in the way that people talk to each other will be easier and more enjoyable to use in our day-to-day lives — going beyond simple tasks like playing a song or booking an appointment."

In their blog post, they describe new open source data sets, algorithms, and models that address five common weaknesses of open-domain chatbots today: maintaining consistency, specificity, empathy, knowledgeability, and multimodal understanding. Let us look at each one in detail.

Dataset called Dialogue NLI introduced for maintaining consistency

Inconsistencies are a common issue for chatbots, partly because most models lack explicit long-term memory and semantic understanding. The Facebook team, in collaboration with colleagues at NYU, developed a new way of framing the consistency of dialogue agents as natural language inference (NLI) and created a new NLI data set called Dialogue NLI, used to improve and evaluate the consistency of dialogue models. The team showcased an example in which two utterances in a dialogue are treated as the premise and hypothesis, respectively. Each pair was labeled to indicate whether the premise entails, contradicts, or is neutral with respect to the hypothesis. Training an NLI model on this data set and using it to rerank the model's responses so that they entail, or at least remain consistent with, previous dialogue improved the overall consistency of the dialogue agent. Across these tests, the team reports roughly 3x fewer contradictions in the generated sentences.

Several conversational attributes were studied to balance specificity

As per the team, generative dialogue models frequently default to generic, safe responses, like "I don't know", even when a query needs a specific answer. Hence, the Facebook team, in collaboration with Stanford AI researcher Abigail See, studied how to fix this by controlling several conversational attributes, such as the level of specificity. In one experiment, they conditioned a bot on character information and asked "What do you do for a living?" A typical chatbot responds with the generic statement "I'm a construction worker." With control methods, the chatbots proposed more specific and engaging responses, like "I build antique homes and refurbish houses." In addition to specificity, the team mentioned that balancing question-asking and answering and controlling how repetitive the models are "make significant differences. The better the overall conversation flow, the more engaging and personable the chatbots and dialogue agents of the future will be."

Chatbot's ability to display empathy while responding was measured

The team worked with researchers from the University of Washington to introduce the first benchmark task of human-written empathetic dialogues centered on specific emotional labels, to measure a chatbot's ability to display empathy.
In addition to improving on automatic metrics, the team showed that using this data for both fine-tuning and as retrieval candidates leads to responses that are evaluated by humans as more empathetic, with an average improvement of 0.95 points (on a 1-to-5 scale) across three different retrieval and generative models. The next challenge for the team is that empathy-focused models should perform well in complex dialogue situations, where agents may require balancing empathy with staying on topic or providing information. Wikipedia dataset used to make dialogue models more knowledgeable The research team has improved dialogue models’ capability of demonstrating knowledge by collecting a data set with conversations from Wikipedia, and creating new model architectures that retrieve knowledge, read it, and condition responses on it. This generative model has yielded the most pronounced improvement and it is rated by humans as 26% more engaging than their knowledgeless counterparts. To engage with images, personality based captions were used To engage with humans, agents should not only comprehend dialogue but also understand images. In this research, the team focused on image captioning that is engaging for humans by incorporating personality. They collected a data set of human comments grounded in images, and trained models capable of discussing images with given personalities, which makes the system interesting for humans to talk to. 64% humans preferred these personality-based captions over traditional captions. To build strong models, the team considered both retrieval and generative variants, and leveraged modules from both the vision and language domains. They defined a powerful retrieval architecture, named TransResNet, that works by projecting the image, personality, and caption in the same space using image, personality, and text encoders. The team showed that their system was able to produce captions that are close to matching human performance in terms of engagement and relevance. And annotators preferred their retrieval model’s captions over captions written by people 49.5% of the time. Apart from this, Facebook team has released a new data collection and model evaluation tool, a Messenger-based Chatbot game called Beat the Bot, that allows people to interact directly with bots and other humans in real time, creating rich examples to help train models. To conclude, the Facebook AI team mentions, “Our research has shown that it is possible to train models to improve on some of the most common weaknesses of chatbots today. Over time, we’ll work toward bringing these subtasks together into one unified intelligent agent by narrowing and eventually closing the gap with human performance. In the future, intelligent chatbots will be capable of open-domain dialogue in a way that’s personable, consistent, empathetic, and engaging.” On Hacker News, this research has gained positive and negative reviews. Some of them discuss that if AI will converse like humans, it will do a lot of bad. While other users say that this is an impressive improvement in the field of conversational AI. A user comment reads, “I gotta say, when AI is able to converse like humans, a lot of bad stuff will happen. People are so used to the other conversation partner having self-interest, empathy, being reasonable. When enough bots all have a “swarm” program to move conversations in a particular direction, they will overwhelm any public conversation. 
Moreover, in individual conversations, you won’t be able to trust anything anyone says or negotiates. Just like playing chess or poker online now. And with deepfakes, you won’t be able to trust audio or video either. The ultimate shock will come when software can render deepfakes in realtime to carry on a conversation, as your friend but not. As a politician who “said crazy stuff” but really didn’t, but it’s in the realm of believability. I would give it about 20 years until it all goes to shit. If you thought fake news was bad, realtime deepfakes and AI conversations with “friends” will be worse.  Scroll Snapping and other cool CSS features come to Firefox 68 Google Chrome to simplify URLs by hiding special-case subdomains Lyft releases an autonomous driving dataset “Level 5” and sponsors research competition


Introduction to Clustering and Unsupervised Learning

Packt
23 Feb 2016
16 min read
The act of clustering, or spotting patterns in data, is not much different from spotting patterns in groups of people. In this article, you will learn:

- The ways clustering tasks differ from classification tasks
- How clustering defines a group, and how such groups are identified by k-means, a classic and easy-to-understand clustering algorithm
- The steps needed to apply clustering to a real-world task of identifying marketing segments among teenage social media users

Before jumping into action, we'll begin by taking an in-depth look at exactly what clustering entails. (For more resources related to this topic, see here.)

Understanding clustering

Clustering is an unsupervised machine learning task that automatically divides the data into clusters, or groups of similar items. It does this without having been told how the groups should look ahead of time. As we may not even know what we're looking for, clustering is used for knowledge discovery rather than prediction. It provides an insight into the natural groupings found within data. Without advance knowledge of what comprises a cluster, how can a computer possibly know where one group ends and another begins? The answer is simple. Clustering is guided by the principle that items inside a cluster should be very similar to each other, but very different from those outside. The definition of similarity might vary across applications, but the basic idea is always the same: group the data so that the related elements are placed together. The resulting clusters can then be used for action. For instance, you might find clustering methods employed in the following applications:

- Segmenting customers into groups with similar demographics or buying patterns for targeted marketing campaigns
- Detecting anomalous behavior, such as unauthorized network intrusions, by identifying patterns of use falling outside the known clusters
- Simplifying extremely large datasets by grouping features with similar values into a smaller number of homogeneous categories

Overall, clustering is useful whenever diverse and varied data can be exemplified by a much smaller number of groups. It results in meaningful and actionable data structures that reduce complexity and provide insight into patterns of relationships.

Clustering as a machine learning task

Clustering is somewhat different from the classification, numeric prediction, and pattern detection tasks we have examined so far. In each of these cases, the result is a model that relates features to an outcome or features to other features; conceptually, the model describes the existing patterns within data. In contrast, clustering creates new data. Unlabeled examples are given a cluster label that has been inferred entirely from the relationships within the data. For this reason, you will sometimes see the clustering task referred to as unsupervised classification because, in a sense, it classifies unlabeled examples. The catch is that the class labels obtained from an unsupervised classifier are without intrinsic meaning. Clustering will tell you which groups of examples are closely related (for instance, it might return the groups A, B, and C) but it's up to you to apply an actionable and meaningful label. To see how this impacts the clustering task, let's consider a hypothetical example. Suppose you were organizing a conference on the topic of data science.
To facilitate professional networking and collaboration, you planned to seat people in groups according to one of three research specialties: computer and/or database science, math and statistics, and machine learning. Unfortunately, after sending out the conference invitations, you realize that you had forgotten to include a survey asking which discipline the attendee would prefer to be seated with. In a stroke of brilliance, you realize that you might be able to infer each scholar's research specialty by examining his or her publication history. To this end, you begin collecting data on the number of articles each attendee published in computer science-related journals and the number of articles published in math or statistics-related journals. Using the data collected for several scholars, you create a scatterplot: As expected, there seems to be a pattern. We might guess that the upper-left corner, which represents people with many computer science publications but few articles on math, could be a cluster of computer scientists. Following this logic, the lower-right corner might be a group of mathematicians. Similarly, the upper-right corner, those with both math and computer science experience, may be machine learning experts. Our groupings were formed visually; we simply identified clusters as closely grouped data points. Yet in spite of the seemingly obvious groupings, we unfortunately have no way to know whether they are truly homogeneous without personally asking each scholar about his/her academic specialty. The labels we applied required us to make qualitative, presumptive judgments about the types of people that would fall into the group. For this reason, you might imagine the cluster labels in uncertain terms, as follows: Rather than defining the group boundaries subjectively, it would be nice to use machine learning to define them objectively. This might provide us with a rule in the form if a scholar has few math publications, then he/she is a computer science expert. Unfortunately, there's a problem with this plan. As we do not have data on the true class value for each point, a supervised learning algorithm would have no ability to learn such a pattern, as it would have no way of knowing what splits would result in homogenous groups. On the other hand, clustering algorithms use a process very similar to what we did by visually inspecting the scatterplot. Using a measure of how closely the examples are related, homogeneous groups can be identified. In the next section, we'll start looking at how clustering algorithms are implemented. This example highlights an interesting application of clustering. If you begin with unlabeled data, you can use clustering to create class labels. From there, you could apply a supervised learner such as decision trees to find the most important predictors of these classes. This is called semi-supervised learning. The k-means clustering algorithm The k-means algorithm is perhaps the most commonly used clustering method. Having been studied for several decades, it serves as the foundation for many more sophisticated clustering techniques. If you understand the simple principles it uses, you will have the knowledge needed to understand nearly any clustering algorithm in use today. Many such methods are listed on the following site, the CRAN Task View for clustering at http://cran.r-project.org/web/views/Cluster.html. As k-means has evolved over time, there are many implementations of the algorithm. 
One popular approach is described in: Hartigan JA, Wong MA. A k-means clustering algorithm. Applied Statistics. 1979; 28:100-108.

Even though clustering methods have advanced since the inception of k-means, this is not to imply that k-means is obsolete. In fact, the method may be more popular now than ever. The following table lists some reasons why k-means is still used widely:

Strengths:
- Uses simple principles that can be explained in non-statistical terms
- Highly flexible, and can be adapted with simple adjustments to address nearly all of its shortcomings
- Performs well enough under many real-world use cases

Weaknesses:
- Not as sophisticated as more modern clustering algorithms
- Because it uses an element of random chance, it is not guaranteed to find the optimal set of clusters
- Requires a reasonable guess as to how many clusters naturally exist in the data
- Not ideal for non-spherical clusters or clusters of widely varying density

The k-means algorithm assigns each of the n examples to one of the k clusters, where k is a number that has been determined ahead of time. The goal is to minimize the differences within each cluster and maximize the differences between the clusters. Unless k and n are extremely small, it is not feasible to compute the optimal clusters across all the possible combinations of examples. Instead, the algorithm uses a heuristic process that finds locally optimal solutions. Put simply, this means that it starts with an initial guess for the cluster assignments, and then modifies the assignments slightly to see whether the changes improve the homogeneity within the clusters.

We will cover the process in depth shortly, but the algorithm essentially involves two phases. First, it assigns examples to an initial set of k clusters. Then, it updates the assignments by adjusting the cluster boundaries according to the examples that currently fall into the cluster. The process of updating and assigning occurs several times until changes no longer improve the cluster fit. At this point, the process stops and the clusters are finalized. Due to the heuristic nature of k-means, you may end up with somewhat different final results by making only slight changes to the starting conditions. If the results vary dramatically, this could indicate a problem. For instance, the data may not have natural groupings or the value of k has been poorly chosen. With this in mind, it's a good idea to try a cluster analysis more than once to test the robustness of your findings. To see how the process of assigning and updating works in practice, let's revisit the case of the hypothetical data science conference. Though this is a simple example, it will illustrate the basics of how k-means operates under the hood.

Using distance to assign and update clusters

As with k-NN, k-means treats feature values as coordinates in a multidimensional feature space. For the conference data, there are only two features, so we can represent the feature space as a two-dimensional scatterplot as depicted previously. The k-means algorithm begins by choosing k points in the feature space to serve as the cluster centers. These centers are the catalyst that spurs the remaining examples to fall into place. Often, the points are chosen by selecting k random examples from the training dataset. As we hope to identify three clusters, according to this method, k = 3 points will be selected at random.
These points are indicated by the star, triangle, and diamond in the following diagram:

It's worth noting that although the three cluster centers in the preceding diagram happen to be widely spaced apart, this is not always necessarily the case. Since they are selected at random, the three centers could have just as easily been three adjacent points. As the k-means algorithm is highly sensitive to the starting position of the cluster centers, this means that random chance may have a substantial impact on the final set of clusters. To address this problem, k-means can be modified to use different methods for choosing the initial centers. For example, one variant chooses random values occurring anywhere in the feature space (rather than only selecting among the values observed in the data). Another option is to skip this step altogether; by randomly assigning each example to a cluster, the algorithm can jump ahead immediately to the update phase. Each of these approaches adds a particular bias to the final set of clusters, which you may be able to use to improve your results.

In 2007, an algorithm called k-means++ was introduced, which proposes an alternative method for selecting the initial cluster centers. It purports to be an efficient way to get much closer to the optimal clustering solution while reducing the impact of random chance. For more information, refer to Arthur D, Vassilvitskii S. k-means++: The advantages of careful seeding. Proceedings of the eighteenth annual ACM-SIAM symposium on discrete algorithms. 2007:1027-1035.

After choosing the initial cluster centers, the other examples are assigned to the cluster center that is nearest according to the distance function. You will remember that we studied distance functions while learning about k-Nearest Neighbors. Traditionally, k-means uses Euclidean distance, but Manhattan distance or Minkowski distance are also sometimes used. Recall that if n indicates the number of features, the formula for Euclidean distance between example x and example y is:

dist(x, y) = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2)

For instance, if we are comparing a guest with five computer science publications and one math publication to a guest with zero computer science papers and two math papers, we could compute this in R as follows:

> sqrt((5 - 0)^2 + (1 - 2)^2)
[1] 5.09902

Using this distance function, we find the distance between each example and each cluster center. The example is then assigned to the nearest cluster center. Keep in mind that as we are using distance calculations, all the features need to be numeric, and the values should be normalized to a standard range ahead of time.

As shown in the following diagram, the three cluster centers partition the examples into three segments labeled Cluster A, Cluster B, and Cluster C. The dashed lines indicate the boundaries for the Voronoi diagram created by the cluster centers. The Voronoi diagram indicates the areas that are closer to one cluster center than any other; the vertex where all three boundaries meet is the maximal distance from all three cluster centers. Using these boundaries, we can easily see the regions claimed by each of the initial k-means seeds:

Now that the initial assignment phase has been completed, the k-means algorithm proceeds to the update phase. The first step of updating the clusters involves shifting the initial centers to a new location, known as the centroid, which is calculated as the average position of the points currently assigned to that cluster.
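The chapter's own snippets use R; as a rough companion illustration, here is one assign-and-update pass of k-means in Python/numpy on made-up publication counts. The data, seed, and iteration count are placeholders, and real implementations also handle empty clusters and convergence checks more carefully.

```python
import numpy as np

def kmeans_iteration(X, centers):
    """One assign-and-update pass: assign each example to its nearest center
    (Euclidean distance), then move each center to the centroid of its cluster."""
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)  # n x k distances
    labels = dists.argmin(axis=1)                                        # nearest center
    new_centers = np.array([
        X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]  # keep empty
        for j in range(len(centers))                                        # clusters put
    ])
    return labels, new_centers

# Toy "publications" data: [computer science articles, math/stats articles]
X = np.array([[9.0, 1.0], [8.0, 2.0], [1.0, 9.0], [2.0, 8.0], [7.0, 7.0], [8.0, 8.0]])
centers = X[np.random.default_rng(42).choice(len(X), size=3, replace=False)]
for _ in range(10):                      # repeat until assignments stop changing
    labels, centers = kmeans_iteration(X, centers)
print(labels, centers, sep="\n")
```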
The following diagram illustrates how as the cluster centers shift to the new centroids, the boundaries in the Voronoi diagram also shift and a point that was once in Cluster B (indicated by an arrow) is added to Cluster A: As a result of this reassignment, the k-means algorithm will continue through another update phase. After shifting the cluster centroids, updating the cluster boundaries, and reassigning points into new clusters (as indicated by arrows), the figure looks like this: Because two more points were reassigned, another update must occur, which moves the centroids and updates the cluster boundaries. However, because these changes result in no reassignments, the k-means algorithm stops. The cluster assignments are now final: The final clusters can be reported in one of the two ways. First, you might simply report the cluster assignments such as A, B, or C for each example. Alternatively, you could report the coordinates of the cluster centroids after the final update. Given either reporting method, you are able to define the cluster boundaries by calculating the centroids or assigning each example to its nearest cluster. Choosing the appropriate number of clusters In the introduction to k-means, we learned that the algorithm is sensitive to the randomly-chosen cluster centers. Indeed, if we had selected a different combination of three starting points in the previous example, we may have found clusters that split the data differently from what we had expected. Similarly, k-means is sensitive to the number of clusters; the choice requires a delicate balance. Setting k to be very large will improve the homogeneity of the clusters, and at the same time, it risks overfitting the data. Ideally, you will have a priori knowledge (a prior belief) about the true groupings and you can apply this information to choosing the number of clusters. For instance, if you were clustering movies, you might begin by setting k equal to the number of genres considered for the Academy Awards. In the data science conference seating problem that we worked through previously, k might reflect the number of academic fields of study that were invited. Sometimes the number of clusters is dictated by business requirements or the motivation for the analysis. For example, the number of tables in the meeting hall could dictate how many groups of people should be created from the data science attendee list. Extending this idea to another business case, if the marketing department only has resources to create three distinct advertising campaigns, it might make sense to set k = 3 to assign all the potential customers to one of the three appeals. Without any prior knowledge, one rule of thumb suggests setting k equal to the square root of (n / 2), where n is the number of examples in the dataset. However, this rule of thumb is likely to result in an unwieldy number of clusters for large datasets. Luckily, there are other statistical methods that can assist in finding a suitable k-means cluster set. A technique known as the elbow method attempts to gauge how the homogeneity or heterogeneity within the clusters changes for various values of k. As illustrated in the following diagrams, the homogeneity within clusters is expected to increase as additional clusters are added; similarly, heterogeneity will also continue to decrease with more clusters. 
As you could continue to see improvements until each example is in its own cluster, the goal is not to maximize homogeneity or minimize heterogeneity, but rather to find k so that there are diminishing returns beyond that point. This value of k is known as the elbow point because it looks like an elbow. There are numerous statistics to measure homogeneity and heterogeneity within the clusters that can be used with the elbow method (the following information box provides a citation for more detail). Still, in practice, it is not always feasible to iteratively test a large number of k values. This is in part because clustering large datasets can be fairly time consuming; clustering the data repeatedly is even worse. Regardless, applications requiring the exact optimal set of clusters are fairly rare. In most clustering applications, it suffices to choose a k value based on convenience rather than strict performance requirements.

For a very thorough review of the vast assortment of cluster performance measures, refer to: Halkidi M, Batistakis Y, Vazirgiannis M. On clustering validation techniques. Journal of Intelligent Information Systems. 2001; 17:107-145.

The process of setting k itself can sometimes lead to interesting insights. By observing how the characteristics of the clusters change as k is varied, one might infer where the data have naturally defined boundaries. Groups that are more tightly clustered will change little, while less homogeneous groups will form and disband over time. In general, it may be wise to spend little time worrying about getting k exactly right. The next example will demonstrate how even a tiny bit of subject-matter knowledge borrowed from a Hollywood film can be used to set k such that actionable and interesting clusters are found. As clustering is unsupervised, the task is really about what you make of it; the value is in the insights you take away from the algorithm's findings.

Summary

This article covered only the fundamentals of clustering. As a very mature machine learning method, there are many variants of the k-means algorithm, as well as many other clustering algorithms that bring unique biases and heuristics to the task. Based on the foundation in this article, you will be able to understand and apply other clustering methods to new problems. To learn more about different machine learning techniques, the following books published by Packt Publishing (https://www.packtpub.com/) are recommended:

- Learning Data Mining with R (https://www.packtpub.com/big-data-and-business-intelligence/learning-data-mining-r)
- Mastering Scientific Computing with R (https://www.packtpub.com/application-development/mastering-scientific-computing-r)
- R for Data Science (https://www.packtpub.com/big-data-and-business-intelligence/r-data-science)

Further resources on this subject:

- Displaying SQL Server Data using a Linq Data Source [article]
- Probability of R? [article]
- Working with Commands and Plugins [article]
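To close out the clustering discussion, here is a minimal sketch of the elbow heuristic described above, using scikit-learn's KMeans and its inertia_ attribute (the within-cluster sum of squares); the data here is a random placeholder standing in for a normalized feature matrix.

```python
# A minimal sketch of the elbow heuristic, assuming X is a numeric,
# normalized feature matrix; random data is used as a stand-in.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))   # placeholder data
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(model.inertia_, 1))   # look for the k where improvement levels off
```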


Getting started with the Confluent Platform: Apache Kafka for enterprise

Amarabha Banerjee
27 Feb 2018
9 min read
This article is a book excerpt from Apache Kafka 1.0 Cookbook, written by Raúl Estrada. The book shows how to use Kafka efficiently, with practical solutions to the common problems that developers and administrators usually face while working with it. In today's tutorial, we will talk about the Confluent Platform and how to get started with organizing and managing data from several sources in one high-performance and reliable system.

The Confluent Platform is a full stream data system. It enables you to organize and manage data from several sources in one high-performance and reliable system. As mentioned in the first few chapters, the goal of an enterprise service bus is not only to provide the system a means to transport messages and data but also to provide all the tools required to connect the data origins (data sources), applications, and data destinations (data sinks) to the platform. The Confluent Platform has these parts:

- Confluent Platform open source
- Confluent Platform enterprise
- Confluent Cloud

The Confluent Platform open source has the following components:

- Apache Kafka core
- Kafka Streams
- Kafka Connect
- Kafka clients
- Kafka REST Proxy
- Kafka Schema Registry

The Confluent Platform enterprise has the following components:

- Confluent Control Center
- Confluent support, professional services, and consulting

All the components are open source except the Confluent Control Center, which is proprietary to Confluent Inc. An explanation of each component is as follows:

- Kafka core: The Kafka brokers discussed so far in this book.
- Kafka Streams: The Kafka library used to build stream processing systems.
- Kafka Connect: The framework used to connect Kafka with databases, stores, and filesystems.
- Kafka clients: The libraries for writing/reading messages to/from Kafka. Note that there are clients for these languages: Java, Scala, C/C++, Python, and Go.
- Kafka REST Proxy: If the application doesn't run in the Kafka clients' programming languages, this proxy allows connecting to Kafka through HTTP.
- Kafka Schema Registry: Recall that an enterprise service bus should have a message template repository. The Schema Registry is the repository of all the schemas and their historical versions, made to ensure that if an endpoint changes, then all the involved parties are acknowledged.
- Confluent Control Center: A powerful web graphical user interface for managing and monitoring Kafka systems.
- Confluent Cloud: Kafka as a service, a cloud offering that reduces the burden of operations.

Installing the Confluent Platform

In order to use the REST Proxy and the Schema Registry, we need to install the Confluent Platform. The Confluent Platform also has important administration, operation, and monitoring features that are fundamental for modern Kafka production systems.

Getting ready

At the time of writing, the Confluent Platform version is 4.0.0. Currently, the supported operating systems are:

- Debian 8
- Red Hat Enterprise Linux
- CentOS 6.8 or 7.2
- Ubuntu 14.04 LTS and 16.04 LTS

macOS is currently supported only for testing and development purposes, not for production environments. Windows is not yet supported. Oracle Java 1.7 or higher is required.
Installing the Confluent Platform

In order to use the REST Proxy and the Schema Registry, we need to install the Confluent Platform. Also, the Confluent Platform has important administration, operation, and monitoring features that are fundamental for modern Kafka production systems.

Getting ready

At the time of writing this book, the Confluent Platform version is 4.0.0. Currently, the supported operating systems are:

Debian 8
Red Hat Enterprise Linux
CentOS 6.8 or 7.2
Ubuntu 14.04 LTS and 16.04 LTS

macOS is currently supported only for testing and development purposes, not for production environments. Windows is not yet supported. Oracle Java 1.7 or higher is required.

The default ports for the components are:

2181: Apache ZooKeeper
8081: Schema Registry (REST API)
8082: Kafka REST Proxy
8083: Kafka Connect (REST API)
9021: Confluent Control Center
9092: Apache Kafka brokers

It is important to have these ports, or the ports where the components are going to run, open.

How to do it

There are two ways to install: downloading the compressed files or using the apt-get command.

To install from the compressed files:

Download the Confluent open source v4.0 or Confluent Enterprise v4.0 TAR files from https://www.confluent.io/download/

Uncompress the archive file (the recommended path for installation is under /opt)

To start the Confluent Platform, run this command:

$ <confluent-path>/bin/confluent start

The output should be as follows:

Starting zookeeper
zookeeper is [UP]
Starting kafka
kafka is [UP]
Starting schema-registry
schema-registry is [UP]
Starting kafka-rest
kafka-rest is [UP]
Starting connect
connect is [UP]

To install with the apt-get command (in Debian and Ubuntu):

Install the Confluent public key used to sign the packages in the APT repository:

$ wget -qO - http://packages.confluent.io/deb/4.0/archive.key | sudo apt-key add -

Add the repository to the sources list:

$ sudo add-apt-repository "deb [arch=amd64] http://packages.confluent.io/deb/4.0 stable main"

Finally, run apt-get update and install the Confluent Platform.

To install Confluent open source:

$ sudo apt-get update && sudo apt-get install confluent-platform-oss-2.11

To install Confluent Enterprise:

$ sudo apt-get update && sudo apt-get install confluent-platform-2.11

The end of the package name specifies the Scala version. Currently, the supported versions are 2.11 (recommended) and 2.10.

There's more

The Confluent Platform provides the system and component packages. The commands in this recipe are for installing all components of the platform. To install individual components, follow the instructions on this page: https://docs.confluent.io/current/installation/available_packages.html#available-packages.

Using Kafka operations

With the Confluent Platform installed, the administration, operation, and monitoring of Kafka become very simple. Let's review how to operate Kafka with the Confluent Platform.

Getting ready

For this recipe, Confluent should be installed, up, and running.

How to do it

The commands in this section should be executed from the directory where the Confluent Platform is installed:

To start ZooKeeper, Kafka, and the Schema Registry with one command, run:

$ confluent start schema-registry

The output of this command should be:

Starting zookeeper
zookeeper is [UP]
Starting kafka
kafka is [UP]
Starting schema-registry
schema-registry is [UP]

To execute the commands outside the installation directory, add Confluent's bin directory to PATH:

export PATH=<path_to_confluent>/bin:$PATH

To manually start each service with its own command, run:

$ ./bin/zookeeper-server-start ./etc/kafka/zookeeper.properties
$ ./bin/kafka-server-start ./etc/kafka/server.properties
$ ./bin/schema-registry-start ./etc/schema-registry/schema-registry.properties

Note that the syntax of all the commands is exactly the same as always but without the .sh extension.
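Once confluent start reports each service as [UP], a quick health check can also be done programmatically. The snippet below is a sketch, not part of the recipe: it assumes the broker is on its default port (9092) and that the confluent-kafka Python client is installed (pip install confluent-kafka).

from confluent_kafka.admin import AdminClient

admin = AdminClient({"bootstrap.servers": "localhost:9092"})
md = admin.list_topics(timeout=10)  # fails if no broker answers within the timeout

print("Brokers:", [b.id for b in md.brokers.values()])
print("Topics :", sorted(md.topics.keys()))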
To create a topic called test_topic, run the following command:

$ ./bin/kafka-topics --zookeeper localhost:2181 --create --topic test_topic --partitions 1 --replication-factor 1

To send an Avro message to test_topic in the broker without writing a single line of code, use the following command:

$ ./bin/kafka-avro-console-producer --broker-list localhost:9092 --topic test_topic --property value.schema='{"name":"person","type":"record","fields":[{"name":"name","type":"string"},{"name":"age","type":"int"}]}'

Send some messages and press Enter after each line:

{"name": "Alice", "age": 27}
{"name": "Bob", "age": 30}
{"name": "Charles", "age": 57}

Pressing Enter on an empty line is interpreted as null. To shut down the process, press Ctrl + C.

To consume the Avro messages from test_topic since the beginning, type:

$ ./bin/kafka-avro-console-consumer --topic test_topic --zookeeper localhost:2181 --from-beginning

The messages created in the previous step will be written to the console in the format they were introduced. To shut down the consumer, press Ctrl + C.

To test the Avro schema validation, try to produce data on the same topic using an incompatible schema, for example, with this producer:

$ ./bin/kafka-avro-console-producer --broker-list localhost:9092 --topic test_topic --property value.schema='{"type":"string"}'

After you've hit Enter on the first message, the following exception is raised:

org.apache.kafka.common.errors.SerializationException: Error registering Avro schema: "string"
Caused by: io.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException: Schema being registered is incompatible with the latest schema; error code: 409
at io.confluent.kafka.schemaregistry.client.rest.utils.RestUtils.httpRequest(RestUtils.java:146)

To shut down the services (Schema Registry, broker, and ZooKeeper), run:

confluent stop

To delete all the producer messages stored in the broker, run this:

confluent destroy

There's more

With the Confluent Platform, it is possible to manage all of the Kafka system through the Kafka operations, which are classified as follows:

Production deployment: Hardware configuration, file descriptors, and ZooKeeper configuration
Post deployment: Admin operations, rolling restart, backup, and restoration
Auto data balancing: Rebalancer execution and decommissioning brokers
Monitoring: Metrics for each concept—broker, ZooKeeper, topics, producers, and consumers
Metrics reporter: Message size, security, authentication, authorization, and verification
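The console producer and consumer used above also have programmatic counterparts. The following is a minimal sketch with the confluent-kafka Python client (assumed to be installed via pip install confluent-kafka), using plain JSON-encoded values; the Avro serializers are left out to keep the example short.

import json
from confluent_kafka import Consumer, Producer

# Produce one JSON-encoded message to test_topic
producer = Producer({"bootstrap.servers": "localhost:9092"})
producer.produce("test_topic", json.dumps({"name": "Alice", "age": 27}).encode("utf-8"))
producer.flush()

# Read it back from the beginning of the topic
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "test-group",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["test_topic"])
msg = consumer.poll(10.0)  # wait up to 10 seconds for a message
if msg is not None and msg.error() is None:
    print(json.loads(msg.value()))
consumer.close()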
Monitoring with the Confluent Control Center

This recipe shows you how to use the metrics reporter of the Confluent Control Center.

Getting ready

The execution of the previous recipe is needed. Before starting the Control Center, configure the metrics reporter:

Back up the server.properties file located at <confluent_path>/etc/kafka/server.properties

In the server.properties file, uncomment the following lines:

metric.reporters=io.confluent.metrics.reporter.ConfluentMetricsReporter
confluent.metrics.reporter.bootstrap.servers=localhost:9092
confluent.metrics.reporter.topic.replicas=1

Back up the Kafka Connect configuration located in <confluent_path>/etc/schema-registry/connect-avro-distributed.properties

Add the following lines at the end of the connect-avro-distributed.properties file:

consumer.interceptor.classes=io.confluent.monitoring.clients.interceptor.MonitoringConsumerInterceptor
producer.interceptor.classes=io.confluent.monitoring.clients.interceptor.MonitoringProducerInterceptor

Start the Confluent Platform:

$ <confluent_path>/bin/confluent start

Before starting the Control Center, change its configuration:

Back up the control-center.properties file located in <confluent_path>/etc/confluent-control-center/control-center.properties

Add the following lines at the end of the control-center.properties file:

confluent.controlcenter.internal.topics.partitions=1
confluent.controlcenter.internal.topics.replication=1
confluent.controlcenter.command.topic.replication=1
confluent.monitoring.interceptor.topic.partitions=1
confluent.monitoring.interceptor.topic.replication=1
confluent.metrics.topic.partitions=1
confluent.metrics.topic.replication=1

Start the Control Center:

<confluent_path>/bin/control-center-start

How to do it

Open the Control Center web graphical user interface at the following URL: http://localhost:9021/.

The test_topic created in the previous recipe is needed:

$ <confluent_path>/bin/kafka-topics --zookeeper localhost:2181 --create --topic test_topic --partitions 1 --replication-factor 1

From the Control Center, click on the Kafka Connect button on the left. Click on the New source button.

From the connector class drop-down menu, select SchemaSourceConnector. Specify Connection Name as Schema-Avro-Source.

In the topic name, specify test_topic.

Click on Continue, and then click on the Save & Finish button to apply the configuration.

To create a new sink, follow these steps:

From Kafka Connect, click on the SINKS button and then on the New sink button.

From the topics list, choose test_topic and click on the Continue button.

In the SINKS tab, set the connection class to SchemaSourceConnector; specify Connection Name as Schema-Avro-Source.

Click on the Continue button and then on Save & Finish to apply the new configuration.

How it works

Click on the Data streams tab and a chart shows the total number of messages produced and consumed on the cluster.

To summarize, we discussed how to get started with the Apache Kafka Confluent Platform. If you liked our post, please be sure to check out Apache Kafka 1.0 Cookbook, which consists of useful recipes to work with your Apache Kafka installation.

How to prevent errors while using utilities for loading data in Teradata

Pravin Dhandre
11 Jun 2018
9 min read
In today's tutorial, we will help you overcome the errors that arise while loading, deleting, or updating large volumes of data using Teradata utilities.

[box type="note" align="" class="" width=""]This article is an excerpt from Teradata Cookbook co-authored by Abhinav Khandelwal and Rajsekhar Bhamidipati. This book provides recipes to simplify the daily tasks performed by database administrators (DBA) along with providing efficient data warehousing solutions in the Teradata database system.[/box]

Resolving FastLoad error 2652

When data is being loaded via FastLoad, a table lock is placed on the target table. This means that the table is unavailable for any other operation. A lock on a table is only released when FastLoad encounters the END LOADING command, which terminates phase 2, the so-called application phase. FastLoad may get terminated in phase 1 due to any of the following reasons:

The load script results in failure (error code 8 or 12)
The load script is aborted by the admin or some other session
FastLoad fails due to a bad record or file
The end loading statement is missing from the script

If so, it keeps a lock on the table, which needs to be released manually. In this recipe, we will see the steps to release FastLoad locks.

Getting ready

Identify the table on which FastLoad has ended prematurely and which is left in a locked state. You need valid credentials for the Teradata Database. Execute the dummy FastLoad script from the same user or a user that has write access to the locked table. A user requires the following privileges/rights in order to execute FastLoad:

SELECT and INSERT (CREATE and DROP or DELETE) access to the target or loading table
CREATE and DROP TABLE on the error tables
SELECT, INSERT, UPDATE, and DELETE privileges for the user PUBLIC on the restart log table (SYSADMIN.FASTLOG). There will be a row in the FASTLOG table for each FastLoad job that has not completed in the system.

How to do it...

Open a notepad and create the following script:

.LOGON 127.0.0.1/dbc, dbc;                   /* valid system name and credentials for your system */
.DATABASE Database_Name;                     /* database under which the locked table is */
BEGIN LOADING locked_table                   /* table which is getting the 2652 error */
  ERRORFILES errortable_name, uv_tablename;  /* same error table names as in the original script */
.END LOADING;                                /* to end phase 2 and release the lock */
.LOGOFF;

Save it as dummy_fl.txt.

Open the Windows Command Prompt and execute it using the FastLoad command, as shown in the following screenshot:

This dummy script with no insert statement should release the lock on the target table. Execute a SELECT on the locked table to see if the lock has been released.

How it works...

As FastLoad is designed to work only on empty tables, it becomes necessary that the loading of the table finishes in one go. If the load script errors out prematurely without encountering the END LOADING command, it leaves a lock on the loading table. FastLoad locks can't be released via the HUT utility, as there is no technical lock on the table. To execute FastLoad, the following are some requirements:

Log table: FastLoad puts its progress information in the fastlog table.
EMPTY TABLE: FastLoad needs the table to be empty before inserting rows into that table.
TWO ERROR TABLES: FastLoad requires two error tables to be created; you just need to name them, and no DDL is required.

The first error table records any translation or constraint violation errors, whereas the second error table captures errors related to duplicate values for Unique Primary Indexes (UPI). After the completion of FastLoad, you can analyze these error tables to see why the records got rejected.
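If you want to confirm which FastLoad jobs the system still considers incomplete before running the dummy script, the restart log table mentioned above (SYSADMIN.FASTLOG) can be queried directly. This is a sketch using the teradatasql driver (pip install teradatasql); the host and credentials are placeholders for your own system.

import teradatasql

with teradatasql.connect(host="127.0.0.1", user="dbc", password="dbc") as con:
    cur = con.cursor()
    cur.execute("SELECT * FROM SYSADMIN.FASTLOG")
    for row in cur.fetchall():
        print(row)  # one row per FastLoad job that has not completed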
There's more...

If this does not fix the issue, you need to drop the target table and the error tables associated with it. Before proceeding with dropping the tables, check with the administrator to abort any FastLoad sessions associated with the table.

Resolving MLOAD error 2571

MLOAD works in five phases, unlike FastLoad, which only works in two phases. MLOAD can fail in either phase three or four. The five phases are as follows:

Preliminary: Basic setup. Syntax checking, establishing sessions with the Teradata Database, creation of error tables (two error tables per target table), and the creation of work tables and log tables are done in this phase.
DML Transaction phase: The request is parsed through the PE and a step plan is generated. Steps and DML are then sent to the AMPs and stored in the appropriate work tables for each target table. The input data will be stored in these work tables and applied to the target table later on.
Acquisition phase: Unsorted data is sent to the AMPs in blocks of 64K. Rows are hashed by PI and sent to the appropriate AMPs. The utility places locks on the target tables in preparation for the application phase, which applies rows to the target tables.
Application phase: Changes are applied to the target tables and NUSI subtables. The lock on the table is held in this phase.
Cleanup phase: If the error code of all the steps is 0, MLOAD completes successfully and releases all the locks on the specified table. This being the case, all empty error tables, work tables, and the log table are dropped.

Getting ready

Identify the table which is affected by error 2571. Make sure no host utility is running on this table and that the load job is in a failed state for this table.

How to do it...

Check on Viewpoint for any active utility job for this table. If you find any active job, let it complete.

If there is a reason that you need to release the lock, first abort all the sessions of the host utility from Viewpoint. Ask your administrator to do it.

Execute the following command:

RELEASE MLOAD <databasename.tablename>;

If you get a Not able to release MLOAD lock error, execute the following command:

/* Release lock in application phase */
RELEASE MLOAD <databasename.tablename> IN APPLY;

Once the locks are released, you need to drop all the associated error tables, the log table, and the work tables.

Re-execute MLOAD after correcting the error.

How it works...

The MLOAD utility places locks in the table headers to alert other utilities that a MultiLoad is in session for this table. They include:

Acquisition lock: DML allows all; DDL allows DROP only
Application lock: DML allows SELECT with ACCESS only; DDL allows DROP only

There's more...

If the RELEASE MLOAD statement still gives an error and does not release the lock on the table, you need to use SELECT with an ACCESS lock to copy the contents of the locked table to a new table and then drop the locked tables. If you start receiving the error 7446 Mload table %ID cannot be released because NUSI exists, you need to drop all the NUSIs on the table and use ALTER TABLE to change the table to non-fallback to accomplish the task.
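The two RELEASE statements from this recipe can also be issued programmatically, trying the plain release first and falling back to the IN APPLY form. Again, this is only a sketch using the teradatasql driver; the database name, table name, and credentials are placeholders.

import teradatasql

table = "my_database.my_locked_table"  # hypothetical name; replace with the affected table

with teradatasql.connect(host="127.0.0.1", user="dbc", password="dbc") as con:
    cur = con.cursor()
    try:
        cur.execute(f"RELEASE MLOAD {table}")
    except Exception as err:
        print("Plain release failed, retrying in the application phase:", err)
        cur.execute(f"RELEASE MLOAD {table} IN APPLY")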
Resolving failure 7547

This error is associated with the UPDATE statement, which could be SQL based or could be in MLOAD. At various times, while updating a set of rows in a table, the update fails with Failure 7547 Target row updated by multiple source rows. This error happens when you update the target with multiple rows from the source, which means there are duplicate values present in the source table.

Getting ready

Let's create sample volatile tables and insert values into them. After that, we will execute the UPDATE command, which will fail with error 7547:

Create a TARGET TABLE with the following DDL and insert values into it:

/* TARGET TABLE */
CREATE VOLATILE TABLE accounts
(
  CUST_ID   INTEGER,
  CUST_NAME VARCHAR(30),
  Sal       INTEGER
) PRIMARY INDEX (CUST_ID)
ON COMMIT PRESERVE ROWS;

INSERT INTO accounts VALUES (1,'will',2000);
INSERT INTO accounts VALUES (2,'bekky',2800);
INSERT INTO accounts VALUES (3,'himesh',4000);

Create a SOURCE TABLE with the following DDL and insert values into it:

/* SOURCE TABLE */
CREATE VOLATILE TABLE Hr_payhike
(
  CUST_ID   INTEGER,
  CUST_NAME VARCHAR(30),
  Sal_hike  INTEGER
) PRIMARY INDEX (CUST_ID)
ON COMMIT PRESERVE ROWS;

INSERT INTO Hr_payhike VALUES (1,'will',2030);
INSERT INTO Hr_payhike VALUES (1,'bekky',3800);
INSERT INTO Hr_payhike VALUES (3,'himesh',7000);

Execute the update part of the MLOAD script. The following snippet (which will fail) contains only that update:

/* Snippet from MLOAD update */
UPDATE ACC
FROM accounts ACC, Hr_payhike SUPD
SET Sal = SUPD.Sal_hike
WHERE ACC.CUST_ID = SUPD.CUST_ID;

Failure: Target row updated by multiple source rows

How to do it...

Check for duplicate values in the source table using the following:

/* Check for duplicate values in the source table */
SELECT CUST_ID, COUNT(*)
FROM Hr_payhike
GROUP BY 1
ORDER BY 2 DESC;

The output shows that CUST_ID = 1 has two rows, which is what causes the error: while updating the TARGET table, the optimizer cannot determine which source row it should use to update the target row. Whose salary should be applied, Will's or Bekky's?

To resolve the error, execute the following update query:

/* Update part of MLOAD */
UPDATE ACC
FROM accounts ACC,
(
  SELECT CUST_ID, CUST_NAME, SAL_HIKE
  FROM Hr_payhike
  QUALIFY ROW_NUMBER() OVER (PARTITION BY CUST_ID ORDER BY CUST_NAME, SAL_HIKE DESC) = 1
) SUPD
SET Sal = SUPD.Sal_hike
WHERE ACC.CUST_ID = SUPD.CUST_ID;

Now, the update will run without error.

How it works...

The failure happens when you update the target with multiple rows from the source. If you have defined a primary index column for your target, and those columns are in the update query's condition, this error will occur. To resolve this further, you can delete the duplicates from the source table itself and execute the original update without any modification. But if the source data can't be changed, then you need to change the update statement.

To summarize, we have successfully learned how to overcome or prevent errors while using utilities for loading data into the database. You could also check out the Teradata Cookbook for more than 100 recipes on enterprise data warehousing solutions.

2018 is the year of graph databases. Here's why.
6 reasons to choose MySQL 8 for designing database solutions
Amazon Neptune, AWS' cloud graph database, is now generally available

Accountability and algorithmic bias: Why diversity and inclusion matters [NeurIPS Invited Talk]

Sugandha Lahoti
08 Dec 2018
4 min read
One of the most awaited machine learning conference, NeurIPS 2018 is happening throughout this week in Montreal, Canada. It will feature a series of tutorials, invited talks, product releases, demonstrations, presentations, and announcements related to machine learning research. For the first time, NeurIPS invited a diversity and inclusion (D&I) speaker Laura Gomez to talk about the lack of diversity in the tech industry, which leads to biased algorithms, faulty products, and unethical tech. Laura Gomez is the CEO of Atipica that helps tech companies find and hire diverse candidates. Being a Latina woman herself, she had to face oppression when seeking capital and funds for her startup trying to establish herself in Silicon Valley. This experience led to her realization that there is a strong need to talk about why diversity and inclusion matters. Her efforts were not in vain and recently, she raised $2M in seed funding led by True Ventures. “At Atipica, we think of Inclusive AI in terms of data science, algorithms, and their ethical implications. This way you can rest assure our models are not replicating the biases of humans that hinder diversity while getting patent-pending aggregate demographic insights of your talent pool,” reads the website. She talks about her journey as a Latina woman in the tech industry. She reminisced on how she was the only one like her who got an internship with Hewlett Packard and the fact that she hated it. Nevertheless, she still decided to stay, determined not to let the industry turn her into a victim. She believes she made the right choice going forward with tech; now, years later, diversity is dominating the conversation in the industry. After HP, she also worked at Twitter and YouTube, helping them translate and localize their applications for a global audience. She is also a founding advisor of Project Include, which is a non-profit organization run by women, that uses data and advocacy to accelerate diversity and inclusion solutions in the tech industry. She opened her talk by agreeing to a quote from Safiya Noble, who wrote Algorithms of Oppression. “Artificial Intelligence will become a major human rights issue in the twenty-first century.” She believes we need to talk about difficult questions such as where AI is heading? And where should we hold ourselves and each other accountable.” She urges people to evaluate their role in AI, bias, and inclusion, to find the empathy and value in difficult conversations, and to go beyond your immediate surroundings to consider the broader consequences. It is important to build accountable AI in a way that allows humanity to triumph. She touched upon discriminatory moves by tech giants like Amazon and Google. Amazon recently killed off its AI recruitment tool because it couldn’t stop discriminating against women. She also criticized upon Facebook’s Myanmar operation where Facebook data scientists were building algorithms for hate speech. They didn’t understand the importance of localization or language or actually internationalize their own algorithms to be inclusive towards all the countries. She also talked about algorithmic bias in library discovery systems, as well as how even ‘black robots’ are being impacted by racism. She also condemned Palmer Luckey's work who is helping U.S. immigration agents on the border wall identify Latin refugees. 
Finally, she urged people to take three major steps to progress towards being inclusive: Be an ally Think of inclusion as an approach, not a feature Work towards an Ethical AI Head over to NeurIPS facebook page for the entire talk and other sessions happening at the conference this week. NeurIPS 2018: Deep learning experts discuss how to build adversarially robust machine learning models NeurIPS 2018 paper: DeepMind researchers explore autoregressive discrete autoencoders (ADAs) to model music in raw audio at scale NeurIPS 2018: A quick look at data visualization for Machine learning by Google PAIR researchers [Tutorial]

Experts present most pressing issues facing global lawmakers on citizens’ privacy, democracy and rights to freedom of speech

Sugandha Lahoti
31 May 2019
17 min read
The Canadian Parliament's Standing Committee on Access to Information, Privacy, and Ethics are hosting a hearing on Big Data, Privacy and Democracy from Monday, May 27 to Wednesday, May 29 as a series of discussions with experts, and tech execs over the three days. The committee invited expert witnesses to testify before representatives from 12 countries ( Canada, United Kingdom, Singapore, Ireland, Germany, Chile, Estonia, Mexico, Morocco, Ecuador, St. Lucia, and Costa Rica) on how governments can protect democracy and citizen rights in the age of big data. The committee opened with a round table discussion where expert witnesses spoke about what they believe to be the most pressing issues facing lawmakers when it comes to protecting the rights of citizens in the digital age. Expert witnesses that took part were: Professor Heidi Tworek, University of British Columbia Jason Kint, CEO of Digital Content Next Taylor Owen, McGill University Ben Scott, The Center for Internet and Society, Stanford Law School Roger McNamee, Author of Zucked: Waking up to the Facebook Catastrophe Shoshana Zuboff, Author of The Age of Surveillance Capitalism Maria Ressa, Chief Executive Officer and Executive Editor of Rappler Inc. Jim Balsillie, Chair, Centre for International Governance Innovation The session was led by Bob Zimmer, M.P. and Chair of the Standing Committee on Access to Information, Privacy and Ethics. Other members included Nathaniel Erskine-Smith, and Charlie Angus, M.P. and Vice-Chair of the Standing Committee on Access to Information, Privacy and Ethics. Also present was Damian Collins, M.P. and Chair of the UK Digital, Culture, Media and Sport Committee. Testimonies from the witnesses “Personal data matters more than context”, Jason Kint, CEO of Digital Content Next The presentation started with Mr. Jason Kint, CEO of Digital Content Next, a US based Trade association, who thanked the committee and appreciated the opportunity to speak on behalf of 80 high-quality digital publishers globally. He begins by saying how DCN has prioritized shining a light on issues that erode trust in the digital marketplace, including a troubling data ecosystem that has developed with very few legitimate constraints on the collection and use of data about consumers. As a result personal data is now valued more highly than context, consumer expectations, copyright, and even facts themselves. He believes it is vital that policymakers begin to connect the dots between the three topics of the committee's inquiry, data privacy, platform dominance and, societal impact. He says that today personal data is frequently collected by unknown third parties without consumer knowledge or control. This data is then used to target consumers across the web as cheaply as possible. This dynamic creates incentives for bad actors, particularly on unmanaged platforms, like social media, which rely on user-generated content mostly with no liability. Here the site owners are paid on the click whether it is from an actual person or a bot on trusted information or on disinformation. He says that he is optimistic about regulations like the GDPR in the EU which contain narrow purpose limitations to ensure companies do not use data for secondary uses. He recommends exploring whether large tech platforms that are able to collect data across millions of devices, websites, and apps should even be allowed to use this data for secondary purposes. 
He also applauds the decision of the German cartel office to limit Facebook's ability to collect and use data across its apps and the web. He further says that issues such as bot fraud, malware, ad blockers, clickbait, privacy violations and now disinformation are just symptoms. The root cause is unbridled data collection at the most personal level.  Four years ago DC ended the original financial analysis labeling Google and Facebook the duopoly of digital advertising. In a 150+ billion dollar digital ad market across the North America and the EU, 85 to 90 percent of the incremental growth is going to just these two companies. DNC dug deeper and connected the revenue concentration to the ability of these two companies to collect data in a way that no one else can. This means both companies know much of your browsing history and your location history. The emergence of this duopoly has created a misalignment between those who create the content and those who profit from it. The scandal involving Facebook and Cambridge analytic underscores the current dysfunctional dynamic. With the power Facebook has over our information ecosystem our lives and our democratic systems it is vital to know whether we can trust the company. He also points out that although, there's been a well documented and exhausting trail of apologies, there's been little or no change in the leadership or governance of Facebook. In fact the company has repeatedly refused to have its CEO offer evidence to pressing international government. He believes there should be a deeper probe as there's still much to learn about what happened and how much Facebook knew about the Cambridge Analytica scandal before it became public. Facebook should be required to have an independent audit of its user account practices and its decisions to preserve or purge real and fake accounts over the past decade. He ends his testimony saying that it is critical to shed light on these issues to understand what steps must be taken to improve data protection. This includes providing consumers with greater transparency and choice over their personal data when using practices that go outside of the normal expectations of consumers. Policy makers globally must hold digital platforms accountable for helping to build a healthy marketplace and for restoring consumer trust and restoring competition. “We need a World Trade Organization 2.0 “, Jim Balsillie, Chair, Centre for International Governance Innovation; Retired Chairman and co-CEO of BlackBerry Jim begins by saying that Data governance is the most important public policy issue of our time. It is cross-cutting with economic, social, and security dimension. It requires both national policy frameworks and international coordination. A specific recommendation he brought forward in this hearing was to create a new institution for like-minded nations to address digital cooperation and stability. “The data driven economies effects cannot be contained within national borders”, he said, “we need new or reformed rules of the road for digitally mediated global commerce, a World Trade Organization 2.0”. He gives the example of Financial Stability Board which was created in the aftermath of the 2008 financial crisis to foster global financial cooperation and stability. He recommends forming a similar global institution, for example, digital stability board, to deal with the challenges posed by digital transformation. 
The nine countries on this committee plus the five other countries attending, totaling 14 could constitute founding members of this board which would undoubtedly grow over time. “Check business models of Silicon Valley giants”, Roger McNamee, Author of Zucked: Waking up to the Facebook Catastrophe Roger begins by saying that it is imperative that this committee and that nations around the world engage in a new thought process relative to the ways of controlling companies in Silicon Valley, especially to look at their business models. By nature these companies invade privacy and undermine democracy. He assures that there is no way to stop that without ending the business practices as they exist. He then commends Sri Lanka who chose to shut down the platforms in response to a terrorist act. He believes that that is the only way governments are going to gain enough leverage in order to have reasonable conversations. He explains more on this in his formal presentation, which took place yesterday. “Stop outsourcing policies to the private sector”, Taylor Owen, McGill University He begins by making five observations about the policy space that we’re in right now. First, self-regulation and even many of the forms of co-regulation that are being discussed have and will continue to prove insufficient for this problem. The financial incentives are simply powerfully aligned against meaningful reform. These are publicly traded largely unregulated companies whose shareholders and directors expect growth by maximizing a revenue model that it is self part of the problem. This growth may or may not be aligned with the public interest. Second, disinformation, hate speech, election interference, privacy breaches, mental health issues and anti-competitive behavior must be treated as symptoms of the problem not its cause. Public policy should therefore focus on the design and the incentives embedded in the design of the platforms themselves. If democratic governments determine that structure and design is leading to negative social and economic outcomes, then it is their responsibility to govern. Third, governments that are taking this problem seriously are converging on a markedly similar platform governance agenda. This agenda recognizes that there are no silver bullets to this broad set of problems and that instead, policies must be domestically implemented and internationally coordinated across three categories: Content policies which seek to address a wide range of both supply and demand issues about the nature amplification and legality of content in our digital public sphere. Data policies which ensure that public data is used for the public good and that citizens have far greater rights over the use, mobility, and monetization of their data. Competition policies which promote free and competitive markets in the digital economy. Fourth, the propensity when discussing this agenda to overcomplicate solutions serves the interests of the status quo. He then recommends sensible policies that could and should be implemented immediately: The online ad micro targeting market could be made radically more transparent and in many cases suspended entirely. Data privacy regimes could be updated to provide far greater rights to individuals and greater oversight and regulatory power to punish abuses. Tax policy can be modernized to better reflect the consumption of digital goods and to crack down on tax base erosion and profit sharing. 
Modernized competition policy can be used to restrict and rollback acquisitions and a separate platform ownership from application and product development. Civic media can be supported as a public good. Large-scale and long term civic literacy and critical thinking efforts can be funded at scale by national governments, not by private organizations. He then asks difficult policy questions for which there are neither easy solutions, meaningful consensus nor appropriate existing international institutions. How we regulate harmful speech in the digital public sphere? He says, that at the moment we've largely outsourced the application of national laws as well as the interpretation of difficult trade-offs between free speech and personal and public harms to the platforms themselves. Companies who seek solutions rightly in their perspective that can be implemented at scale globally. In this case, he argues that what is possible technically and financially for the companies might be insufficient for the goals of the public good or the public policy goals. What is liable for content online? He says that we’ve clearly moved beyond the notion of platform neutrality and absolute safe harbor but what legal mechanisms are best suited to holding platforms, their design, and those that run them accountable. Also, he asks how are we going to bring opaque artificial intelligence systems into our laws and norms and regulations? He concludes saying that these difficult conversation should not be outsourced to the private sector. They need to be led by democratically accountable governments and their citizens. “Make commitments to public service journalism”, Ben Scott, The Center for Internet and Society, Stanford Law School Ben states that technology doesn't cause the problem of data misinformation, and irregulation. It infact accelerates it. This calls for policies to be made to limit the exploitation of these technology tools by malignant actors and by companies that place profits over the public interest. He says, “we have to view our technology problem through the lens of the social problems that we're experiencing.” This is why the problem of political fragmentation or hate speech tribalism and digital media looks different in each countries. It looks different because it feeds on the social unrest, the cultural conflict, and the illiberalism that is native to each society. He says we need to look at problems holistically and understand that social media companies are a part of a system and they don't stand alone as the super villains. The entire media market has bent itself to the performance metrics of Google and Facebook. Television, radio, and print have tortured their content production and distribution strategies to get likes shares and and to appear higher in the Google News search results. And so, he says, we need a comprehensive public policy agenda and put red lines around the illegal content. To limit data collection and exploitation we need to modernize competition policy to reduce the power of monopolies. He also says, that we need to publicly educate people on how to help themselves and how to stop being exploited. We need to make commitments to public service journalism to provide alternatives for people, alternatives to the mindless stream of clickbait to which we have become accustomed. 
“Pay attention to the physical infrastructure”, Professor Heidi Tworek, University of British Columbia Taking inspiration from Germany's vibrant interwar media democracy as it descended into an authoritarian Nazi regime, Heidi lists five brief lessons that she thinks can guide policy discussions in the future. These can enable governments to build robust solutions that can make democracies stronger. Disinformation is also an international relations problem Information warfare has been a feature not a bug of the international system for at least a century. So the question is not if information warfare exists but why and when states engage in it. This happens often when a state feels encircled, weak or aspires to become a greater power than it already is. So if many of the causes of disinformation are geopolitical, we need to remember that many of the solutions will be geopolitical and diplomatic as well, she adds. Pay attention to the physical infrastructure Information warfare disinformation is also enabled by physical infrastructure whether it is the submarine cables a century ago or fiber optic cables today. 95 to 99 percent of international data flows through undersea fiber-optic cables. Google partly owns 8.5 percent of those submarine cables. Content providers also own physical infrastructure She says, Russia and China, for example are surveying European and North American cables. China we know as of investing in 5G but combining that with investments in international news networks. Business models matter more than individual pieces of content Individual harmful content pieces go viral because of the few companies that control the bottleneck of information. Only 29% of Americans or Brits understand that their Facebook newsfeed is algorithmically organized. The most aware are the Finns and there are only 39% of them that understand that. That invisibility can provide social media platforms an enormous amount of power that is not neutral. At a very minimum, she says, we need far more transparency about how algorithms work and whether they are discriminatory. Carefully design robust regulatory institutions She urges governments and the committee to democracy-proof whatever solutions,  come up with. She says, “we need to make sure that we embed civil society or whatever institutions we create.” She suggests an idea of forming social media councils that could meet regularly to actually deal with many such problems. The exact format and the geographical scope are still up for debate but it's an idea supported by many including the UN Special Rapporteur on freedom of expression and opinion, she adds. Address the societal divisions exploited by social media Heidi says, that the seeds of authoritarianism need fertile soil to grow and if we do not attend to the underlying economic and social discontents, better communications cannot obscure those problems forever. “Misinformation is effect of one shared cause, Surveillance Capitalism”, Shoshana Zuboff, Author of The Age of Surveillance Capitalism Shoshana also agrees with the committee about how the themes of platform accountability, data security and privacy, fake news and misinformation are all effects of one shared cause. She identifies this underlying cause as surveillance capitalism and defines  surveillance capitalism as a comprehensive systematic economic logic that is unprecedented. She clarifies that surveillance capitalism is not technology. It is also not a corporation or a group of corporations. 
This is infact a virus that has infected every economic sector from insurance, retail, publishing, finance all the way through to product and service manufacturing and administration all of these sectors. According to her, Surveillance capitalism cannot also be reduced to a person or a group of persons. Infact surveillance capitalism follows the history of market capitalism in the following way - it takes something that exists outside the marketplace and it brings it into the market dynamic for production and sale. It claims private human experience for the market dynamic. Private human experience is repurposed as free raw material which are rendered as behavioral data. Some of these behavioral data are certainly fed back into product and service improvement but the rest are declared of behavioral surplus identified for their rich predictive value. These behavioral surplus flows are then channeled into the new means of production what we call machine intelligence or artificial intelligence. From these come out prediction products. Surveillance capitalists own and control not one text but two. First is the public facing text which is derived from the data that we have provided to these entities. What comes out of these, the prediction products, is the proprietary text, a shadow text from which these companies have amassed high market capitalization and revenue in a very short period of time. These prediction products are then sold into a new kind of marketplace that trades exclusively in human futures. The first name of this marketplace was called online targeted advertising and the human predictions that were sold in those markets were called click-through rates. By now that these markets are no more confined to that kind of marketplace. This new logic of surveillance capitalism is being applied to anything and everything. She promises to discuss on more of this in further sessions. “If you have no facts then you have no truth. If you have no truth you have no trust”, Maria Ressa, Chief Executive Officer and Executive Editor of Rappler Inc. Maria believes that in the end it comes down to the battle for truth and journalists are on the front line of this along with activists. Information is power and if you can make people believe lies, then you can control them. Information can be used for commercial benefits as well as a means to gain geopolitical power. She says,  If you have no facts then you have no truth. If you have no truth you have no trust. She then goes on to introduce a bit about her formal presentation tomorrow saying that she will show exactly how quickly a nation, a democracy can crumble because of information operations. She says she will provide data that shows it is systematic and that it is an erosion of truth and trust.  She thanks the committee saying that what is so interesting about these types of discussions is that the countries that are most affected are democracies that are most vulnerable. Bob Zimmer concluded the meeting saying that the agenda today was to get the conversation going and more of how to make our data world a better place will be continued in further sessions. He said, “as we prepare for the next two days of testimony, it was important for us to have this discussion with those who have been studying these issues for years and have seen firsthand the effect digital platforms can have on our everyday lives. 
The knowledge we have gained tonight will no doubt help guide our committee as we seek solutions and answers to the questions we have on behalf of those we represent. My biggest concerns are for our citizens’ privacy, our democracy and that our rights to freedom of speech are maintained according to our Constitution.” Although, we have covered most of the important conversations, you can watch the full hearing here. Time for data privacy: DuckDuckGo CEO Gabe Weinberg in an interview with Kara Swisher ‘Facial Recognition technology is faulty, racist, biased, abusive to civil rights; act now to restrict misuse’ say experts to House Oversight and Reform Committee. A brief list of drafts bills in US legislation for protecting consumer data privacy

Tinkering with ticks in Matplotlib 2.0

Sugandha Lahoti
13 Dec 2017
6 min read
[box type="note" align="" class="" width=""]This is an excerpt from the book titled Matplotlib 2.x By Example written by Allen Chi Shing Yu, Claire Yik Lok Chung, and Aldrin Kay Yuen Yim,. The book covers basic know-how on how to create and customize plots by Matplotlib. It will help you learn to visualize geographical data on maps and implement interactive charts. [/box] The article talks about how you can manipulate ticks in Matplotlib 2.0. It includes steps to adjust tick spacing, customizing tick formats, trying out the ticker locator and formatter, and rotating tick labels. What are Ticks Ticks are dividers on an axis that help readers locate the coordinates. Tick labels allow estimation of values or, sometimes, labeling of a data series as in bar charts and box plots. Adjusting tick spacing Tick spacing can be adjusted by calling the locator methods: ax.xaxis.set_major_locator(xmajorLocator) ax.xaxis.set_minor_locator(xminorLocator) ax.yaxis.set_major_locator(ymajorLocator) ax.yaxis.set_minor_locator(yminorLocator) Here, ax refers to axes in a Matplotlib figure. Since set_major_locator() or set_minor_locator() cannot be called from the pyplot interface but requires an axis, we call pyplot.gca() to get the current axes. We can also store a figure and axes as variables at initiation, which is especially useful when we want multiple axes. Removing ticks NullLocator: No ticks Drawing ticks in multiples Spacing ticks in multiples of a given number is the most intuitive way. This can be done by using MultipleLocator space ticks in multiples of a given value. Automatic tick settings MaxNLocator: This finds the maximum number of ticks that will display nicely AutoLocator: MaxNLocator with simple defaults AutoMinorLocator: Adds minor ticks uniformly when the axis is linear Setting ticks by the number of data points IndexLocator: Sets ticks by index (x = range(len(y)) Set scaling of ticks by mathematical functions LinearLocator: Linear scale LogLocator: Log scale SymmetricalLogLocator: Symmetrical log scale, log with a range of linearity LogitLocator: Logit scaling Locating ticks by datetime There is a series of locators dedicated to displaying date and time: MinuteLocator: Locate minutes HourLocator: Locate hours DayLocator: Locate days of the month WeekdayLocator: Locate days of the week MonthLocator: Locate months, for example, 8 for August YearLocator: Locate years that in multiples RRuleLocator: Locate using matplotlib.dates.rrulewrapper The rrulewrapper is a simple wrapper around a dateutil.rrule (dateutil) that allows almost arbitrary date tick specifications AutoDateLocator: On autoscale, this class picks the best MultipleDateLocator to set the view limits and the tick locations Customizing tick formats Tick formatters control the style of tick labels. 
Customizing tick formats

Tick formatters control the style of tick labels. They can be called to set the major and minor tick formats on the x and y axes as follows:

ax.xaxis.set_major_formatter(xmajorFormatter)
ax.xaxis.set_minor_formatter(xminorFormatter)
ax.yaxis.set_major_formatter(ymajorFormatter)
ax.yaxis.set_minor_formatter(yminorFormatter)

Removing tick labels

NullFormatter: No tick labels

Fixing labels

FixedFormatter: Labels are set manually

Setting labels with strings

IndexFormatter: Take labels from a list of strings
StrMethodFormatter: Use the string format method

Setting labels with user-defined functions

FuncFormatter: Labels are set by a user-defined function

Formatting axes by numerical values

ScalarFormatter: The format string is automatically selected for scalars by default

The following formatters set values for log axes:

LogFormatter: Basic log axis
LogFormatterExponent: Log axis using exponent = log_base(value)
LogFormatterMathtext: Log axis using exponent = log_base(value) using Math text
LogFormatterSciNotation: Log axis with scientific notation
LogitFormatter: Probability formatter

Trying out the ticker locator and formatter

To demonstrate the ticker locator and formatter, here we use Netflix subscriber data as an example. Business performance is often measured seasonally. Television shows are even more "seasonal". Can we show this better on the timeline?

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

"""
Number of Netflix streaming subscribers from 2012-2017
Data were obtained from Statista on
https://www.statista.com/statistics/250934/quarterly-number-of-netflix-streaming-subscribers-worldwide/
on May 10, 2017. The data were originally published by Netflix in April 2017.
"""

# Prepare the data set
x = range(2011,2018)
y = [26.48,27.56,29.41,33.27,36.32,37.55,40.28,44.35,
     48.36,50.05,53.06,57.39,62.27,65.55,69.17,74.76,81.5,
     83.18,86.74,93.8,98.75]  # quarterly subscriber count in millions

# Plot lines with different line styles
plt.plot(y,'^',label = 'Netflix subscribers',ls='-')

# get current axes and store it to ax
ax = plt.gca()

# set ticks in multiples for both labels
ax.xaxis.set_major_locator(ticker.MultipleLocator(4))   # set major marks every 4 quarters, ie once a year
ax.xaxis.set_minor_locator(ticker.MultipleLocator(1))   # set minor marks for each quarter
ax.yaxis.set_major_locator(ticker.MultipleLocator(10))
ax.yaxis.set_minor_locator(ticker.MultipleLocator(2))

# label the start of each year by FixedFormatter
ax.get_xaxis().set_major_formatter(ticker.FixedFormatter(x))

plt.legend()
plt.show()

From this plot, we see that Netflix had fairly linear subscriber growth from 2012 to 2017. We can tell the seasonal growth better after formatting the x axis in a quarterly manner. In 2016, Netflix was doing better in the latter half of the year. Any TV shows you watched in each season?
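The FixedFormatter call above labels the yearly ticks manually. When the x values are real dates, the date locators and formatters listed earlier can place and label ticks automatically; the short sketch below uses made-up quarterly dates purely to show the mechanics.

from datetime import date
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

# One date per quarter from 2012 onward, mirroring the quarterly span above
dates = [date(2012 + i // 4, 3 * (i % 4) + 1, 1) for i in range(21)]
values = list(range(21))  # placeholder values

fig, ax = plt.subplots()
ax.plot(dates, values, '^-')

ax.xaxis.set_major_locator(mdates.YearLocator())             # one major tick per year
ax.xaxis.set_minor_locator(mdates.MonthLocator(interval=3))  # a minor tick per quarter
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y'))     # label major ticks with the year

plt.show()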
Rotating tick labels

A figure can get too crowded or some tick labels may get skipped when we have too many tick labels or if the label strings are too long. We can solve this by rotating the ticks, for example, by pyplot.xticks(rotation=60):

import matplotlib.pyplot as plt
import numpy as np
import matplotlib as mpl

mpl.style.use('seaborn')

techs = ['Google Adsense','DoubleClick.Net','Facebook Custom Audiences','Google Publisher Tag',
         'App Nexus']
y_pos = np.arange(len(techs))

# Number of websites using the advertising technologies
# Data were quoted from builtwith.com on May 8th 2017
websites = [14409195,1821385,948344,176310,283766]

plt.bar(y_pos, websites, align='center', alpha=0.5)

# set x-axis tick rotation
plt.xticks(y_pos, techs, rotation=25)

plt.ylabel('Live site count')
plt.title('Online advertising technologies usage')
plt.show()

Use pyplot.tight_layout() to avoid image clipping. Using rotated labels can sometimes result in image clipping if you save the figure by pyplot.savefig(). You can call pyplot.tight_layout() before pyplot.savefig() to ensure a complete image output.

We saw how ticks can be adjusted, customized, rotated, and formatted in Matplotlib 2.0 for easy readability, labelling, and estimation of values. To become well-versed with Matplotlib for your day-to-day work, check out the book Matplotlib 2.x By Example.

Optical training of Neural networks is making AI more efficient

Natasha Mathur
20 Jul 2018
3 min read
According to research conducted by T. W. Hughes, M. Minkov, Y. Shi, and S. Fan, artificial neural networks can be directly trained on an optical chip. The research, titled “Training of photonic neural networks through in situ backpropagation and gradient measurement” demonstrates that an optical circuit has all the capabilities to perform the critical functions of an electronics-based artificial neural network. This makes performing complex tasks like speech or image recognition less expensive, faster and more energy efficient. According to research team leader, Shanhui Fan of Stanford University "Using an optical chip to perform neural network computations more efficiently than is possible with digital computers could allow more complex problems to be solved”. During the research, the training step on optical ANNs was performed using a traditional digital computer. The final settings were then imported into the optical circuit. But, according to Optica (the Optical Society journal for high impact research at Stanford),. there is a more direct method for training these networks. This involves making use of an optical analog within the ‘backpropagation' algorithm. Tyler W. Hughes, the first author of the research paper, states that "using a physical device rather than a computer model for training makes the process more accurate”.  He also mentions that “because the training step is a very computationally expensive part of the implementation of the neural network, performing this step optically is key to improving the computational efficiency, speed and power consumption of artificial networks." Neural network processing is usually performed with the help of a traditional computer. But now, for neural network computing, researchers are interested in Optics-based devices as computations performed on these devices use much less energy compared to electronic devices. In New York researchers designed an optical chip that imitates the way, conventional computers train neural networks. This then provides a way of implementing an all-optical neural network. According to Hughes, the ANN is like a black box with a number of knobs. During the training stage, each knob is turned ever so slightly so the system can be tested to see how the algorithm’s performance changes. He says, “Our method not only helps predict which direction to turn the knobs but also how much you should turn each knob to get you closer to the desired performance”. How does the new training protocol work? This new training method uses optical circuits which have tunable beam splitters. You can adjust these spitters by altering the settings of optical phase shifters. First, you feed a laser which is encoded with information that needs to be processed through the optical circuit. Once the laser exits the device, the difference against the expected outcome is calculated. This information that is collected then generates a new light signal through the optical network in the opposite direction. Researchers also showed that neural network performance changes with respect to each beam splitter's setting. You can also change the phase shifter settings based on this information. The whole process is repeated until the desired outcome is produced by the neural network. This training technique has been further tested by researchers using optical simulations. In these tests, the optical implementation performed similarly to a conventional computer. 
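To make the "knob turning" description above concrete, here is a toy numerical analogy in Python. It is not the authors' optical method, just ordinary finite-difference gradient descent on a stand-in error function, but it mirrors the loop the article describes: test how the error responds to each knob, then nudge every knob toward lower error and repeat.

import numpy as np

rng = np.random.default_rng(0)
knobs = rng.normal(size=4)                 # stand-ins for phase-shifter settings
target = np.array([0.2, -0.5, 1.0, 0.3])   # the behaviour we want the "circuit" to produce

def error(settings):
    return float(np.sum((settings - target) ** 2))

lr, eps = 0.1, 1e-4
for step in range(200):
    grad = np.zeros_like(knobs)
    for i in range(knobs.size):            # measure how performance changes per knob
        nudged = knobs.copy()
        nudged[i] += eps
        grad[i] = (error(nudged) - error(knobs)) / eps
    knobs -= lr * grad                     # turn each knob a little toward lower error

print(round(error(knobs), 6))              # close to zero after training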
The researchers are planning to further optimize the system in order to come out with a practical application using a neural network. How Deep Neural Networks can improve Speech Recognition and generation Recurrent neural networks and the LSTM architecture  

Building a Recommendation Engine with Spark

Packt
24 Feb 2016
44 min read
In this article, we will explore individual machine learning models in detail, starting with recommendation engines. Recommendation engines are probably among the best types of machine learning model known to the general public. Even if people do not know exactly what a recommendation engine is, they have most likely experienced one through the use of popular websites such as Amazon, Netflix, YouTube, Twitter, LinkedIn, and Facebook. Recommendations are a core part of all these businesses, and in some cases, they drive significant percentages of their revenue. The idea behind recommendation engines is to predict what people might like and to uncover relationships between items to aid in the discovery process (in this way, it is similar and, in fact, often complementary to search engines, which also play a role in discovery). However, unlike search engines, recommendation engines try to present people with relevant content that they did not necessarily search for or that they might not even have heard of. Typically, a recommendation engine tries to model the connections between users and some type of item. If, for example, we can do a good job of showing our users movies related to a given movie, we can aid discovery and navigation on our site, improving our users' experience and engagement and the relevance of our content to them. However, recommendation engines are not limited to movies, books, or products. The techniques we will explore in this article can be applied to just about any user-to-item relationship as well as user-to-user connections, such as those found on social networks, allowing us to make recommendations such as people you may know or who to follow. Recommendation engines are most effective in two general scenarios (which are not mutually exclusive). They are explained here: Large number of available options for users: When there are a very large number of available items, it becomes increasingly difficult for the user to find something they want. Searching can help when the user knows what they are looking for, but often, the right item might be something previously unknown to them. In this case, being recommended relevant items that the user may not already know about can help them discover new items. A significant degree of personal taste involved: When personal taste plays a large role in selection, recommendation models, which often utilize a wisdom of the crowd approach, can be helpful in discovering items based on the behavior of others that have similar taste profiles. In this article, we will: Introduce the various types of recommendation engines Build a recommendation model using data about user preferences Use the trained model to compute recommendations for a given user as well as to compute similar items for a given item (that is, related items) Apply standard evaluation metrics to the model that we created to measure how well it performs in terms of predictive capability Types of recommendation models Recommender systems are widely studied, and there are many approaches used, but there are two that are probably most prevalent: content-based filtering and collaborative filtering. Recently, other approaches such as ranking models have also gained in popularity. In practice, many approaches are hybrids, incorporating elements of many different methods into a model or combination of models.
Content-based filtering Content-based methods try to use the content or attributes of an item, together with some notion of similarity between two pieces of content, to generate items similar to a given item. These attributes are often textual content (such as titles, names, tags, and other metadata attached to an item), or in the case of media, they could include other features of the item, such as attributes extracted from audio and video content. In a similar manner, user recommendations can be generated based on attributes of users or user profiles, which are then matched to item attributes using the same measure of similarity. For example, a user can be represented by the combined attributes of the items they have interacted with. This becomes their user profile, which is then compared to item attributes to find items that match the user profile. Collaborative filtering Collaborative filtering is a form of wisdom of the crowd approach where the set of preferences of many users with respect to items is used to generate estimated preferences of users for items with which they have not yet interacted. The idea behind this is the notion of similarity. In a user-based approach, if two users have exhibited similar preferences (that is, patterns of interacting with the same items in broadly the same way), then we would assume that they are similar to each other in terms of taste. To generate recommendations for unknown items for a given user, we can use the known preferences of other users that exhibit similar behavior. We can do this by selecting a set of similar users and computing some form of combined score based on the items they have shown a preference for. The overall logic is that if others have tastes similar to a set of items, these items would tend to be good candidates for recommendation. We can also take an item-based approach that computes some measure of similarity between items. This is usually based on the existing user-item preferences or ratings. Items that tend to be rated the same by similar users will be classed as similar under this approach. Once we have these similarities, we can represent a user in terms of the items they have interacted with and find items that are similar to these known items, which we can then recommend to the user. Again, a set of items similar to the known items is used to generate a combined score to estimate for an unknown item. The user- and item-based approaches are usually referred to as nearest-neighbor models, since the estimated scores are computed based on the set of most similar users or items (that is, their neighbors). Finally, there are many model-based methods that attempt to model the user-item preferences themselves so that new preferences can be estimated directly by applying the model to unknown user-item combinations. Matrix factorization Since Spark's recommendation models currently only include an implementation of matrix factorization, we will focus our attention on this class of models. This focus is with good reason; however, these types of models have consistently been shown to perform extremely well in collaborative filtering and were among the best models in well-known competitions such as the Netflix prize. For more information on and a brief overview of the performance of the best algorithms for the Netflix prize, see http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html. 
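Before moving on to factorization models, here is a minimal, self-contained Scala sketch of the item-based nearest-neighbour scoring described above: a user's estimated preference for a candidate item is a similarity-weighted average of the ratings they gave to the items most similar to it. The object, the names (ratingsByUser, itemSims, scoreItem), and the toy numbers are purely illustrative assumptions and are not part of Spark or MLlib.

    object ItemBasedScoringSketch {
      // ratings one user has already given: itemId -> rating
      val ratingsByUser: Map[Int, Double] = Map(1 -> 5.0, 2 -> 3.0, 4 -> 4.0)

      // precomputed item-item similarities: (ratedItem, candidateItem) -> similarity
      val itemSims: Map[(Int, Int), Double] =
        Map((1, 3) -> 0.9, (2, 3) -> 0.4, (4, 3) -> 0.7)

      // similarity-weighted average of the user's ratings on the candidate's neighbours
      def scoreItem(candidate: Int): Double = {
        val neighbours = ratingsByUser.toSeq.flatMap { case (item, rating) =>
          itemSims.get((item, candidate)).map(sim => (sim, rating))
        }
        val (weightedSum, simSum) = neighbours.foldLeft((0.0, 0.0)) {
          case ((ws, ss), (sim, rating)) => (ws + sim * rating, ss + sim)
        }
        if (simSum == 0.0) 0.0 else weightedSum / simSum
      }

      def main(args: Array[String]): Unit =
        println(f"Estimated score for item 3: ${scoreItem(3)}%.3f") // prints 4.250
    }

A real system would compute itemSims from the user-item preference data (for example, with cosine similarity, as done later in this article) and would typically restrict the sum to the k nearest neighbours of the candidate item.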
Explicit matrix factorization When we deal with data that consists of preferences of users that are provided by the users themselves, we refer to explicit preference data. This includes, for example, ratings, thumbs up, likes, and so on that are given by users to items. We can take these ratings and form a two-dimensional matrix with users as rows and items as columns. Each entry represents a rating given by a user to a certain item. Since in most cases, each user has only interacted with a relatively small set of items, this matrix has only a few non-zero entries (that is, it is very sparse). As a simple example, let's assume that we have the following user ratings for a set of movies: Tom, Star Wars, 5 Jane, Titanic, 4 Bill, Batman, 3 Jane, Star Wars, 2 Bill, Titanic, 3 We will form the following ratings matrix: A simple movie-rating matrix Matrix factorization (or matrix completion) attempts to directly model this user-item matrix by representing it as a product of two smaller matrices of lower dimension. Thus, it is a dimensionality-reduction technique. If we have U users and I items, then our user-item matrix is of dimension U x I and might look something like the one shown in the following diagram: A sparse ratings matrix If we want to find a lower dimension (low-rank) approximation to our user-item matrix with the dimension k, we would end up with two matrices: one for users of size U x k and one for items of size I x k. These are known as factor matrices. If we multiply these two factor matrices, we would reconstruct an approximate version of the original ratings matrix. Note that while the original ratings matrix is typically very sparse, each factor matrix is dense, as shown in the following diagram: The user- and item-factor matrices These models are often also called latent feature models, as we are trying to discover some form of hidden features (which are represented by the factor matrices) that account for the structure of behavior inherent in the user-item rating matrix. While the latent features or factors are not directly interpretable, they might, perhaps, represent things such as the tendency of a user to like movies from a certain director, genre, style, or group of actors, for example. As we are directly modeling the user-item matrix, the prediction in these models is relatively straightforward: to compute a predicted rating for a user and item, we compute the vector dot product between the relevant row of the user-factor matrix (that is, the user's factor vector) and the relevant row of the item-factor matrix (that is, the item's factor vector). This is illustrated with the highlighted vectors in the following diagram: Computing recommendations from user- and item-factor vectors To find out the similarity between two items, we can use the same measures of similarity as we would use in the nearest-neighbor models, except that we can use the factor vectors directly by computing the similarity between two item-factor vectors, as illustrated in the following diagram: Computing similarity with item-factor vectors The benefit of factorization models is the relative ease of computing recommendations once the model is created. However, for very large user and itemsets, this can become a challenge as it requires storage and computation across potentially many millions of user- and item-factor vectors. Another advantage, as mentioned earlier, is that they tend to offer very good performance. 
Projects such as Oryx (https://github.com/OryxProject/oryx) and Prediction.io (https://github.com/PredictionIO/PredictionIO) focus on model serving for large-scale models, including recommenders based on matrix factorization. On the down side, factorization models are relatively more complex to understand and interpret compared to nearest-neighbor models and are often more computationally intensive during the model's training phase. Implicit matrix factorization So far, we have dealt with explicit preferences such as ratings. However, much of the preference data that we might be able to collect is implicit feedback, where the preferences between a user and item are not given to us, but are, instead, implied from the interactions they might have with an item. Examples include binary data (such as whether a user viewed a movie, whether they purchased a product, and so on) as well as count data (such as the number of times a user watched a movie). There are many different approaches to deal with implicit data. MLlib implements a particular approach that treats the input rating matrix as two matrices: a binary preference matrix, P, and a matrix of confidence weights, C. For example, let's assume that the user-movie ratings we saw previously were, in fact, the number of times each user had viewed that movie. The two matrices would look something like ones shown in the following screenshot. Here, the matrix P informs us that a movie was viewed by a user, and the matrix C represents the confidence weighting, in the form of the view counts—generally, the more a user has watched a movie, the higher the confidence that they actually like it. Representation of an implicit preference and confidence matrix The implicit model still creates a user- and item-factor matrix. In this case, however, the matrix that the model is attempting to approximate is not the overall ratings matrix but the preference matrix P. If we compute a recommendation by calculating the dot product of a user- and item-factor vector, the score will not be an estimate of a rating directly. It will rather be an estimate of the preference of a user for an item (though not strictly between 0 and 1, these scores will generally be fairly close to a scale of 0 to 1). Alternating least squares Alternating Least Squares (ALS) is an optimization technique to solve matrix factorization problems; this technique is powerful, achieves good performance, and has proven to be relatively easy to implement in a parallel fashion. Hence, it is well suited for platforms such as Spark. At the time of writing this, it is the only recommendation model implemented in MLlib. ALS works by iteratively solving a series of least squares regression problems. In each iteration, one of the user- or item-factor matrices is treated as fixed, while the other one is updated using the fixed factor and the rating data. Then, the factor matrix that was solved for is, in turn, treated as fixed, while the other one is updated. This process continues until the model has converged (or for a fixed number of iterations). Spark's documentation for collaborative filtering contains references to the papers that underlie the ALS algorithms implemented each component of explicit and implicit data. You can view the documentation at http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html. 
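For reference, the quantities described in the last two sections can be written down compactly. The notation below is ours rather than quoted from MLlib's documentation, and MLlib's exact regularization weighting follows the papers cited above (it may scale lambda by the number of ratings per user or item), so treat these as the standard textbook formulations. With \mathbf{x}_u the user-factor vector, \mathbf{y}_i the item-factor vector, and k the rank, the predicted score is the dot product

    \hat{r}_{ui} = \mathbf{x}_u^{\top} \mathbf{y}_i = \sum_{f=1}^{k} x_{uf}\, y_{if}.

Explicit ALS minimizes the regularized squared error over the observed ratings,

    \min_{X,Y} \sum_{(u,i)\,\mathrm{observed}} \left( r_{ui} - \mathbf{x}_u^{\top}\mathbf{y}_i \right)^2 + \lambda \left( \sum_u \lVert \mathbf{x}_u \rVert^2 + \sum_i \lVert \mathbf{y}_i \rVert^2 \right),

while implicit ALS fits the binary preference matrix P under confidence weights derived from the interaction counts,

    \min_{X,Y} \sum_{u,i} c_{ui} \left( p_{ui} - \mathbf{x}_u^{\top}\mathbf{y}_i \right)^2 + \lambda \left( \sum_u \lVert \mathbf{x}_u \rVert^2 + \sum_i \lVert \mathbf{y}_i \rVert^2 \right), \qquad p_{ui} = \mathbb{1}[r_{ui} > 0], \quad c_{ui} = 1 + \alpha\, r_{ui}.

Each ALS iteration holds one factor matrix fixed, which turns the update for every row of the other matrix into an independent least squares problem; this independence is what makes the algorithm easy to parallelize on a platform such as Spark.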
Extracting the right features from your data In this section, we will use explicit rating data, without additional user or item metadata or other information related to the user-item interactions. Hence, the features that we need as inputs are simply the user IDs, movie IDs, and the ratings assigned to each user and movie pair. Extracting features from the MovieLens 100k dataset Start the Spark shell in the Spark base directory, ensuring that you provide enough memory via the --driver-memory option: >./bin/spark-shell --driver-memory 4g In this example, we will use the same MovieLens dataset. Use the directory in which you placed the MovieLens 100k dataset as the input path in the following code. First, let's inspect the raw ratings dataset: val rawData = sc.textFile("/PATH/ml-100k/u.data") rawData.first() You will see output similar to these lines of code: 14/03/30 11:42:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 14/03/30 11:42:41 WARN LoadSnappy: Snappy native library not loaded 14/03/30 11:42:41 INFO FileInputFormat: Total input paths to process : 1 14/03/30 11:42:41 INFO SparkContext: Starting job: first at <console>:15 14/03/30 11:42:41 INFO DAGScheduler: Got job 0 (first at <console>:15) with 1 output partitions (allowLocal=true) 14/03/30 11:42:41 INFO DAGScheduler: Final stage: Stage 0 (first at <console>:15) 14/03/30 11:42:41 INFO DAGScheduler: Parents of final stage: List() 14/03/30 11:42:41 INFO DAGScheduler: Missing parents: List() 14/03/30 11:42:41 INFO DAGScheduler: Computing the requested partition locally 14/03/30 11:42:41 INFO HadoopRDD: Input split: file:/Users/Nick/workspace/datasets/ml-100k/u.data:0+1979173 14/03/30 11:42:41 INFO SparkContext: Job finished: first at <console>:15, took 0.030533 s res0: String = 196  242  3  881250949 Recall that this dataset consists of the user id, movie id, rating, and timestamp fields separated by a tab ("\t") character. We don't need the time when the rating was made to train our model, so let's simply extract the first three fields: val rawRatings = rawData.map(_.split("\t").take(3)) We will first split each record on the "\t" character, which gives us an Array[String] array. We will then use Scala's take function to keep only the first 3 elements of the array, which correspond to user id, movie id, and rating, respectively. We can inspect the first record of our new RDD by calling rawRatings.first(), which collects just the first record of the RDD back to the driver program. This will result in the following output: 14/03/30 12:24:00 INFO SparkContext: Starting job: first at <console>:21 14/03/30 12:24:00 INFO DAGScheduler: Got job 1 (first at <console>:21) with 1 output partitions (allowLocal=true) 14/03/30 12:24:00 INFO DAGScheduler: Final stage: Stage 1 (first at <console>:21) 14/03/30 12:24:00 INFO DAGScheduler: Parents of final stage: List() 14/03/30 12:24:00 INFO DAGScheduler: Missing parents: List() 14/03/30 12:24:00 INFO DAGScheduler: Computing the requested partition locally 14/03/30 12:24:00 INFO HadoopRDD: Input split: file:/Users/Nick/workspace/datasets/ml-100k/u.data:0+1979173 14/03/30 12:24:00 INFO SparkContext: Job finished: first at <console>:21, took 0.00391 s res6: Array[String] = Array(196, 242, 3) We will use Spark's MLlib library to train our model. Let's take a look at what methods are available for us to use and what input is required.
First, import the ALS model from MLlib: import org.apache.spark.mllib.recommendation.ALS On the console, we can inspect the available methods on the ALS object using tab completion. Type in ALS. (note the dot) and then press the Tab key. You should see the autocompletion of the methods: ALS. asInstanceOf    isInstanceOf    main            toString        train           trainImplicit The method we want to use is train. If we type ALS.train and hit Enter, we will get an error. However, this error will tell us what the method signature looks like: ALS.train <console>:12: error: ambiguous reference to overloaded definition, both method train in object ALS of type (ratings: org.apache.spark.rdd.RDD[org.apache.spark.mllib.recommendation.Rating], rank: Int, iterations: Int)org.apache.spark.mllib.recommendation.MatrixFactorizationModel and  method train in object ALS of type (ratings: org.apache.spark.rdd.RDD[org.apache.spark.mllib.recommendation.Rating], rank: Int, iterations: Int, lambda: Double)org.apache.spark.mllib.recommendation.MatrixFactorizationModel match expected type ?               ALS.train                   ^ So, we can see that at a minimum, we need to provide the input arguments, ratings, rank, and iterations. The second method also requires an argument called lambda. We'll cover these three shortly, but let's take a look at the ratings argument. First, let's import the Rating class that it references and use a similar approach to find out what an instance of Rating requires, by typing in Rating() and hitting Enter: import org.apache.spark.mllib.recommendation.Rating Rating() <console>:13: error: not enough arguments for method apply: (user: Int, product: Int, rating: Double)org.apache.spark.mllib.recommendation.Rating in object Rating. Unspecified value parameters user, product, rating.               Rating()                     ^ As we can see from the preceding output, we need to provide the ALS model with an RDD that consists of Rating records. A Rating class, in turn, is just a wrapper around user id, movie id (called product here), and the actual rating arguments. We'll create our rating dataset using the map method and transforming the array of IDs and ratings into a Rating object: val ratings = rawRatings.map { case Array(user, movie, rating) => Rating(user.toInt, movie.toInt, rating.toDouble) } Notice that we need to use toInt or toDouble to convert the raw rating data (which was extracted as Strings from the text file) to Int or Double numeric inputs. Also, note the use of a case statement that allows us to extract the relevant variable names and use them directly (this saves us from having to use something like val user = ratings(0)). For more on Scala case statements and pattern matching as used here, take a look at http://docs.scala-lang.org/tutorials/tour/pattern-matching.html. 
We now have an RDD[Rating] that we can verify by calling: ratings.first() 14/03/30 12:32:48 INFO SparkContext: Starting job: first at <console>:24 14/03/30 12:32:48 INFO DAGScheduler: Got job 2 (first at <console>:24) with 1 output partitions (allowLocal=true) 14/03/30 12:32:48 INFO DAGScheduler: Final stage: Stage 2 (first at <console>:24) 14/03/30 12:32:48 INFO DAGScheduler: Parents of final stage: List() 14/03/30 12:32:48 INFO DAGScheduler: Missing parents: List() 14/03/30 12:32:48 INFO DAGScheduler: Computing the requested partition locally 14/03/30 12:32:48 INFO HadoopRDD: Input split: file:/Users/Nick/workspace/datasets/ml-100k/u.data:0+1979173 14/03/30 12:32:48 INFO SparkContext: Job finished: first at <console>:24, took 0.003752 s res8: org.apache.spark.mllib.recommendation.Rating = Rating(196,242,3.0) Training the recommendation model Once we have extracted these simple features from our raw data, we are ready to proceed with model training; MLlib takes care of this for us. All we have to do is provide the correctly-parsed input RDD we just created as well as our chosen model parameters. Training a model on the MovieLens 100k dataset We're now ready to train our model! The other inputs required for our model are as follows: rank: This refers to the number of factors in our ALS model, that is, the number of hidden features in our low-rank approximation matrices. Generally, the greater the number of factors, the better, but this has a direct impact on memory usage, both for computation and to store models for serving, particularly for large number of users or items. Hence, this is often a trade-off in real-world use cases. A rank in the range of 10 to 200 is usually reasonable. iterations: This refers to the number of iterations to run. While each iteration in ALS is guaranteed to decrease the reconstruction error of the ratings matrix, ALS models will converge to a reasonably good solution after relatively few iterations. So, we don't need to run for too many iterations in most cases (around 10 is often a good default). lambda: This parameter controls the regularization of our model. Thus, lambda controls over fitting. The higher the value of lambda, the more is the regularization applied. What constitutes a sensible value is very dependent on the size, nature, and sparsity of the underlying data, and as with almost all machine learning models, the regularization parameter is something that should be tuned using out-of-sample test data and cross-validation approaches. We'll use rank of 50, 10 iterations, and a lambda parameter of 0.01 to illustrate how to train our model: val model = ALS.train(ratings, 50, 10, 0.01) This returns a MatrixFactorizationModel object, which contains the user and item factors in the form of an RDD of (id, factor) pairs. These are called userFeatures and productFeatures, respectively. For example: model.userFeatures res14: org.apache.spark.rdd.RDD[(Int, Array[Double])] = FlatMappedRDD[659] at flatMap at ALS.scala:231 We can see that the factors are in the form of an Array[Double]. Note that the operations used in MLlib's ALS implementation are lazy transformations, so the actual computation will only be performed once we call some sort of action on the resulting RDDs of the user and item factors. 
We can force the computation using a Spark action such as count: model.userFeatures.count This will trigger the computation, and we will see a quite a bit of output text similar to the following lines of code: 14/03/30 13:10:40 INFO SparkContext: Starting job: count at <console>:26 14/03/30 13:10:40 INFO DAGScheduler: Registering RDD 665 (map at ALS.scala:147) 14/03/30 13:10:40 INFO DAGScheduler: Registering RDD 664 (map at ALS.scala:146) 14/03/30 13:10:40 INFO DAGScheduler: Registering RDD 674 (mapPartitionsWithIndex at ALS.scala:164) ... 14/03/30 13:10:45 INFO SparkContext: Job finished: count at <console>:26, took 5.068255 s res16: Long = 943 If we call count for the movie factors, we will see the following output: model.productFeatures.count 14/03/30 13:15:21 INFO SparkContext: Starting job: count at <console>:26 14/03/30 13:15:21 INFO DAGScheduler: Got job 10 (count at <console>:26) with 1 output partitions (allowLocal=false) 14/03/30 13:15:21 INFO DAGScheduler: Final stage: Stage 165 (count at <console>:26) 14/03/30 13:15:21 INFO DAGScheduler: Parents of final stage: List(Stage 169, Stage 166) 14/03/30 13:15:21 INFO DAGScheduler: Missing parents: List() 14/03/30 13:15:21 INFO DAGScheduler: Submitting Stage 165 (FlatMappedRDD[883] at flatMap at ALS.scala:231), which has no missing parents 14/03/30 13:15:21 INFO DAGScheduler: Submitting 1 missing tasks from Stage 165 (FlatMappedRDD[883] at flatMap at ALS.scala:231) ... 14/03/30 13:15:21 INFO SparkContext: Job finished: count at <console>:26, took 0.030044 s res21: Long = 1682 As expected, we have a factor array for each user (943 factors) and movie (1682 factors). Training a model using implicit feedback data The standard matrix factorization approach in MLlib deals with explicit ratings. To work with implicit data, you can use the trainImplicit method. It is called in a manner similar to the standard train method. There is an additional parameter, alpha, that can be set (and in the same way, the regularization parameter, lambda, should be selected via testing and cross-validation methods). The alpha parameter controls the baseline level of confidence weighting applied. A higher level of alpha tends to make the model more confident about the fact that missing data equates to no preference for the relevant user-item pair. As an exercise, try to take the existing MovieLens dataset and convert it into an implicit dataset. One possible approach is to convert it to binary feedback (0s and 1s) by applying a threshold on the ratings at some level. Another approach could be to convert the ratings' values into confidence weights (for example, perhaps, low ratings could imply zero weights, or even negative weights, which are supported by MLlib's implementation). Train a model on this dataset and compare the results of the following section with those generated by your implicit model. Using the recommendation model Now that we have our trained model, we're ready to use it to make predictions. These predictions typically take one of two forms: recommendations for a given user and related or similar items for a given item. User recommendations In this case, we would like to generate recommended items for a given user. This usually takes the form of a top-K list, that is, the K items that our model predicts will have the highest probability of the user liking them. This is done by computing the predicted score for each item and ranking the list based on this score. The exact method to perform this computation depends on the model involved. 
For example, in user-based approaches, the ratings of similar users on items are used to compute the recommendations for a user, while in an item-based approach, the computation is based on the similarity of items the user has rated to the candidate items. In matrix factorization, because we are modeling the ratings matrix directly, the predicted score can be computed as the vector dot product between a user-factor vector and an item-factor vector. Generating movie recommendations from the MovieLens 100k dataset As MLlib's recommendation model is based on matrix factorization, we can use the factor matrices computed by our model to compute predicted scores (or ratings) for a user. We will focus on the explicit rating case using MovieLens data; however, the approach is the same when using the implicit model. The MatrixFactorizationModel class has a convenient predict method that will compute a predicted score for a given user and item combination: val predictedRating = model.predict(789, 123) 14/03/30 16:10:10 INFO SparkContext: Starting job: lookup at MatrixFactorizationModel.scala:45 14/03/30 16:10:10 INFO DAGScheduler: Got job 30 (lookup at MatrixFactorizationModel.scala:45) with 1 output partitions (allowLocal=false) ... 14/03/30 16:10:10 INFO SparkContext: Job finished: lookup at MatrixFactorizationModel.scala:46, took 0.023077 s predictedRating: Double = 3.128545693368485 As we can see, this model predicts a rating of 3.12 for user 789 and movie 123. Note that you might see different results than those shown in this section because the ALS model is initialized randomly. So, different runs of the model will lead to different solutions. The predict method can also take an RDD of (user, item) IDs as the input and will generate predictions for each of these. We can use this method to make predictions for many users and items at the same time. To generate the top-K recommended items for a user, MatrixFactorizationModel provides a convenience method called recommendProducts. This takes two arguments: user and num, where user is the user ID, and num is the number of items to recommend. It returns the top num items ranked in the order of the predicted score. Here, the scores are computed as the dot product between the user-factor vector and each item-factor vector. Let's generate the top 10 recommended items for user 789: val userId = 789 val K = 10 val topKRecs = model.recommendProducts(userId, K) We now have a set of predicted ratings for each movie for user 789. If we print this out, we can inspect the top 10 recommendations for this user: println(topKRecs.mkString("\n")) You should see the following output on your console: Rating(789,715,5.931851273771102) Rating(789,12,5.582301095666215) Rating(789,959,5.516272981542168) Rating(789,42,5.458065302395629) Rating(789,584,5.449949837103569) Rating(789,750,5.348768847643657) Rating(789,663,5.30832117499004) Rating(789,134,5.278933936827717) Rating(789,156,5.250959077906759) Rating(789,432,5.169863417126231) Inspecting the recommendations We can give these recommendations a sense check by taking a quick look at the titles of the movies a user has rated and the recommended movies. First, we need to load the movie data.
We'll collect this data as a Map[Int, String], mapping the movie ID to the title: val movies = sc.textFile("/PATH/ml-100k/u.item") val titles = movies.map(line => line.split("\\|").take(2)).map(array => (array(0).toInt,  array(1))).collectAsMap() titles(123) res68: String = Frighteners, The (1996) For our user 789, we can find out what movies they have rated, take the 10 movies with the highest rating, and then check the titles. We will do this now by first using the keyBy Spark function to create an RDD of key-value pairs from our ratings RDD, where the key will be the user ID. We will then use the lookup function to return just the ratings for this key (that is, that particular user ID) to the driver: val moviesForUser = ratings.keyBy(_.user).lookup(789) Let's see how many movies this user has rated. This will be the size of the moviesForUser collection: println(moviesForUser.size) We will see that this user has rated 33 movies. Next, we will take the 10 movies with the highest ratings by sorting the moviesForUser collection using the rating field of the Rating object. We will then extract the movie title for the relevant product ID attached to the Rating class from our mapping of movie titles and print out the top 10 titles with their ratings: moviesForUser.sortBy(-_.rating).take(10).map(rating => (titles(rating.product), rating.rating)).foreach(println) You will see the following output displayed: (Godfather, The (1972),5.0) (Trainspotting (1996),5.0) (Dead Man Walking (1995),5.0) (Star Wars (1977),5.0) (Swingers (1996),5.0) (Leaving Las Vegas (1995),5.0) (Bound (1996),5.0) (Fargo (1996),5.0) (Last Supper, The (1995),5.0) (Private Parts (1997),4.0) Now, let's take a look at the top 10 recommendations for this user and see what the titles are using the same approach as the one we used earlier (note that the recommendations are already sorted): topKRecs.map(rating => (titles(rating.product), rating.rating)).foreach(println) (To Die For (1995),5.931851273771102) (Usual Suspects, The (1995),5.582301095666215) (Dazed and Confused (1993),5.516272981542168) (Clerks (1994),5.458065302395629) (Secret Garden, The (1993),5.449949837103569) (Amistad (1997),5.348768847643657) (Being There (1979),5.30832117499004) (Citizen Kane (1941),5.278933936827717) (Reservoir Dogs (1992),5.250959077906759) (Fantasia (1940),5.169863417126231) We leave it to you to decide whether these recommendations make sense. Item recommendations Item recommendations are about answering the following question: for a certain item, what are the items most similar to it? Here, the precise definition of similarity is dependent on the model involved. In most cases, similarity is computed by comparing the vector representation of two items using some similarity measure. Common similarity measures include Pearson correlation and cosine similarity for real-valued vectors and Jaccard similarity for binary vectors. Generating similar movies for the MovieLens 100K dataset The current MatrixFactorizationModel API does not directly support item-to-item similarity computations. Therefore, we will need to create our own code to do this. We will use the cosine similarity metric, and we will use the jblas linear algebra library (a dependency of MLlib) to compute the required vector dot products. This is similar to how the existing predict and recommendProducts methods work, except that we will use cosine similarity as opposed to just the dot product.
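Written out (in our own notation rather than taken from the text), the measure we are about to implement for two factor vectors \mathbf{a} and \mathbf{b} is

    \mathrm{cosineSimilarity}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert_2 \, \lVert \mathbf{b} \rVert_2},

that is, the dot product normalized by the product of the two L2 norms.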
We would like to compare the factor vector of our chosen item with each of the other items, using our similarity metric. In order to perform linear algebra computations, we will first need to create a vector object out of the factor vectors, which are in the form of an Array[Double]. The JBLAS class, DoubleMatrix, takes an Array[Double] as the constructor argument as follows: import org.jblas.DoubleMatrix val aMatrix = new DoubleMatrix(Array(1.0, 2.0, 3.0)) aMatrix: org.jblas.DoubleMatrix = [1.000000; 2.000000; 3.000000] Note that using jblas, vectors are represented as a one-dimensional DoubleMatrix class, while matrices are a two-dimensional DoubleMatrix class. We will need a method to compute the cosine similarity between two vectors. Cosine similarity is a measure of the angle between two vectors in an n-dimensional space. It is computed by first calculating the dot product between the vectors and then dividing the result by a denominator, which is the norm (or length) of each vector multiplied together (specifically, the L2-norm is used in cosine similarity). In this way, cosine similarity is a normalized dot product. The cosine similarity measure takes on values between -1 and 1. A value of 1 implies completely similar, while a value of 0 implies independence (that is, no similarity). This measure is useful because it also captures negative similarity, that is, a value of -1 implies that not only are the vectors not similar, but they are also completely dissimilar. Let's create our cosineSimilarity function here: def cosineSimilarity(vec1: DoubleMatrix, vec2: DoubleMatrix): Double = {   vec1.dot(vec2) / (vec1.norm2() * vec2.norm2()) } Note that we defined a return type for this function of Double. We are not required to do this, since Scala features type inference. However, it can often be useful to document return types for Scala functions. Let's try it out on one of our item factors for item 567. We will need to collect an item factor from our model; we will do this using the lookup method in a similar way that we did earlier to collect the ratings for a specific user. In the following lines of code, we also use the head function, since lookup returns an array of values, and we only need the first value (in fact, there will only be one value, which is the factor vector for this item). Since this will be an Array[Double], we will then need to create a DoubleMatrix object from it and compute the cosine similarity with itself: val itemId = 567 val itemFactor = model.productFeatures.lookup(itemId).head val itemVector = new DoubleMatrix(itemFactor) cosineSimilarity(itemVector, itemVector) A similarity metric should measure how close, in some sense, two vectors are to each other. 
Here, we can see that our cosine similarity metric tells us that this item vector is identical to itself, which is what we would expect: res113: Double = 1.0 Now, we are ready to apply our similarity metric to each item: val sims = model.productFeatures.map{ case (id, factor) =>  val factorVector = new DoubleMatrix(factor)   val sim = cosineSimilarity(factorVector, itemVector)   (id, sim) } Next, we can compute the top 10 most similar items by sorting the similarity score for each item: // recall we defined K = 10 earlier val sortedSims = sims.top(K)(Ordering.by[(Int, Double), Double] { case (id, similarity) => similarity }) In the preceding code snippet, we used Spark's top function, which is an efficient way to compute top-K results in a distributed fashion, instead of using collect to return all the data to the driver and sorting it locally (remember that we could be dealing with millions of users and items in the case of recommendation models). We need to tell Spark how to sort the (item id, similarity score) pairs in the sims RDD. To do this, we will pass an extra argument to top, which is a Scala Ordering object that tells Spark that it should sort by the value in the key-value pair (that is, sort by similarity). Finally, we can print the 10 items with the highest computed similarity metric to our given item: println(sortedSims.take(10).mkString("\n")) You will see output like the following one: (567,1.0000000000000002) (1471,0.6932331537649621) (670,0.6898690594544726) (201,0.6897964975027041) (343,0.6891221044611473) (563,0.6864214133620066) (294,0.6812075443259535) (413,0.6754663844488256) (184,0.6702643811753909) (109,0.6594872765176396) Not surprisingly, we can see that the top-ranked similar item is our item. The rest are the other items in our set of items, ranked in order of our similarity metric. Inspecting the similar items Let's see what the title of our chosen movie is: println(titles(itemId)) Wes Craven's New Nightmare (1994) As we did for user recommendations, we can sense check our item-to-item similarity computations and take a look at the titles of the most similar movies. This time, we will take the top 11 so that we can exclude our given movie. So, we will take the numbers 1 to 11 in the list: val sortedSims2 = sims.top(K + 1)(Ordering.by[(Int, Double), Double] { case (id, similarity) => similarity }) sortedSims2.slice(1, 11).map{ case (id, sim) => (titles(id), sim) }.mkString("\n") You will see the movie titles and scores displayed similar to this output: (Hideaway (1995),0.6932331537649621) (Body Snatchers (1993),0.6898690594544726) (Evil Dead II (1987),0.6897964975027041) (Alien: Resurrection (1997),0.6891221044611473) (Stephen King's The Langoliers (1995),0.6864214133620066) (Liar Liar (1997),0.6812075443259535) (Tales from the Crypt Presents: Bordello of Blood (1996),0.6754663844488256) (Army of Darkness (1993),0.6702643811753909) (Mystery Science Theater 3000: The Movie (1996),0.6594872765176396) (Scream (1996),0.6538249646863378) Once again, note that you might see quite different results due to random model initialization. Now that you have computed similar items using cosine similarity, see if you can do the same with the user-factor vectors to compute similar users for a given user. Evaluating the performance of recommendation models How do we know whether the model we have trained is a good model? We need to be able to evaluate its predictive performance in some way. Evaluation metrics are measures of a model's predictive capability or accuracy.
Some are direct measures of how well a model predicts the model's target variable (such as Mean Squared Error), while others are concerned with how well the model performs at predicting things that might not be directly optimized in the model but are often closer to what we care about in the real world (such as Mean average precision). Evaluation metrics provide a standardized way of comparing the performance of the same model with different parameter settings and of comparing performance across different models. Using these metrics, we can perform model selection to choose the best-performing model from the set of models we wish to evaluate. Here, we will show you how to calculate two common evaluation metrics used in recommender systems and collaborative filtering models: Mean Squared Error and Mean average precision at K. Mean Squared Error The Mean Squared Error (MSE) is a direct measure of the reconstruction error of the user-item rating matrix. It is also the objective function being minimized in certain models, specifically many matrix-factorization techniques, including ALS. As such, it is commonly used in explicit ratings settings. It is defined as the sum of the squared errors divided by the number of observations. The squared error, in turn, is the square of the difference between the predicted rating for a given user-item pair and the actual rating. We will use our user 789 as an example. Let's take the first rating for this user from the moviesForUser set of Ratings that we previously computed: val actualRating = moviesForUser.take(1)(0) actualRating: org.apache.spark.mllib.recommendation.Rating = Rating(789,1012,4.0) We will see that the rating for this user-item combination is 4. Next, we will compute the model's predicted rating: val predictedRating = model.predict(789, actualRating.product) ... 14/04/13 13:01:15 INFO SparkContext: Job finished: lookup at MatrixFactorizationModel.scala:46, took 0.025404 s predictedRating: Double = 4.001005374200248 We will see that the predicted rating is about 4, very close to the actual rating. Finally, we will compute the squared error between the actual rating and the predicted rating: val squaredError = math.pow(predictedRating - actualRating.rating, 2.0) squaredError: Double = 1.010777282523947E-6 So, in order to compute the overall MSE for the dataset, we need to compute this squared error for each (user, movie, actual rating, predicted rating) entry, sum them up, and divide them by the number of ratings. We will do this in the following code snippet. Note the following code is adapted from the Apache Spark programming guide for ALS at http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html. First, we will extract the user and product IDs from the ratings RDD and make predictions for each user-item pair using model.predict. We will use the user-item pair as the key and the predicted rating as the value: val usersProducts = ratings.map{ case Rating(user, product, rating)  => (user, product)} val predictions = model.predict(usersProducts).map{     case Rating(user, product, rating) => ((user, product), rating) } Next, we extract the actual ratings and also map the ratings RDD so that the user-item pair is the key and the actual rating is the value. 
Now that we have two RDDs with the same form of key, we can join them together to create a new RDD with the actual and predicted ratings for each user-item combination: val ratingsAndPredictions = ratings.map{   case Rating(user, product, rating) => ((user, product), rating) }.join(predictions) Finally, we will compute the MSE by summing up the squared errors using reduce and dividing by the count method of the number of records: val MSE = ratingsAndPredictions.map{     case ((user, product), (actual, predicted)) =>  math.pow((actual - predicted), 2) }.reduce(_ + _) / ratingsAndPredictions.count println("Mean Squared Error = " + MSE) Mean Squared Error = 0.08231947642632852 It is common to use the Root Mean Squared Error (RMSE), which is just the square root of the MSE metric. This is somewhat more interpretable, as it is in the same units as the underlying data (that is, the ratings in this case). It is equivalent to the standard deviation of the differences between the predicted and actual ratings. We can compute it simply as follows: val RMSE = math.sqrt(MSE) println("Root Mean Squared Error = " + RMSE) Root Mean Squared Error = 0.2869137090247319 Mean average precision at K Mean average precision at K (MAPK) is the mean of the average precision at K (APK) metric across all instances in the dataset. APK is a metric commonly used in information retrieval. APK is a measure of the average relevance scores of a set of the top-K documents presented in response to a query. For each query instance, we will compare the set of top-K results with the set of actual relevant documents (that is, a ground truth set of relevant documents for the query). In the APK metric, the order of the result set matters, in that, the APK score would be higher if the result documents are both relevant and the relevant documents are presented higher in the results. It is, thus, a good metric for recommender systems in that typically we would compute the top-K recommended items for each user and present these to the user. Of course, we prefer models where the items with the highest predicted scores (which are presented at the top of the list of recommendations) are, in fact, the most relevant items for the user. APK and other ranking-based metrics are also more appropriate evaluation measures for implicit datasets; here, MSE makes less sense. In order to evaluate our model, we can use APK, where each user is the equivalent of a query, and the set of top-K recommended items is the document result set. The relevant documents (that is, the ground truth) in this case, is the set of items that a user interacted with. Hence, APK attempts to measure how good our model is at predicting items that a user will find relevant and choose to interact with. The code for the following average precision computation is based on https://github.com/benhamner/Metrics.  More information on MAPK can be found at https://www.kaggle.com/wiki/MeanAveragePrecision. 
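Before looking at the implementation, it may help to state the metrics of this section as formulas (our notation, not quoted from the sources above). With N rated user-item pairs, actual ratings r_{ui}, and predictions \hat{r}_{ui},

    \mathrm{MSE} = \frac{1}{N} \sum_{(u,i)} \left( r_{ui} - \hat{r}_{ui} \right)^2, \qquad \mathrm{RMSE} = \sqrt{\mathrm{MSE}}.

For a single user with relevant item set A and ranked predictions p_1, p_2, \ldots, the function below computes

    \mathrm{AP@}K = \frac{1}{\min(\lvert A \rvert, K)} \sum_{j=1}^{K} \frac{\mathrm{hits}(j)}{j} \,\mathbb{1}[p_j \in A],

where hits(j) is the number of relevant items among the first j predictions (with the convention that the score is 1.0 when A is empty, matching the code). MAP@K is then simply the mean of AP@K over all users.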
Our function to compute the APK is shown here: def avgPrecisionK(actual: Seq[Int], predicted: Seq[Int], k: Int): Double = {   val predK = predicted.take(k)   var score = 0.0   var numHits = 0.0   for ((p, i) <- predK.zipWithIndex) {     if (actual.contains(p)) {       numHits += 1.0       score += numHits / (i.toDouble + 1.0)     }   }   if (actual.isEmpty) {     1.0   } else {     score / scala.math.min(actual.size, k).toDouble   } } As you can see, this takes as input a list of actual item IDs that are associated with the user and another list of predicted ids so that our estimate will be relevant for the user. We can compute the APK metric for our example user 789 as follows. First, we will extract the actual movie IDs for the user: val actualMovies = moviesForUser.map(_.product) actualMovies: Seq[Int] = ArrayBuffer(1012, 127, 475, 93, 1161, 286, 293, 9, 50, 294, 181, 1, 1008, 508, 284, 1017, 137, 111, 742, 248, 249, 1007, 591, 150, 276, 151, 129, 100, 741, 288, 762, 628, 124) We will then use the movie recommendations we made previously to compute the APK score using K = 10: val predictedMovies = topKRecs.map(_.product) predictedMovies: Array[Int] = Array(27, 497, 633, 827, 602, 849, 401, 584, 1035, 1014) val apk10 = avgPrecisionK(actualMovies, predictedMovies, 10) apk10: Double = 0.0 In this case, we can see that our model is not doing a very good job of predicting relevant movies for this user as the APK score is 0. In order to compute the APK for each user and average them to compute the overall MAPK, we will need to generate the list of recommendations for each user in our dataset. While this can be fairly intensive on a large scale, we can distribute the computation using our Spark functionality. However, one limitation is that each worker must have the full item-factor matrix available so that it can compute the dot product between the relevant user vector and all item vectors. This can be a problem when the number of items is extremely high as the item matrix must fit in the memory of one machine. There is actually no easy way around this limitation. One possible approach is to only compute recommendations for a subset of items from the total item set, using approximate techniques such as Locality Sensitive Hashing (http://en.wikipedia.org/wiki/Locality-sensitive_hashing). We will now see how to go about this. First, we will collect the item factors and form a DoubleMatrix object from them: val itemFactors = model.productFeatures.map { case (id, factor) => factor }.collect() val itemMatrix = new DoubleMatrix(itemFactors) println(itemMatrix.rows, itemMatrix.columns) (1682,50) This gives us a matrix with 1682 rows and 50 columns, as we would expect from 1682 movies with a factor dimension of 50. Next, we will distribute the item matrix as a broadcast variable so that it is available on each worker node: val imBroadcast = sc.broadcast(itemMatrix) 14/04/13 21:02:01 INFO MemoryStore: ensureFreeSpace(672960) called with curMem=4006896, maxMem=311387750 14/04/13 21:02:01 INFO MemoryStore: Block broadcast_21 stored as values to memory (estimated size 657.2 KB, free 292.5 MB) imBroadcast: org.apache.spark.broadcast.Broadcast[org.jblas.DoubleMatrix] = Broadcast(21) Now we are ready to compute the recommendations for each user. We will do this by applying a map function to each user factor within which we will perform a matrix multiplication between the user-factor vector and the movie-factor matrix. 
The result is a vector (of length 1682, that is, the number of movies we have) with the predicted rating for each movie. We will then sort these predictions by the predicted rating: val allRecs = model.userFeatures.map{ case (userId, array) =>   val userVector = new DoubleMatrix(array)   val scores = imBroadcast.value.mmul(userVector)   val sortedWithId = scores.data.zipWithIndex.sortBy(-_._1)   val recommendedIds = sortedWithId.map(_._2 + 1).toSeq   (userId, recommendedIds) } allRecs: org.apache.spark.rdd.RDD[(Int, Seq[Int])] = MappedRDD[269] at map at <console>:29 As we can see, we now have an RDD that contains a list of movie IDs for each user ID. These movie IDs are sorted in order of the estimated rating. Note that we needed to add 1 to the returned movie ids (as highlighted in the preceding code snippet), as the item-factor matrix is 0-indexed, while our movie IDs start at 1. We also need the list of movie IDs for each user to pass into our APK function as the actual argument. We already have the ratings RDD ready, so we can extract just the user and movie IDs from it. If we use Spark's groupBy operator, we will get an RDD that contains a list of (userid, movieid) pairs for each user ID (as the user ID is the key on which we perform the groupBy operation): val userMovies = ratings.map{ case Rating(user, product, rating) => (user, product) }.groupBy(_._1) userMovies: org.apache.spark.rdd.RDD[(Int, Seq[(Int, Int)])] = MapPartitionsRDD[277] at groupBy at <console>:21 Finally, we can use Spark's join operator to join these two RDDs together on the user ID key. Then, for each user, we have the list of actual and predicted movie IDs that we can pass to our APK function. In a manner similar to how we computed MSE, we will sum each of these APK scores using a reduce action and divide by the number of users (that is, the count of the allRecs RDD): val K = 10 val MAPK = allRecs.join(userMovies).map{ case (userId, (predicted, actualWithIds)) =>   val actual = actualWithIds.map(_._2).toSeq   avgPrecisionK(actual, predicted, K) }.reduce(_ + _) / allRecs.count println("Mean Average Precision at K = " + MAPK) Mean Average Precision at K = 0.030486963254725705 Our model achieves a fairly low MAPK. However, note that typical values for recommendation tasks are usually relatively low, especially if the item set is extremely large. Try out a few parameter settings for lambda and rank (and alpha if you are using the implicit version of ALS) and see whether you can find a model that performs better based on the RMSE and MAPK evaluation metrics. Using MLlib's built-in evaluation functions While we have computed MSE, RMSE, and MAPK from scratch, and it is a useful learning exercise to do so, MLlib provides convenience functions to do this for us in the RegressionMetrics and RankingMetrics classes. RMSE and MSE First, we will compute the MSE and RMSE metrics using RegressionMetrics. We will instantiate a RegressionMetrics instance by passing in an RDD of key-value pairs that represent the predicted and true values for each data point, as shown in the following code snippet. Here, we will again use the ratingsAndPredictions RDD we computed in our earlier example: import org.apache.spark.mllib.evaluation.RegressionMetrics val predictedAndTrue = ratingsAndPredictions.map { case ((user, product), (predicted, actual)) => (predicted, actual) } val regressionMetrics = new RegressionMetrics(predictedAndTrue) We can then access various metrics, including MSE and RMSE.
We will print out these metrics here: println("Mean Squared Error = " + regressionMetrics.meanSquaredError) println("Root Mean Squared Error = " + regressionMetrics.rootMeanSquaredError) You will see that the output for MSE and RMSE is exactly the same as the metrics we computed earlier: Mean Squared Error = 0.08231947642632852 Root Mean Squared Error = 0.2869137090247319 MAP As we did for MSE and RMSE, we can compute ranking-based evaluation metrics using MLlib's RankingMetrics class. Similarly, to our own average precision function, we need to pass in an RDD of key-value pairs, where the key is an Array of predicted item IDs for a user, while the value is an array of actual item IDs. The implementation of the average precision at the K function in RankingMetrics is slightly different from ours, so we will get different results. However, the computation of the overall mean average precision (MAP, which does not use a threshold at K) is the same as our function if we select K to be very high (say, at least as high as the number of items in our item set): First, we will calculate MAP using RankingMetrics: import org.apache.spark.mllib.evaluation.RankingMetrics val predictedAndTrueForRanking = allRecs.join(userMovies).map{ case (userId, (predicted, actualWithIds)) =>   val actual = actualWithIds.map(_._2)   (predicted.toArray, actual.toArray) } val rankingMetrics = new RankingMetrics(predictedAndTrueForRanking) println("Mean Average Precision = " + rankingMetrics.meanAveragePrecision) You will see the following output: Mean Average Precision = 0.07171412913757183 Next, we will use our function to compute the MAP in exactly the same way as we did previously, except that we set K to a very high value, say 2000: val MAPK2000 = allRecs.join(userMovies).map{ case (userId, (predicted, actualWithIds)) =>   val actual = actualWithIds.map(_._2).toSeq   avgPrecisionK(actual, predicted, 2000) }.reduce(_ + _) / allRecs.count println("Mean Average Precision = " + MAPK2000) You will see that the MAP from our own function is the same as the one computed using RankingMetrics: Mean Average Precision = 0.07171412913757186 We will not cover cross validation in this article. However, note that the same techniques for cross-validation can be used to evaluate recommendation models, using the performance metrics such as MSE, RMSE, and MAP, which we covered in this section. Summary In this article, we used Spark's MLlib library to train a collaborative filtering recommendation model, and you learned how to use this model to make predictions for the items that a given user might have a preference for. We also used our model to find items that are similar or related to a given item. Finally, we explored common metrics to evaluate the predictive capability of our recommendation model. To learn more about Spark, the following books published by Packt Publishing (https://www.packtpub.com/) are recommended: Fast Data Processing with Spark - Second Edition (https://www.packtpub.com/big-data-and-business-intelligence/fast-data-processing-spark-second-edition) Spark Cookbook (https://www.packtpub.com/big-data-and-business-intelligence/spark-cookbook) Resources for Article: Further resources on this subject: Reactive Programming And The Flux Architecture [article] Spark - Architecture And First Program [article] The Design Patterns Out There And Setting Up Your Environment [article]

Machine Learning Review

Packt
18 Jul 2017
20 min read
In this article, Uday Kamath and Krishna Choppella, authors of the book Mastering Java Machine Learning, discuss the revival of interest seen in recent years in the area of artificial intelligence (AI) and machine learning in particular, both in academic circles and in industry. In the last decade, AI has seen dramatic successes that eluded practitioners in the intervening years since the original promise of the field gave way to relative decline until its re-emergence in the last few years. What made these successes possible, in large part, was the availability of prodigious amounts of data and the inexorable increase in raw computational power. Among the areas of AI leading the resurgence, machine learning has seen spectacular developments and continues to find the widest applicability in an array of domains. The use of machine learning to help in complex decision making at the highest levels of business, and at the same time, its enormous success in improving the accuracy of what are now everyday applications, such as search, speech recognition, and personal assistants on mobile phones, has made its effects commonplace in the family room and the boardroom alike. Articles breathlessly extolling the power of "deep learning" can be found today not only in the popular science and technology press, but also in mainstream outlets such as The New York Times and The Huffington Post. Machine learning has indeed become ubiquitous in a relatively short time. An ordinary user encounters machine learning in many ways in their day-to-day activities. Interacting with well-known e-mail providers such as Gmail gives the user automated sorting and categorization of e-mails into categories such as spam, junk, promotions, and so on, which is made possible using text mining, a branch of machine learning. When shopping online for products on e-commerce websites such as https://www.amazon.com/ or watching movies from content providers such as http://netflix.com/, one is offered recommendations for other products and content by so-called recommender systems, another branch of machine learning. Forecasting the weather, estimating real estate prices, predicting voter turnout, and even predicting election results: all use some form of machine learning to see into the future, as it were. The ever-growing availability of data and the promise of systems that can enrich our lives by learning from that data place a growing demand on skills from a limited workforce of professionals in the field of data science. This demand is particularly acute for well-trained experts who know their way around the landscape of machine learning techniques in the more popular languages, including Java, Python, R, and, increasingly, Scala. By far, the number and availability of machine learning libraries, tools, APIs, and frameworks in Java outstrip those in other languages. Consequently, mastery of these skills will put any aspiring professional with a desire to enter the field at a distinct advantage in the marketplace. Perhaps you already apply machine learning techniques in your professional work, or maybe you simply have a hobbyist's interest in the subject. Clearly, you can bend Java to your will, but now you feel you're ready to dig deeper and learn how to use the best-of-breed open-source ML Java frameworks in your next data science project.
Mastery of a subject, especially one with such obvious applicability as machine learning, requires more than an understanding of its core concepts and familiarity with its mathematical underpinnings. Unlike an introductory treatment of the subject, a project that purports to help you master the subject must be heavily focused on practical aspects, in addition to introducing more advanced topics that would have stretched the scope of the introductory material. To warm up before we embark on sharpening our instrument, we will devote this article to a quick review of what we already know. For the ambitious novice with little or no prior exposure to the subject (who is nevertheless determined to get the fullest benefit from this article), here's our advice: make sure you do not skip the rest of this article; instead, use it as a springboard to explore unfamiliar concepts in more depth. Seek out external resources as necessary. Wikipedia it. Then jump right back in.

For the rest of this article, we will review the following:

History and definitions
What is not machine learning?
Concepts and terminology
Important branches of machine learning
Different data types in machine learning
Applications of machine learning
Issues faced in machine learning
The meta-process used in most machine learning projects
Information on some well-known tools, APIs, and resources that we will employ in this article

Machine learning – history and definition

It is difficult to give an exact history, but ideas reminiscent of the definition of machine learning we use today can be traced back at least as far as René Descartes' Discourse on the Method (1637), in which he refers to automata and says the following:

For we can easily understand a machine's being constituted so that it can utter words, and even emit some responses to action on it of a corporeal kind, which brings about a change in its organs; for instance, if touched in a particular part it may ask what we wish to say to it; if in another part it may exclaim that it is being hurt, and so on

http://www.earlymoderntexts.com/assets/pdfs/descartes1637.pd
https://www.marxists.org/reference/archive/descartes/1635/discourse-method.htm

Alan Turing, in his famous publication Computing Machinery and Intelligence, gives basic insights into the goals of machine learning by asking the question "Can machines think?":

http://csmt.uchicago.edu/annotations/turing.htm
http://www.csee.umbc.edu/courses/471/papers/turing.pdf

Arthur Samuel, in 1959, wrote: "Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed."

Tom Mitchell, in more recent times, gave a more exact definition of machine learning: "A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E."
Machine learning has a relationship with several areas:

Statistics: This uses the elements of data sampling, estimation, hypothesis testing, learning theory, and statistics-based modeling, to name a few
Algorithms and computation: This uses the basics of search, traversal, parallelization, distributed computing, and so on from core computer science
Database and knowledge discovery: This provides the ability to store, retrieve, and access information in various formats
Pattern recognition: This provides the ability to find interesting patterns in the data, whether to explore, visualize, or predict
Artificial intelligence: Though machine learning is considered a branch of artificial intelligence, it also has relationships with other branches, such as heuristics, optimization, evolutionary computing, and so on

What is not machine learning?

It is important to recognize areas that share a connection with machine learning but cannot themselves be considered part of machine learning. Some disciplines may overlap to a smaller or larger extent, yet the principles underlying machine learning are quite distinct:

Business intelligence and reporting: Reporting key performance indicators (KPIs), querying OLAP for slicing, dicing, and drilling into the data, dashboards, and so on, which form the central components of BI, are not machine learning.
Storage and ETL: Data storage and ETL are key elements needed in any machine learning process, but by themselves they don't qualify as machine learning.
Information retrieval, search, and queries: The ability to retrieve data or documents based on search criteria or indexes, which forms the basis of information retrieval, is not really machine learning. Many forms of machine learning, such as semi-supervised learning, can rely on searching for similar data for modeling, but that doesn't qualify search as machine learning.
Knowledge representation and reasoning: Representing knowledge for performing complex tasks, such as ontologies, expert systems, and the Semantic Web, does not qualify as machine learning.

Machine learning – concepts and terminology

In this article, we will describe different concepts and terms normally used in machine learning:

Data or dataset: The basics of machine learning rely on understanding the data. The data or dataset normally refers to content available in structured or unstructured format for use in machine learning. Structured datasets have specific formats, while an unstructured dataset is normally in the form of some free-flowing text. Data can be available in various storage types and formats. In structured data, every element, known as an instance, example, or row, follows a predefined structure. Data can also be categorized by size: small or medium data have a few hundreds to thousands of instances, whereas big data refers to a large volume, mostly in the millions or billions, which cannot be stored or accessed using common devices or fit in the memory of such devices.

Features, attributes, variables, or dimensions: In structured datasets, as mentioned earlier, there are predefined elements with their own semantics and data type, which are known variously as features, attributes, variables, or dimensions.

Data types: The features defined above need some form of typing in many machine learning algorithms or techniques. The most commonly used data types are as follows:

Categorical or nominal: This indicates well-defined categories or values present in the dataset.
For example, eye color, such as black, blue, brown, green, or grey, or document content type, such as text, image, or video.

Continuous or numeric: This indicates the numeric nature of the data field. For example, a person's weight measured by a bathroom scale, the temperature reading from a sensor, or the monthly balance in dollars on a credit card account.

Ordinal: This denotes data that can be ordered in some way. For example, garment sizes, such as small, medium, or large, or boxing weight classes, such as heavyweight, light heavyweight, middleweight, lightweight, and bantamweight.

Target or label: A feature or set of features in the dataset that is used for learning from the training data and for prediction on unseen data is known as a target or a label. A label can have any of the forms specified earlier, that is, categorical, continuous, or ordinal.

Machine learning model: Each machine learning algorithm, based on what it learned from the dataset, maintains the state of its learning for predicting or giving insights into future or unseen data. This is referred to as the machine learning model.

Sampling: Data sampling is an essential step in machine learning. Sampling means choosing a subset of examples from a population with the intent of treating the behavior seen in the (smaller) sample as representative of the behavior of the (larger) population. For the sample to be representative of the population, care must be taken in the way the sample is chosen. Generally, a population consists of every object sharing the properties of interest in the problem domain, for example, all people eligible to vote in a general election, or all potential automobile owners in the next four years. Since it is usually prohibitive (or impossible) to collect data for all the objects in a population, a well-chosen subset is selected for the purposes of analysis. A crucial consideration in the sampling process is that the sample be unbiased with respect to the population. The following are types of probability-based sampling:

Uniform random sampling: A sampling method where the sampling is done over a uniformly distributed population, that is, each object has an equal probability of being chosen.
Stratified random sampling: A sampling method used when the data can be categorized into multiple classes. In such cases, in order to ensure all categories are represented in the sample, the population is divided into distinct strata based on these classifications, and each stratum is sampled in proportion to the fraction of its class in the overall population. Stratified sampling is common when the population density varies across categories and it is important to compare these categories with the same statistical power.
Cluster sampling: Sometimes there are natural groups within the population being studied, and each group is representative of the whole population. An example is data that spans many geographical regions. In cluster sampling, you take a random subset of the groups, followed by a random sample from within each of those groups, to construct the full data sample. This kind of sampling can reduce the cost of data collection without compromising the fidelity of the distribution in the population.
Systematic sampling: Systematic or interval sampling is used when there is a certain ordering present in the sampling frame (a finite set of objects treated as the population and taken to be the source of data for sampling, for example, the corpus of Wikipedia articles arranged lexicographically by title).
If the sample is then selected by starting at a random object and skipping a constant k number of objects before selecting the next one, it is called systematic sampling. Here, k is calculated as the ratio of the population size to the sample size.

Model evaluation metrics: Evaluating models for performance is generally based on different evaluation metrics for different types of learning. In classification, evaluation is generally based on accuracy, receiver operating characteristic (ROC) curves, training speed, memory requirements, false positive rate, and so on. In clustering, the number of clusters found, cohesion, separation, and so on form the general metrics. In stream-based learning, apart from the standard metrics mentioned above, adaptability, speed of learning, and robustness to sudden changes are some of the conventional metrics for evaluating the performance of the learner.

To illustrate these concepts, a concrete example in the form of a well-known weather dataset is given. The data gives a set of weather conditions and a label that indicates whether the subject decided to play a game of tennis on that day or not:

@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no

The dataset is in the ARFF (Attribute-Relation File Format) format. It consists of a header giving information about the features or attributes and their data types, followed by the actual comma-separated data after the @data tag. The dataset has five features: outlook, temperature, humidity, windy, and play. The features outlook and windy are categorical, while humidity and temperature are continuous. The feature play is the target and is categorical.

Machine learning – types and subtypes

We will now explore the different subtypes or branches of machine learning. Though the following list is not comprehensive, it covers the most well-known types:

Supervised learning: This is the most popular branch of machine learning, which is about learning from labeled data. If the data type of the label is categorical, it becomes a classification problem, and if numeric, it becomes a regression problem. For example, if the target of the dataset is the detection of fraud, which has categorical values of either true or false, we are dealing with a classification problem. If, on the other hand, the target is to predict the best price to list the sale of a home at, which is a numeric dollar value, the problem is one of regression. The following diagram illustrates labeled data that is conducive to classification techniques suitable for linearly separable data, such as logistic regression:

Linearly separable data

There are also datasets that are not linearly separable. This type of problem calls for classification techniques such as support vector machines.

Unsupervised learning: Understanding the data and exploring it in order to build machine learning models when the labels are not given is called unsupervised learning. Clustering, manifold learning, and outlier detection are techniques covered under this topic.
Examples of problems that require unsupervised learning are many; grouping customers according to their purchasing behavior is one example. In the case of biological data, tissue samples can be clustered based on similar gene expression values using unsupervised learning techniques. The following diagram represents data with inherent structure that can be revealed as distinct clusters using an unsupervised learning technique such as k-means:

Clusters in data

Different techniques are used to detect global outliers, examples that are anomalous with respect to the entire dataset, and local outliers, examples that are misfits in their neighborhood. In the following diagram, the notion of local and global outliers is illustrated for a two-feature dataset:

Local and global outliers

Semi-supervised learning: When the dataset has some labeled data together with a large amount of unlabeled data, learning from such a dataset is called semi-supervised learning. When dealing with financial data with the goal of detecting fraud, for example, there may be a large amount of unlabeled data and only a small number of known fraud and non-fraud transactions. In such cases, semi-supervised learning may be applied.

Graph mining: Mining data represented as graph structures is known as graph mining. It is the basis of social network analysis and structure analysis in different bioinformatics, web mining, and community mining applications.

Probabilistic graph modeling and inferencing: Learning and exploiting the structures present between features to model the data comes under the branch of probabilistic graph modeling.

Time-series forecasting: This is a form of learning where the data has distinct temporal behavior and the relationship with time is modeled. A common example is in financial forecasting, where the performance of stocks in a certain sector may be the target of the predictive model.

Association analysis: This is a form of learning where the data is in the form of an item set or market basket, and association rules are modeled to explore and predict the relationships between the items. A common example of association analysis is learning the relationships between the most common items bought by customers when they visit the grocery store.

Reinforcement learning: This is a form of learning where machines learn to maximize performance based on feedback in the form of rewards or penalties received from the environment. A recent example that famously used reinforcement learning was AlphaGo, the machine developed by Google that beat the world Go champion Lee Sedol decisively in March 2016. Using a reward and penalty scheme, the model first trained on millions of board positions in the supervised learning stage, then played itself in the reinforcement learning stage to ultimately become good enough to triumph over the best human player.

http://www.theatlantic.com/technology/archive/2016/03/the-invisible-opponent/475611/
https://gogameguru.com/i/2016/03/deepmind-mastering-go.pdf

Stream learning or incremental learning: Learning in a supervised, unsupervised, or semi-supervised manner from stream data in real time or pseudo-real time is called stream or incremental learning. Learning the behavior of sensors from different types of industrial systems, in order to categorize it as normal or abnormal, requires a real-time feed and real-time detection.
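To make the distinction between supervised and unsupervised learning concrete, here is a minimal Java sketch that applies both to the weather dataset shown earlier. It uses the open-source Weka library as one possible choice of Java ML framework; the library choice, the file name weather.arff, and the specific algorithms (a J48 decision tree and k-means clustering) are illustrative assumptions on our part, not something prescribed by the article:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class WeatherExample {
    public static void main(String[] args) throws Exception {
        // Assumes the ARFF content shown above has been saved as weather.arff.
        Instances data = DataSource.read("weather.arff");
        // Mark the last attribute (play) as the label.
        data.setClassIndex(data.numAttributes() - 1);

        // Supervised learning: train a J48 decision tree on the labeled data
        // and estimate its accuracy with 10-fold cross-validation.
        J48 tree = new J48();
        tree.buildClassifier(data);
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());

        // Unsupervised learning: drop the label and cluster the remaining
        // features with k-means (k = 2). Remove uses 1-based attribute indices.
        Remove removeLabel = new Remove();
        removeLabel.setAttributeIndices(String.valueOf(data.classIndex() + 1));
        removeLabel.setInputFormat(data);
        Instances unlabeled = Filter.useFilter(data, removeLabel);

        SimpleKMeans kMeans = new SimpleKMeans();
        kMeans.setNumClusters(2);
        kMeans.buildClusterer(unlabeled);
        System.out.println(kMeans);
    }
}

The same dataset serves both branches: the classifier learns a mapping from the weather conditions to the play label, while the clusterer looks for structure in the conditions alone, without ever seeing the label.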
Datasets used in machine learning

To learn from data, we must be able to understand and manage data in all its forms. Data originates from many different sources, and consequently, datasets may differ widely in structure or have little or no structure at all. In this section, we present a high-level classification of datasets with commonly occurring examples. Based on structure, datasets may be classified as containing the following:

Structured or record data: Structured data is the most common form of dataset available for machine learning. The data is in the form of records or rows following a well-known format, with features that are either columns in a table or fields delimited by separators or tokens. There is no explicit relationship between the records or instances. The dataset is mostly available in flat files or relational databases. The records of financial transactions at a bank shown in the following screenshot are an example of structured data:

Financial card transactional data with labels of fraud

Transaction or market data: This is a special form of structured data where each record corresponds to a collection of items. Examples of market datasets are the lists of grocery items purchased by different customers, or the movies viewed by customers, as shown in the following screenshot:

Market dataset for items bought from a grocery store

Unstructured data: Unstructured data, unlike structured data, is normally not available in well-known formats. Text, image, and video data are different formats of unstructured data. Normally, a transformation of some form is needed to extract features from these forms of data into a structured dataset of the kind described earlier, so that traditional machine learning algorithms can be applied:

Sample text data from SMS messages, with labels of spam and ham, by Tiago A. Almeida from the Federal University of Sao Carlos

Sequential data: Sequential data has an explicit notion of order. The order can be some relationship between features and a time variable in time series data, or symbols repeating in some form in genomic datasets. Two examples are weather data and genomic sequence data. The following diagram shows the relationship between time and the sensor level for weather:

Time series from sensor data

Three genomic sequences are considered to show the repetition of the sequences CGGGT and TTGAAAGTGGTG in all three genomic sequences:

Genomic sequences of DNA as sequences of symbols

Graph data: Graph data is characterized by the presence of relationships between entities in the data that form a graph structure. Graph datasets may be in structured record format or unstructured format. Typically, the graph relationships have to be mined from the dataset. Claims in the insurance domain can be considered structured records containing relevant claim details, with claimants related through addresses, phone numbers, and so on; this can be viewed as a graph structure. Using the World Wide Web as an example, we have web pages available as unstructured data containing links, and graphs of relationships between web pages can be built using web links, producing some of the most mined graph datasets today:

Insurance claim data, converted into a graph structure with relationships between vehicles, drivers, policies, and addresses
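As a small illustration of the transformation step mentioned for unstructured data, the following plain-Java sketch turns a free-text SMS message into a bag-of-words representation (a map from term to count), which is one simple way of deriving structured features from text. The class name and the sample message are made up for illustration; a real pipeline would typically rely on a text-mining library and handle tokenization, stop words, and stemming more carefully:

import java.util.HashMap;
import java.util.Map;

public class BagOfWords {

    // Convert free-flowing text into term counts: a very simple way of
    // producing structured features from unstructured data.
    static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // A made-up SMS message, similar in spirit to the spam/ham dataset
        // mentioned above.
        String sms = "Free entry! Text WIN now to claim your free prize";
        System.out.println(termCounts(sms));
    }
}

Each distinct term becomes a feature and its count becomes the feature value, so downstream algorithms that expect structured records can consume the result.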
Nevertheless, in this section we list a broad set of machine learning applications by domain, along with typical uses and the types of learning employed:

Financial: Credit risk scoring, fraud detection, and anti-money laundering (supervised, unsupervised, graph models, time series, and stream learning)
Web: Online campaigns, health monitoring, and ad targeting (supervised, unsupervised, and semi-supervised learning)
Healthcare: Evidence-based medicine, epidemiological surveillance, drug event prediction, and claim fraud detection (supervised, unsupervised, graph models, time series, and stream learning)
Internet of Things (IoT): Cyber security, smart roads, and sensor health monitoring (supervised, unsupervised, semi-supervised, and stream learning)
Environment: Weather forecasting, pollution modeling, and water quality measurement (time series, supervised, unsupervised, semi-supervised, and stream learning)
Retail: Inventory, customer management and recommendations, layout, and forecasting (time series, supervised, unsupervised, semi-supervised, and stream learning)

Summary

A revival of interest is seen in the area of artificial intelligence (AI), and machine learning in particular, both in academic circles and in industry. Machine learning is used to help in complex decision making at the highest levels of business. It has also achieved enormous success in improving the accuracy of everyday applications, such as search, speech recognition, and personal assistants on mobile phones. The basics of machine learning rely on an understanding of data: structured datasets have specific formats, while an unstructured dataset is normally in the form of some free-flowing text. Two important branches of machine learning are supervised learning, the most popular branch, which is about learning from labeled data, and unsupervised learning, which is about understanding and exploring the data in order to build machine learning models when labels are not given.

Resources for Article:

Further resources on this subject:

Specialized Machine Learning Topics [article]
Machine learning in practice [article]
Introduction to Machine Learning with R [article]

Mark Zuckerberg's Congressional testimony: 5 things we learned

Richard Gall
11 Apr 2018
8 min read
Mark Zuckerberg yesterday (April 10 2018) testified in front of Congress. That's a pretty big deal. Congress has been waiting some time for the chance to grill the Facebook chief, with "Zuck" resisting. So the fact that he finally had his day in D.C. indicates the level of pressure currently on him. Some have lamented the fact that senators were given so little time to respond to Zuckerberg - there was no time to really get deep into the issues at hand. However, although it's true that there was a lot that was superficial about the event, if you looked closely, there was plenty to take away from it. Here are 5 of the most important things we learned from Mark Zuckerberg's testimony in front of Congress.

Policy makers don't really understand that much about tech

The most shocking thing to come out of Zuckerberg's testimony was also unsurprising: the fact that some of the most powerful people in the U.S. don't really understand the technology being discussed. More importantly, this is technology they're going to have to make decisions on. One senator brought printouts of Facebook pages and asked Zuckerberg if these were examples of Russian propaganda groups. Another was confused about Facebook's business model - how could it run a free service and still make money? Those are just two pretty funny examples, and the senators' lack of understanding could perhaps be forgiven due to their age. However, there surely isn't any excuse for 45-year-old Senator Brian Schatz to misunderstand the relationship between WhatsApp and Facebook. https://twitter.com/pdmcleod/status/983809717116993537 Chris Cillizza argued on CNN that "the senate's tech illiteracy saved Zuckerberg". He explained: The problem was that once Zuckerberg responded - and he largely stuck to a very strict script in doing so - the lack of tech knowledge among those asking him questions was exposed. The result? Zuckerberg was rarely pressed, rarely forced off his talking points, almost never made to answer for the very real questions his platform faces. This lack of knowledge led to proceedings being less than satisfactory for onlookers. Until this knowledge gap is tackled, it's always going to be a challenge for political institutions to keep up with technological innovators. Ultimately, that's what makes regulation hard.

Zuckerberg is still held up as the gatekeeper of tech in 2018

Zuckerberg is still held up as a gatekeeper or oracle of modern technology. That is probably a consequence of the point above. Because there's such a knowledge gap within the institutions that govern and regulate, it's more manageable for them to look to a figurehead. That, of course, goes both ways - on the one hand, Zuckerberg is a fountain of knowledge, someone who can solve these problems. On the other hand, he is part of a Silicon Valley axis of evil, nefariously plotting the downfall of democracy and how to read your WhatsApp messages. Most people know that neither is true. The key point, though, is that however you feel about Zuckerberg, he is not the man you need to ask about regulation. This is something that Zephyr Teachout argues in the Guardian. "We shouldn't be begging for Facebook's endorsement of laws, or for Mark Zuckerberg's promises of self-regulation," she writes. In fact, one of the interesting subplots of the hearing was the fact that Zuckerberg didn't actually know that much. For example, a lot has been made of how extensive his notes were.
And yes, you certainly would expect someone facing a panel of senators in Washington to be well-briefed. But it nevertheless underlines an important point: the fact that Facebook is a complex and multi-faceted organization that far exceeds the knowledge of its founder and CEO. In turn, this tells you something about technology that's often lost in the discourse: it's hard to consider what's happening at a superficial or abstract level without completely missing the point. There's a lot you could say about Zuckerberg's notes. One of the most interesting points was the one around GDPR. The note is very prescriptive: it says "Don't say we already do what GDPR requires." Many have noted that this throws up a lot of issues, not least how Facebook plans to tackle GDPR in just over a month if it hasn't moved on it already. But it's the suggestion that Zuckerberg was completely unaware of the situation that is most remarkable here. He doesn't even know where his company stands on one of the most important pieces of data legislation in decades.

Facebook is incredibly naive

If senators were often naive - or plain ignorant - on matters of technology during the hearing, there was plenty of evidence to indicate that Zuckerberg is just as naive. The GDPR issue mentioned above is just one example. But there are other problems too. You can't, for example, get much more naive than thinking that Cambridge Analytica had deleted the data that Facebook had passed to it. Zuckerberg's initial explanation was that he didn't realize that Cambridge Analytica was "not an app developer or advertiser", but he corrected this, saying that his team told him it was an advertiser back in 2015, which meant Facebook did have reason to act but chose not to. Zuckerberg apologized for this mistake, but it's really difficult to see how this could happen. There almost appears to be a culture of naivety within Facebook, whereby the organization generally, and Zuckerberg specifically, don't fully understand the nature of the platform it has built and what it could be used for. It's only now, with Zuckerberg talking about an "arms race" with Russia, that this naivety is disappearing. But it's clear there was an organizational blind spot that has got us to where we are today.

Facebook still thinks AI can solve all of its problems

The fact that Facebook believes AI is the solution to so many of its problems is indicative of this ingrained naivety. When talking to Congress about the 'arms race' with Russian intelligence, and the wider problem of hate speech, Zuckerberg signaled that the solution lies in the continued development of better AI systems. However, he conceded that building systems actually capable of detecting such speech could be 5 to 10 years away. This is a problem. It's proving a real challenge for Facebook to keep up with the 'misuse' of its platform. Foreign Policy reports that: "...just last week, the company took down another 70 Facebook accounts, 138 Facebook pages, and 65 Instagram accounts controlled by Russia's Internet Research Agency, a baker's dozen of whose executives and operatives have been indicted by Special Counsel Robert Mueller for their role in Russia's campaign to propel Trump into the White House." However, the more AI comes to be deployed on Facebook, the more the company is going to have to rethink how it describes itself. When algorithms regulate the way the platform is used, there is an implicit editorializing of content.
That's not necessarily a bad thing, but it does mean we again return to this final problem...

There's still confusion about the difference between a platform and a publisher

Central to every issue raised in Zuckerberg's testimony was the fact that Facebook remains confused about whether it is a platform or a publisher. Or, more specifically, about the extent to which it is responsible for the content on the platform. It's hard to single out Zuckerberg here because everyone seems to be confused on this point. But it's interesting that he seems never to have really thought about the problem. That does seem to be changing, however. In his testimony, Zuckerberg said that "Facebook was responsible" for the content on its platforms. This statement marks a big change from the typical line used by every social media platform: that platforms are just platforms and bear no responsibility for what is published on them. However, just when you think Zuckerberg is making a definitive statement, he steps back. He went on to say that "I agree that we are responsible for the content, but we don't produce the content." This statement hints that he still wants to keep the distinction between platform and publisher. Unfortunately for Zuckerberg, that might be too late.

Read Next

OpenAI charter puts safety, standards, and transparency first
'If tech is building the future, let's make that future inclusive and representative of all of society' – An interview with Charlotte Jee
What your organization needs to know about GDPR
20 lessons on bias in machine learning systems by Kate Crawford at NIPS 2017


NVIDIA demos a style-based generative adversarial network that can generate extremely realistic images; has ML community enthralled

Prasad Ramesh
17 Dec 2018
4 min read
In a paper published last week, NVIDIA researchers describe a way to generate photos that look like they were taken with a camera. This is done using generative adversarial networks (GANs).

An alternative architecture for GANs

Borrowing from the style transfer literature, the researchers use an alternative generator architecture for GANs. The new architecture induces an automatically learned, unsupervised separation of high-level attributes of an image. These attributes can be, for example, the pose or identity of a person. Images generated via the architecture have some stochastic variation applied to them, such as freckles or hair placement. The architecture allows intuitive and scale-specific control of the synthesis to generate different variations of images.

Better image quality than a traditional GAN

This new generator improves on the state of the art in image quality; the generated images have better interpolation properties, and the generator disentangles the latent factors of variation better. In order to quantify interpolation quality and disentanglement, the researchers propose two new automated methods that are applicable to any generator architecture. They also use a new high-quality, highly varied dataset of human faces. Motivated by the style transfer literature, the NVIDIA researchers redesign the generator architecture to expose novel ways of controlling image synthesis. The generator starts from a learned constant input and adjusts the style of the image at each convolution layer. It makes these changes based on the latent code, thereby having direct control over the strength of image features across different scales. When noise is injected directly into the network, this architectural change causes automatic separation of high-level attributes in an unsupervised manner.

Source: A Style-Based Generator Architecture for Generative Adversarial Networks

In other words, the architecture combines attributes learned from different images in the dataset and applies some variation to synthesize images that look real. As shown in the paper, surprisingly, the redesign does not compromise image quality but instead improves it considerably. Compared with other work, a traditional GAN generator architecture is inferior to the style-based design. The researchers generate not only human faces but also bedrooms, cars, and cats with this new architecture.

Public reactions

This kind of synthetic image generation has generated excitement among the public. A comment from Hacker News reads: "This is just phenomenal. Can see this being a fairly disruptive force in the media industry. Also, sock puppet factories could use this to create endless numbers of fake personas for social media astroturfing." Another comment reads: "The improvements in GANs from 2014 are amazing. From coarse 32x32 pixel images, we have gotten to 1024x1024 images that can fool most humans."

Fake photographic images as evidence?

As a thread on Twitter suggests, could this be the end of photography as evidence? Not very likely, at least for the time being. For something to be considered evidence, very specific content is usually required, for example, a specific person doing a specific action. And as seen from the results in the paper, some cat images are ugly and deformed, far from looking like the real thing. Also, as the authors note, "Our training time is approximately one week on an NVIDIA DGX-1 with 8 Tesla V100 GPUs", and that is a setup that costs up to $70K.
Besides, some speculate that there will be bills in 2019 to control the use of such AI systems:

https://twitter.com/BobbyChesney/status/1074046157431717894

Even the big names in AI are noticing this paper:

https://twitter.com/goodfellow_ian/status/1073294920046145537

You can see a video showcasing the generated images on YouTube.

This AI generated animation can dress like humans using deep reinforcement learning
DeepMasterPrints: 'master key' fingerprints made by a neural network can now fake fingerprints
UK researchers have developed a new PyTorch framework for preserving privacy in deep learning