
How-To Tutorials - Data


What leaders at successful agile Enterprises share in common

Packt Editorial Staff
30 Jul 2018
11 min read
Adopting agile ways of working is easier said than done. Firms like Barclays, C.H. Robinson, Ericsson, Microsoft, and Spotify are considered agile enterprises and operate entrepreneurially at scale. Do the leaders of these firms have something in common? Let's take a look in this article.

The leadership of a firm has a strong bearing on the extent of Enterprise Agility the company can achieve. Leaders are in a position to influence just about every aspect of a business, including vision, mission, strategy, structure, governance, processes, and, more importantly, the culture of the enterprise and the mindset of the employees. This article is an extract from the book Enterprise Agility, written by Sunil Mundra. In it, we'll explore the personal traits of leaders that are critical for Enterprise Agility. Personal traits are by definition intrinsic in nature. They enable the personal development of an individual and are also enablers for certain behaviors. Let's explore these personal traits in detail.

#1 Willingness to expand mental models

Essentially, a mental model is an individual's perception of reality and how something works in that reality. A mental model represents one way of approaching a situation and is a form of deeply-held belief. The critical point is that a mental model represents an individual's view, which may not necessarily be true. Leaders must also consciously let go of mental models that are no longer relevant today. This is especially important for those leaders who have spent a significant part of their career leading enterprises based on mechanistic modelling, as these models will create impediments for Agility in "living" businesses. For example, using monetary rewards as a primary motivator may work for physical work, which is repetitive in nature. However, it does not work as a primary motivator for knowledge workers, for whom intrinsic motivators, namely autonomy, mastery, and purpose, are generally more important than money. Examining the values and assumptions underlying a mental model can help in ascertaining the relevance of that model.

#2 Self-awareness

Self-awareness helps leaders become cognizant of their strengths and weaknesses. This enables leaders to consciously focus on utilizing their strengths and on leveraging the strengths of their peers and teams in areas where they are not strong. Leaders should validate their view of strengths and weaknesses by seeking feedback regularly from the people they work with. According to a survey of senior executives by Cornell's School of Industrial and Labor Relations:

"Leadership searches give short shrift to 'self-awareness,' which should actually be a top criterion. Interestingly, a high self-awareness score was the strongest predictor of overall success. This is not altogether surprising as executives who are aware of their weaknesses are often better able to hire subordinates who perform well in categories in which the leader lacks acumen. These leaders are also more able to entertain the idea that someone on their team may have an idea that is even better than their own."

Self-awareness, a mostly underrated trait, is a huge enabler for enhancing other personal traits.

#3 Creativity

Since emergence is a primary property of complexity, leaders will often be challenged to deal with unprecedented circumstances emerging from within the enterprise and also from the external environment. This implies that what may have worked in the past is less likely to work in new circumstances, and new approaches will be needed to deal with them. Hence, the ability to think creatively, that is, "out of the box," to come up with innovative approaches and solutions is critical. The creativity of an individual has its limitations, so leaders must harness the creativity of a broader group of people in the enterprise. A leader can be a huge enabler here by ideating jointly with a group of people, facilitating discussions, challenging the status quo, and spurring teams to suggest improvements. Leaders can also encourage innovation through experimentation. With the fast pace of change in the external environment, and consequently the continuous evolution of businesses, leaders will often find themselves out of their comfort zone. Leaders will therefore have to get comfortable with being uncomfortable. It will be easier for leaders to think more creatively once they accept this new reality.

#4 Emotional intelligence

Emotional intelligence (EI), also known as emotional quotient (EQ), is defined by Wikipedia as "the capability of individuals to recognize their own emotions and those of others, discern between different feelings and label them appropriately, use emotional information to guide thinking and behavior, and manage and/or adjust emotions to adapt to environments or achieve one's goal/s". [iii]

EI is made up of four core skills:

Self-awareness
Social awareness
Self-management
Relationship management

The importance of EI in people-centric enterprises, especially for leaders, cannot be overstated. While people in a company may be bound by purpose and by being part of a team, people are inherently different from each other in terms of personality types and emotions. This can have a significant bearing on how people in a business deal with and react to circumstances, especially adverse ones. Having high EI enables leaders to understand people "from the inside." This helps leaders build better rapport with people, thereby enabling them to bring out the best in employees and support them as needed.

#5 Courage

An innovative approach to dealing with an unprecedented circumstance will, by definition, carry some risk. The hypothesis about the appropriateness of that approach can only be validated by putting it to the test against reality. Leaders will therefore need to be courageous as they take calculated risky bets, strike hard, and own the outcome of those bets. According to Guo Xiao, the President and CEO of ThoughtWorks:

"There are many threats—and opportunities—facing businesses in this age of digital transformation: industry disruption from nimble startups, economic pressure from massive digital platforms, evolving security threats, and emerging technologies. Today's era, in which all things are possible, demands a distinct style of leadership. It calls for bold individuals who set their company's vision and charge ahead in a time of uncertainty, ambiguity, and boundless opportunity. It demands courage."

Taking risks does not mean being reckless. Rather, leaders need to take calculated risks, after giving due consideration to intuition, facts, and opinions. Despite best efforts and intentions, some decisions will inevitably go wrong. Leaders must have the courage and humility to admit that a decision went wrong and own its outcomes, and not let these failures deter them from taking risks in the future.

#6 Passion for learning

Learnability is the ability to upskill, reskill, and deskill. In today's highly dynamic era, it is not what one knows, or what skills one has, that matters as much as the ability to quickly adapt to a different skill set. It is about understanding what is needed to optimize success and what skills and abilities are necessary, from a leadership perspective, to make the enterprise as a whole successful. Leaders need to shed inhibitions about being seen as "novices" while they acquire and practice new skills. The fact that leaders are willing to acquire new skills can be hugely impactful in terms of encouraging others in the enterprise to do the same. This is especially important for bringing in and encouraging a culture of learnability across the business.

#7 Awareness of cognitive biases

Cognitive biases are flaws in thinking that can lead to suboptimal decisions. Leaders need to become aware of these biases so that they can objectively assess whether their decisions are being influenced by any of them. Cognitive biases lead to shortcuts in decision-making. Essentially, these biases are an attempt by the brain to simplify information processing. Leaders today are challenged with an overload of information and also the need to make decisions quickly. These factors can contribute to decisions and judgements being influenced by cognitive biases. Over decades, psychologists have discovered a huge number of biases. However, the following are the most important from a decision-making perspective.

Confirmation bias
This is the tendency to selectively seek and hold onto information that reaffirms what you already believe to be true. For example, a leader believes that a recently launched product is doing well, based on the initial positive response. He has developed a bias that this product is successful. However, although the product is succeeding in attracting new customers, it is also losing existing customers. The confirmation bias is making the leader focus only on data pertaining to new customers, so he is ignoring data related to the loss of existing customers.

Bandwagon effect bias
Bandwagon effect bias, also known as "herd mentality," encourages doing something because others are doing it. The bias creates a feeling of not wanting to be left behind and hence can lead to irrational or badly-thought-through decisions. Enterprises launching an Agile transformation initiative without understanding the implications of the long and difficult journey ahead is an example of this bias.

"Guru" bias
Guru bias leads to blindly relying on an expert's advice. This can be detrimental, as the expert could be wrong in their assessment and therefore the advice could also be wrong. Also, the expert might give advice that primarily furthers his or her own interests over the interests of the enterprise.

Projection bias
Projection bias leads a person to believe that other people have understood and are aligned with their thinking, while in reality this may not be true. This bias is more prevalent in enterprises where employees are fearful of admitting that they have not understood what their "bosses" have said, asking questions to clarify, or expressing disagreement.

Stability bias
Stability bias, also known as "status quo" bias, leads to a belief that change will lead to unfavorable outcomes, that is, that the risk of loss is greater than the possibility of benefit. It makes a person believe that stability and predictability lead to safety. For decades, the mandate for leaders was to strive for stability, and hence many older leaders are susceptible to this bias. Leaders must encourage others in the enterprise to challenge biases, which can uncover the "blind spots" arising from them. Once decisions are made, attention should be paid to information coming from feedback.

#8 Resilience

Resilience is the capacity to quickly recover from difficulties. Given the turbulent business environment, rapidly changing priorities, and the need to take calculated risks, leaders are likely to encounter difficult and challenging situations quite often. Under such circumstances, resilience will help a leader to "take knocks on the chin" and keep moving forward. Resilience is also about maintaining composure when something fails, analyzing the failure with the team in an objective manner, and learning from that failure. The actions of leaders are watched by the people in the enterprise even more closely in periods of crisis and difficulty, and hence leaders showing resilience go a long way toward increasing resilience across the company.

#9 Responsiveness

Responsiveness, from the perspective of leadership, is the ability to quickly grasp and respond to both challenges and opportunities. Leaders must listen to feedback coming from customers and the marketplace, learn from it, and adapt accordingly. Leaders must be ready to enable the morphing of the enterprise's offerings in order to stay relevant for customers and also to exploit opportunities. This implies that leaders must be willing to adjust the "pivot" of their offerings based on feedback. Consider, for example, the journey of Amazon Web Services, which began as an internal system but has grown into a highly successful business. Other prominent examples are Twitter, which was an offshoot of Odeo, a website focused on sound and podcasting, and PayPal's move from transferring money via PalmPilots to becoming a highly robust online payment service.

We discovered that leaders are the primary catalysts for any enterprise aspiring to enhance its Agility. Leaders need specific capabilities, over and above the standard leadership capabilities, to take the business down the path of enhanced Enterprise Agility. These capabilities comprise personal traits and behaviors that are intrinsic in nature and enable leadership Agility, which is the foundation of Enterprise Agility. To learn more about how an enterprise can thrive in a dynamic business environment, check out the book Enterprise Agility.

Skill Up 2017: What we learned about tech pros and developers
96% of developers believe developing soft skills is important
Soft skills every data scientist should teach their child

Anatomy of an automated machine learning algorithm (AutoML)

Kunal Chaudhari
10 May 2018
7 min read
Machine learning has always been dependent on the selection of the right features within a given model, and even on the selection of the right algorithm. Deep learning changed this: the selection process is now built into the models themselves, and researchers and engineers are shifting their focus from feature engineering to network engineering. Out of this, AutoML, or meta learning, has become an increasingly important part of deep learning. AutoML is an emerging research topic which aims at auto-selecting the most efficient neural network for a given learning task. In other words, AutoML represents a set of methodologies for learning how to learn efficiently. Consider, for instance, the tasks of machine translation, image recognition, or game playing. Typically, the models are manually designed by a team of engineers, data scientists, and domain experts. If you consider that a typical 10-layer network can have on the order of 10^10 candidate networks, you understand how expensive, error-prone, and ultimately sub-optimal the process can be.

This article is an excerpt from a book written by Antonio Gulli and Amita Kapoor titled TensorFlow 1.x Deep Learning Cookbook. This book is an easy-to-follow guide that lets you explore reinforcement learning, GANs, autoencoders, multilayer perceptrons and more.

AutoML with recurrent networks and with reinforcement learning

The key idea for tackling this problem is to have a controller network which proposes a child model architecture with probability p, given a particular network as input. The child is trained and evaluated on the particular task to be solved (say, for instance, that the child achieves accuracy R). This evaluation R is passed back to the controller which, in turn, uses R to improve the next candidate architecture. Given this framework, it is possible to model the feedback from the candidate child to the controller as the task of computing the gradient of p and then scaling this gradient by R. The controller can be implemented as a Recurrent Neural Network. In doing so, the controller will tend to privilege, iteration after iteration, candidate areas of architecture that achieve better R, and will tend to assign a lower probability to candidate areas that do not score so well.

For instance, a controller recurrent neural network can sample a convolutional network. The controller can predict many hyper-parameters, such as filter height, filter width, stride height, stride width, and the number of filters for one layer, and then repeat. Every prediction can be carried out by a softmax classifier and then fed into the next RNN time step as input. This process is described in Neural Architecture Search with Reinforcement Learning by Barret Zoph and Quoc V. Le.

Predicting hyperparameters is not enough, as it would be optimal to define a set of actions to create new layers in the network. This is particularly difficult because the reward function that describes the new layers is most likely not differentiable, which makes it impossible to optimize using standard techniques such as SGD. The solution comes from reinforcement learning: it consists of adopting a policy gradient network. Besides that, parallelism can be used to optimize the parameters of the controller RNN. Quoc Le and Barret Zoph proposed adopting a parameter-server scheme, where a parameter server of S shards stores the shared parameters for K controller replicas. Each controller replica samples m different child architectures that are trained in parallel, as illustrated in the same paper.

Quoc and Barret applied AutoML techniques for Neural Architecture Search to the Penn Treebank dataset, a well-known benchmark for language modeling. Their results improve on the manually designed networks previously considered the state of the art. In particular, they achieve a test set perplexity of 62.4 on the Penn Treebank, which is 3.6 perplexity better than the previous state-of-the-art model. Similarly, on the CIFAR-10 dataset, starting from scratch, the method can design a novel network architecture that rivals the best human-invented architecture in terms of test set accuracy. The proposed CIFAR-10 model achieves a test error rate of 3.65, which is 0.09 percent better and 1.05x faster than the previous state-of-the-art model that used a similar architectural scheme.

Meta-learning blocks

In Learning Transferable Architectures for Scalable Image Recognition (Barret Zoph, Vijay Vasudevan, Jonathon Shlens, Quoc V. Le, 2017), the authors propose to learn an architectural building block on a small dataset that can be transferred to a large dataset. They search for the best convolutional layer (or cell) on the CIFAR-10 dataset and then apply this learned cell to the ImageNet dataset by stacking together more copies of the cell, each with its own parameters. Precisely, all convolutional networks are made of convolutional layers (or cells) with identical structures but different weights. Searching for the best convolutional architecture is therefore reduced to searching for the best cell structure, which is faster and more likely to generalize to other problems. Although the cell is not learned directly on ImageNet, an architecture constructed from the best learned cell achieves, among the published work, state-of-the-art accuracy of 82.7 percent top-1 and 96.2 percent top-5 on ImageNet. The model is 1.2 percent better in top-1 accuracy than the best human-invented architectures while having 9 billion fewer FLOPS—a reduction of 28 percent from the previous state-of-the-art model. What is also important to notice is that the model learned with RNN+RL (Recurrent Neural Networks + Reinforcement Learning) beats the baseline represented by Random Search (RS): in the mean performance of the top-5 and top-25 models identified, RL always wins over RS, as shown in the figure from the paper.

AutoML and learning new tasks

Meta-learning systems can be trained to achieve a large number of tasks and are then tested for their ability to learn new tasks. A famous example of this kind of meta-learning is transfer learning, where networks can successfully learn new image-based tasks from relatively small datasets. However, there is no analogous pre-training scheme for non-vision domains such as speech, language, and text. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (Chelsea Finn, Pieter Abbeel, Sergey Levine, 2017) proposes a model-agnostic approach named MAML, compatible with any model trained with gradient descent and applicable to a variety of different learning problems, including classification, regression, and reinforcement learning. The goal of meta-learning is to train a model on a variety of learning tasks, such that it can solve new learning tasks using only a small number of training samples. The meta-learner aims at finding an initialization that adapts to various problems quickly (in a small number of steps) and efficiently (using only a few examples).

Consider a model represented by a parametrized function f_θ with parameters θ. When adapting to a new task T_i, the model's parameters θ become θ'_i. In MAML, the updated parameter vector θ'_i is computed using one or more gradient descent updates on task T_i. For example, when using one gradient update:

θ'_i = θ − α ∇_θ L_{T_i}(f_θ)

where L_{T_i} is the loss function for task T_i and α is the learning rate of the inner update. The full MAML algorithm is given as pseudocode in the original paper; a minimal sketch of the inner/outer loop follows this excerpt. MAML was able to substantially outperform a number of existing approaches on popular few-shot image classification benchmarks. Few-shot image classification is a challenging problem that aims at learning new concepts from one or a few instances of that concept. As an example, Human-level Concept Learning Through Probabilistic Program Induction (Brenden M. Lake, Ruslan Salakhutdinov, Joshua B. Tenenbaum, 2015) suggested that humans can learn to identify novel two-wheel vehicles from a single picture.

If you enjoyed this excerpt, check out the book TensorFlow 1.x Deep Learning Cookbook, to skill up and implement tricky neural networks using Google's TensorFlow 1.x.

AmoebaNets: Google's new evolutionary AutoML
AutoML: Developments and where it is heading
What is Automated Machine Learning (AutoML)?
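To make the MAML update above concrete, here is a minimal NumPy sketch of the inner/outer loop on a toy family of linear-regression tasks. The toy task, the first-order approximation of the meta-gradient, and all hyper-parameter values are assumptions for illustration only; they are not the setup used by Finn et al. or by the TensorFlow 1.x Deep Learning Cookbook. The outer loop implements the meta-update θ ← θ − β ∇ Σ_i L_{T_i}(f_{θ'_i}).

import numpy as np

rng = np.random.default_rng(0)

def make_task():
    """Toy regression task y = a*x + b with task-specific a, b (an assumption
    for illustration; the paper uses sine-wave regression and few-shot images)."""
    a = rng.uniform(0.5, 2.0)
    b = rng.uniform(-1.0, 1.0)
    def sample(n):
        x = rng.uniform(-1, 1, size=n)
        return x, a * x + b
    return sample

def loss_and_grad(theta, x, y):
    """Squared-error loss of f_theta(x) = theta[0]*x + theta[1] and its gradient."""
    err = theta[0] * x + theta[1] - y
    loss = np.mean(err ** 2)
    grad = np.array([np.mean(2 * err * x), np.mean(2 * err)])
    return loss, grad

alpha, beta = 0.1, 0.01          # inner- and outer-loop step sizes (assumed)
theta = np.zeros(2)              # meta-parameters theta

for step in range(1000):
    meta_grad = np.zeros_like(theta)
    for _ in range(8):                       # batch of tasks T_i
        sample = make_task()
        x_tr, y_tr = sample(5)               # K-shot support set
        x_val, y_val = sample(5)             # query set
        # Inner update: theta_i' = theta - alpha * grad L_Ti(f_theta)
        _, g = loss_and_grad(theta, x_tr, y_tr)
        theta_i = theta - alpha * g
        # Outer gradient evaluated at theta_i' (first-order approximation,
        # which ignores second derivatives -- a simplification of full MAML)
        _, g_val = loss_and_grad(theta_i, x_val, y_val)
        meta_grad += g_val
    theta -= beta * meta_grad / 8            # meta-update of theta

print("meta-learned initialization:", theta)

Running the sketch drives θ toward an initialization from which a single inner gradient step fits a new task from this family reasonably well, which is exactly the behavior the MAML objective optimizes for.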

How to execute jobs in an iterative way with Pentaho Data Integration (PDI)

Vijin Boricha
26 Feb 2018
8 min read
Note: This is a book excerpt from Learning Pentaho Data Integration 8 CE - Third Edition, written by María Carina Roldán. From this book, you will learn to explore, transform, and integrate your data across multiple sources.

Today, we will learn to configure and use the Job Executor, along with capturing the result filenames.

Using Job Executors

The Job Executor is a PDI step that allows you to execute a Job several times, simulating a loop. The executor receives a dataset, and then executes the Job once for each row, or for each set of rows, of the incoming dataset. To understand how this works, we will build a very simple example. The Job that we will execute will have two parameters: a folder and a file. It will create the folder, and then it will create an empty file inside the new folder. Both the name of the folder and the name of the file will be taken from the parameters. The main transformation will execute the Job iteratively for a list of folder and file names.

Let's start by creating the Job:

1. Create a new Job.
2. Double-click on the work area to bring up the Job properties window. Use it to define two named parameters: FOLDER_NAME and FILE_NAME.
3. Drag a START, a Create a folder, and a Create file entry to the work area and link them in that order.
4. Double-click the Create a folder entry. As Folder name, type ${FOLDER_NAME}.
5. Double-click the Create file entry. As File name, type ${FOLDER_NAME}/${FILE_NAME}.
6. Save the Job and test it, providing values for the folder and filename. The Job should create a folder with an empty file inside, both with the names that you provide as parameters.

Now create the main Transformation:

1. Create a Transformation.
2. Drag a Data Grid step to the work area and define a single field named foldername. The type should be String.
3. Fill the Data Grid with a list of folders to be created.
4. As the name of the file, you can use any name of your choice. As an example, we will create a random name using a Generate random value step and a UDJE step.
5. With the last step selected, run a preview. You should see the full list of folders and filenames.
6. At the end of the stream, add a Job Executor step. You will find it under the Flow category of steps.
7. Double-click on the Job Executor step.
8. As Job, select the path to the Job created before, for example, ${Internal.Entry.Current.Directory}/create_folder_and_file.kjb
9. Configure the Parameters grid, mapping each named parameter to the corresponding incoming field.
10. Close the window and save the transformation.
11. Run the transformation. The Step Metrics in the Execution Results window reflects what happens.
12. Click on the Logging tab. You will see the full log for the Job.
13. Browse your filesystem. You will find all the folders and files just created.

As you can see, PDI executes the Job as many times as the number of rows that arrive at the Job Executor step, once for every row. Each time the Job executes, it receives values for the named parameters, and creates the folder and file using those values.

Configuring the executors with advanced settings

Just as with the Transformation Executors that you already know, the Job Executors can be configured with similar settings. This allows you to customize the behavior and the output of the Job to be executed. Let's summarize the options.

Getting the results of the execution of the Job

The Job Executor doesn't cause the Transformation to abort if the Job that it runs has errors. To verify this, run the sample transformation again. As the folders already exist, you might expect each individual execution to fail; however, the Job Executor ends without error. In order to capture the errors in the execution of the Job, you have to get the execution results. This is how you do it:

1. Drag a step to the work area where you want to redirect the results. It could be any step; for testing purposes, we will use a Text file output step.
2. Create a hop from the Job Executor toward this new step. You will be prompted for the kind of hop. Choose the This output will contain the execution results option.
3. Double-click on the Job Executor and select the Execution results tab. You will see the list of metrics and results available. The Field name column has the names of the fields that will contain these results. If there are results you are not interested in, delete the value in the Field name column. For the results that you want to keep, you can leave the proposed field name or type a different name. For example, you can generate a field for the log only.
4. When you are done, click on OK.
5. With the destination step selected, run a preview. You will see the results that you just defined.
6. If you copy any of the lines and paste it into a text editor, you will see the full log for the execution, for example:

2017/10/26 23:45:53 - create_folder_and_file - Starting entry [Create a folder]
2017/10/26 23:45:53 - create_folder_and_file - Starting entry [Create file]
2017/10/26 23:45:53 - Create file - File [c:/pentaho/files/folder1/sample_50n9q8oqsg6ib.tmp] created!
2017/10/26 23:45:53 - create_folder_and_file - Finished job entry [Create file] (result=[true])
2017/10/26 23:45:53 - create_folder_and_file - Finished job entry [Create a folder] (result=[true])

Working with groups of data

As you know, jobs don't work with datasets; transformations do. However, you can still use the Job Executor to send rows to the Job. Then, any transformation executed by your Job can get the rows using a Get rows from result step. By default, the Job Executor executes once for every row in your dataset, but there are several behaviors you can configure in the Row Grouping tab of the configuration window:

You can send groups of N rows, where N is greater than 1
You can pass a group of rows based on the value in a field
You can send groups of rows based on the time the step collects rows before executing the Job

Using variables and named parameters

If the Job has named parameters—as in the example that we built—you provide values for them in the Parameters tab of the Job Executor step. For each named parameter, you can assign the value of a field or a fixed, static value. If you execute the Job for a group of rows instead of a single one, the parameters will take their values from the first row of data sent to the Job.

Capturing the result filenames

At the output of the Job Executor, there is also the possibility of getting the result filenames. Let's modify the Transformation that we created to show an example of this kind of output:

1. Open the transformation created at the beginning of the section.
2. Drag a Write to log step to the work area.
3. Create a hop from the Job Executor toward the Write to log step. When asked for the kind of hop, select the option named This output will contain the result file names after execution.
4. Double-click the Job Executor, select the Result files tab, and configure it.
5. Double-click the Write to log step and, in the Fields grid, add the FileName field.
6. Close the window and save the transformation.
7. Run it. Look at the Logging tab in the Execution Results window. You will see the names of the files in the result filelist, which are the files created in the Job:

... - Write to log.0 - ------------> Linenr 1------------------------------
... - Write to log.0 - filename = file:///c:/pentaho/files/folder1/sample_5agh7lj6ncqh7.tmp
... - Write to log.0 - ====================
... - Write to log.0 - ------------> Linenr 2------------------------------
... - Write to log.0 - filename = file:///c:/pentaho/files/folder2/sample_6n0rhmrpvj21n.tmp
... - Write to log.0 - ====================
... - Write to log.0 - ------------> Linenr 3------------------------------
... - Write to log.0 - filename = file:///c:/pentaho/files/folder3/sample_7ulkja68vf1td.tmp
... - Write to log.0 - ====================

The example that you just created showed this option with a Job Executor. We learned how to nest jobs and iterate the execution of jobs. You can learn more about executing transformations in an iterative way and launching transformations and jobs from the command line in the book Learning Pentaho Data Integration 8 CE - Third Edition.

Basic Image Processing

Packt
02 Jun 2015
8 min read
In this article, Ashwin Pajankar, the author of the book Raspberry Pi Computer Vision Programming, takes us through basic image processing in OpenCV. We will do this with the help of the following topics:

Image arithmetic operations—adding, subtracting, and blending images
Splitting color channels in an image
Negating an image
Performing logical operations on an image

This article is very short and easy to code, with plenty of hands-on activities.

Arithmetic operations on images

In this section, we will have a look at the various arithmetic operations that can be performed on images. Images are represented as matrices in OpenCV, so arithmetic operations on images are similar to arithmetic operations on matrices. Images must be of the same size for you to perform arithmetic operations on them, and these operations are performed on individual pixels.

cv2.add(): This function is used to add two images, where the images are passed as parameters.
cv2.subtract(): This function is used to subtract an image from another. Subtraction is not commutative, so cv2.subtract(img1,img2) and cv2.subtract(img2,img1) will yield different results, whereas cv2.add(img1,img2) and cv2.add(img2,img1) will yield the same result, as addition is commutative. Both images have to be of the same size and type, as explained before.

Check out the following code:

import cv2
img1 = cv2.imread('/home/pi/book/test_set/4.2.03.tiff',1)
img2 = cv2.imread('/home/pi/book/test_set/4.2.04.tiff',1)
cv2.imshow('Image1',img1)
cv2.waitKey(0)
cv2.imshow('Image2',img2)
cv2.waitKey(0)
cv2.imshow('Addition',cv2.add(img1,img2))
cv2.waitKey(0)
cv2.imshow('Image1-Image2',cv2.subtract(img1,img2))
cv2.waitKey(0)
cv2.imshow('Image2-Image1',cv2.subtract(img2,img1))
cv2.waitKey(0)
cv2.destroyAllWindows()

The preceding code demonstrates the usage of arithmetic functions on images, displaying the Image1, Image2, Addition, Image1-Image2, and Image2-Image1 output windows in turn.

Blending and transitioning images

The cv2.addWeighted() function calculates the weighted sum of two images. Because of the weight factor, it provides a blending effect to the images. Add the following lines of code before destroyAllWindows() in the previous code listing to see this function in action:

cv2.imshow('Blended',cv2.addWeighted(img1,0.5,img2,0.5,0))
cv2.waitKey(0)

In the preceding code, we passed the following five arguments to the addWeighted() function:

img1: This is the first image.
alpha: This is the weight factor for the first image (0.5 in the example).
img2: This is the second image.
beta: This is the weight factor for the second image (0.5 in the example).
gamma: This is a scalar value (0 in the example).

The output image value is calculated for every individual pixel as:

output = (alpha * img1) + (beta * img2) + gamma

We can create a film-style transition effect on the two images by using the same function. Check out the following code, which creates a smooth transition from one image to another:

import cv2
import numpy as np
import time

img1 = cv2.imread('/home/pi/book/test_set/4.2.03.tiff',1)
img2 = cv2.imread('/home/pi/book/test_set/4.2.04.tiff',1)

for i in np.linspace(0,1,40):
    alpha = i
    beta = 1 - alpha
    print 'ALPHA =' + str(alpha) + ' BETA =' + str(beta)
    cv2.imshow('Image Transition', cv2.addWeighted(img1,alpha,img2,beta,0))
    time.sleep(0.05)
    if cv2.waitKey(1) == 27:
        break

cv2.destroyAllWindows()

Splitting and merging image colour channels

On several occasions, we may be interested in working separately with the red, green, and blue channels. For example, we might want to build a histogram for every channel of an image. cv2.split() is used to split an image into three different intensity arrays, one for each color channel, whereas cv2.merge() is used to merge different arrays into a single multi-channel array, that is, a color image. The following example demonstrates this:

import cv2
img = cv2.imread('/home/pi/book/test_set/4.2.03.tiff',1)
b,g,r = cv2.split(img)
cv2.imshow('Blue Channel',b)
cv2.imshow('Green Channel',g)
cv2.imshow('Red Channel',r)
img = cv2.merge((b,g,r))
cv2.imshow('Merged Output',img)
cv2.waitKey(0)
cv2.destroyAllWindows()

The preceding program first splits the image into three channels (blue, green, and red) and then displays each one of them. The separate channels only hold the intensity values of the particular color, so they are essentially displayed as grayscale intensity images. The program then merges all the channels back into an image and displays it.

Creating a negative of an image

In mathematical terms, the negative of an image is the inversion of its colors. For a grayscale image, it is even simpler: the negative is just the intensity inversion, which can be achieved by finding the complement of the intensity from 255. A pixel value ranges from 0 to 255, so negation involves subtracting the pixel value from the maximum value, that is, 255. The code for this is as follows:

import cv2
img = cv2.imread('/home/pi/book/test_set/4.2.07.tiff')
grayscale = cv2.cvtColor(img,cv2.COLOR_BGR2GRAY)
negative = abs(255-grayscale)
cv2.imshow('Original',img)
cv2.imshow('Grayscale',grayscale)
cv2.imshow('Negative',negative)
cv2.waitKey(0)
cv2.destroyAllWindows()

The program displays the Original, Grayscale, and Negative windows. The negative of a negative is the original grayscale image. Try this on your own by taking the negative of the negative again.

Logical operations on images

OpenCV provides bitwise logical operation functions for images. We will have a look at the functions that provide the bitwise logical AND, OR, XOR (exclusive OR), and NOT (inversion) functionality. These functions are best demonstrated visually with grayscale images. I am going to use barcode images in horizontal and vertical orientation for the demonstration. Let's have a look at the following code:

import cv2
import matplotlib.pyplot as plt

img1 = cv2.imread('/home/pi/book/test_set/Barcode_Hor.png',0)
img2 = cv2.imread('/home/pi/book/test_set/Barcode_Ver.png',0)
not_out = cv2.bitwise_not(img1)
and_out = cv2.bitwise_and(img1,img2)
or_out = cv2.bitwise_or(img1,img2)
xor_out = cv2.bitwise_xor(img1,img2)

titles = ['Image 1','Image 2','Image 1 NOT','AND','OR','XOR']
images = [img1,img2,not_out,and_out,or_out,xor_out]

for i in xrange(6):
    plt.subplot(2,3,i+1)
    plt.imshow(images[i],cmap='gray')
    plt.title(titles[i])
    plt.xticks([]),plt.yticks([])
plt.show()

We first read the images in grayscale mode, calculated the NOT, AND, OR, and XOR outputs, and then displayed them in a neat grid with matplotlib. We leveraged the plt.subplot() function to display multiple images: in the preceding example, we created a grid with two rows and three columns and displayed each image in its own cell of the grid. You can modify this line and change it to plt.subplot(3,2,i+1) to create a grid with three rows and two columns.

We can also use this technique without a loop. For each image, you have to write the following statements. Here is the code for the first image only; go ahead and write it for the rest of the five images:

plt.subplot(2,3,1) , plt.imshow(img1,cmap='gray') , plt.title('Image 1') , plt.xticks([]),plt.yticks([])

Finally, use plt.show() to display the grid. This technique is useful for avoiding the loop when only a very small number of images, usually two or three, have to be displayed. Make a note of the fact that the logical NOT operation is the negative of the image.

Exercise

You may want to have a look at the functionality of cv2.copyMakeBorder(). This function is used to create borders and padding for images, and many of you will find it useful for your projects. Exploring this function is left as an exercise for the reader (a minimal starting sketch appears at the end of this excerpt). You can check the Python OpenCV API documentation at the following location: http://docs.opencv.org/modules/refman.html

Summary

In this article, we learned how to perform arithmetic and logical operations on images and how to split images by their channels. We also learned how to display multiple images in a grid by using matplotlib.

Resources for Article:

Further resources on this subject:
Raspberry Pi and 1-Wire [article]
Raspberry Pi Gaming Operating Systems [article]
The Raspberry Pi and Raspbian [article]
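As a starting point for the cv2.copyMakeBorder() exercise mentioned above, here is a minimal sketch. The 20-pixel border size, the border colour, and the reuse of one of the earlier image paths are arbitrary choices for illustration, not values prescribed by the book:

import cv2

# Assumed path from the earlier listings -- substitute any image on your system.
img = cv2.imread('/home/pi/book/test_set/4.2.03.tiff',1)

# Add a 20-pixel border on every side: once with a solid colour, and once by
# replicating the edge pixels of the image.
constant  = cv2.copyMakeBorder(img, 20, 20, 20, 20, cv2.BORDER_CONSTANT, value=[255, 0, 0])
replicate = cv2.copyMakeBorder(img, 20, 20, 20, 20, cv2.BORDER_REPLICATE)

cv2.imshow('Constant Border', constant)
cv2.imshow('Replicated Border', replicate)
cv2.waitKey(0)
cv2.destroyAllWindows()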

The road to Cassandra 4.0 – What does the future have in store?

Guest Contributor
06 Jul 2019
5 min read
In May 2019, DataStax hosted the Accelerate conference for Apache Cassandra™, inviting community members, DataStax customers, and other users to come together, discuss the latest developments around Cassandra, and find out more about where the project is heading. Nate McCall, Apache Cassandra Project Chair, presented the road to version 4.0 and what the community is focusing on for the future. So, what does the future really hold for Cassandra? The project has been going for ten years already, so what is left to add?

First off, listening to Nate's keynote, the approach to development has evolved. To understand that approach, it's important to understand who is committing updates to Cassandra. The number of organisations contributing to Cassandra has increased, while the companies involved in the Project Management Committee include some of the biggest companies in the world. The likes of Instagram, Facebook, and Netflix have team members contributing to and leading the development of Cassandra because it is essential to their businesses. For DataStax, we continue to support the growth and development of Cassandra as an open source project through our own code contributions, our development and training, and our drivers that are available for the community and for our customers alike.

Having said all this, there are still areas where Cassandra can improve as we get ready for 4.0. From a development standpoint, the big things to look forward to, as mentioned in Nate's keynote, are the following.

An improved Repair model

For a distributed database, being able to carry on through any failure event is critical. After a failure, the affected nodes have to be brought back online and then catch up with the transactions that they missed. Making nodes consistent is a big task, covered by the Repair function. In Cassandra 4.0, the aim is to make Repair smarter. For example, Cassandra can preview the impact of a repair on a host to check that the operation will go through successfully, and specific pull requests for data can also be supported. Alongside this, a new transient replication feature should reduce the cost and bandwidth overhead associated with repair. By replicating temporary copies of data to supplement full copies, the overall cluster should be able to achieve higher levels of availability while at the same time reducing the overall volume of storage required significantly. For companies running very large clusters, the cost savings achievable here could be massive.

A Messaging rewrite

Efficient messaging between nodes is essential when your database is distributed. Cassandra 4.0 will have a new messaging system in place based on Netty, an asynchronous event-driven network application framework. In practice, using Netty will improve the performance of messaging between nodes within clusters and between clusters. On top of this change, zero copy support will improve how quickly data can be streamed between nodes. It achieves this by modifying the streaming path to add additional information into the streaming header and then using zero-copy APIs to transfer bytes to and from the network and disk. This allows nodes to transfer large files faster.

Cassandra and Kubernetes support

Adding new messaging support and being able to transfer SSTables means that Cassandra can add more support for Kubernetes, and that Kubernetes can do interesting things around Cassandra too. One area that has been discussed is dynamic cluster management, where the number of nodes and the volume of storage can be increased or decreased on demand.

Sidecars

Sidecars are additional functional tools designed to work alongside a main process. They fill a gap that is not part of the main application or service, and that should remain separate but linked. For Cassandra, running sidecars allows developers to add more functionality to their operations, such as creating events on an application.

Java 11 support

Java 11 support has been added to the Cassandra trunk and will be present in 4.0. This will allow Cassandra users to run on Java 11, rather than version 8, which is no longer supported.

Diagnostic events and logging

This will make it easier for teams to use events for a range of things, from security requirements through to logging activities and triggering tools.

There were two big trends that I took from the event. The first is – as Nate commented in his keynote – that there is a definite need for more community events that can bring together people who care about Cassandra and get them working together. The second is that Apache Cassandra is essential to many companies today. Some of the world's largest internet companies and most valuable brands rely on Cassandra in order to achieve what they do. They are contributors and committers to Cassandra, and they have to be sure that Cassandra is ready to meet their requirements. For everyone using Cassandra, this means that versions have to be ready for use in production rather than shipping with issues still to be fixed. Things will get released when they are ready, rather than to meet a particular deadline, and the community will take the lead in ensuring that they are happy with any release.

Cassandra 4.0 is nearing release, and it will be out when it is ready. Whether you are looking at getting involved with the project through contributions, developing drivers, or writing documentation, there is a warm welcome for everyone in the run-up to what should be a great release. I'm already looking forward to ApacheCon later this year!

Author Bio

Patrick McFadin is the vice president of developer relations at DataStax, where he leads a team devoted to making users of DataStax products successful. Previously, he was chief evangelist for Apache Cassandra and a consultant for DataStax, where he helped build some of the largest and most exciting deployments in production; a chief architect at Hobsons; and an Oracle DBA and developer for over 15 years.

Python text processing with NLTK 2.0: creating custom corpora

Packt
18 Nov 2010
12 min read
In this article, we'll cover how to use corpus readers and create custom corpora. At the same time, you'll learn how to use the existing corpus data that comes with NLTK. We'll also cover creating custom corpus readers, which can be used when your corpus is not in a file format that NLTK already recognizes, or if your corpus is not in files at all, but instead is located in a database such as MongoDB. Setting up a custom corpus A corpus is a collection of text documents, and corpora is the plural of corpus. So a custom corpus is really just a bunch of text files in a directory, often alongside many other directories of text files. Getting ready You should already have the NLTK data package installed, following the instructions at http://www.nltk.org/data. We'll assume that the data is installed to C:nltk_data on Windows, and /usr/share/nltk_data on Linux, Unix, or Mac OS X. How to do it... NLTK defines a list of data directories, or paths, in nltk.data.path. Our custom corpora must be within one of these paths so it can be found by NLTK. So as not to conflict with the official data package, we'll create a custom nltk_data directory in our home directory. Here's some Python code to create this directory and verify that it is in the list of known paths specified by nltk.data.path: >>> import os, os.path >>> path = os.path.expanduser('~/nltk_data') >>> if not os.path.exists(path): ... os.mkdir(path) >>> os.path.exists(path) True >>> import nltk.data >>> path in nltk.data.path True If the last line, path in nltk.data.path, is True, then you should now have a nltk_ data directory in your home directory. The path should be %UserProfile%nltk_data on Windows, or ~/nltk_data on Unix, Linux, or Mac OS X. For simplicity, I'll refer to the directory as ~/nltk_data. If the last line does not return True, try creating the nltk_data directory manually in your home directory, then verify that the absolute path is in nltk.data.path. It's essential to ensure that this directory exists and is in nltk.data.path before continuing. Once you have your nltk_data directory, the convention is that corpora reside in a corpora subdirectory. Create this corpora directory within the nltk_data directory, so that the path is ~/nltk_data/corpora. Finally, we'll create a subdirectory in corpora to hold our custom corpus. Let's call it cookbook, giving us the full path of ~/nltk_data/corpora/cookbook. Now we can create a simple word list file and make sure it loads. The source code for this article can be downloaded here. Consider a word list file called mywords.txt. Put this file into ~/nltk_data/corpora/cookbook/. Now we can use nltk.data.load() to load the file. >>> import nltk.data >>> nltk.data.load('corpora/cookbook/mywords.txt', format='raw') 'nltkn' We need to specify format='raw' since nltk.data.load() doesn't know how to interpret .txt files. As we'll see, it does know how to interpret a number of other file formats. How it works... The nltk.data.load() function recognizes a number of formats, such as 'raw', 'pickle', and 'yaml'. If no format is specified, then it tries to guess the format based on the file's extension. In the previous case, we have a .txt file, which is not a recognized extension, so we have to specify the 'raw' format. But if we used a file that ended in .yaml, then we would not need to specify the format. Filenames passed in to nltk.data.load() can be absolute or relative paths. Relative paths must be relative to one of the paths specified in nltk.data.path. 
The file is found using nltk.data.find(path), which searches all known paths combined with the relative path. Absolute paths do not require a search, and are used as is. There's more... For most corpora access, you won't actually need to use nltk.data.load, as that will be handled by the CorpusReader classes covered in the following recipes. But it's a good function to be familiar with for loading .pickle files and .yaml files, plus it introduces the idea of putting all of your data files into a path known by NLTK. Loading a YAML file If you put the synonyms.yaml file into ~/nltk_data/corpora/cookbook (next to mywords.txt), you can use nltk.data. load() to load it without specifying a format. >>> import nltk.data >>> nltk.data.load('corpora/cookbook/synonyms.yaml') {'bday': 'birthday'} This assumes that PyYAML is installed. If not, you can find download and installation instructions at http://pyyaml.org/wiki/PyYAML. See also In the next recipes, we'll cover various corpus readers, and then in the Lazy corpus loading recipe, we'll use the LazyCorpusLoader, which expects corpus data to be in a corpora subdirectory of one of the paths specified by nltk.data.path. Creating a word list corpus The WordListCorpusReader is one of the simplest CorpusReader classes. It provides access to a file containing a list of words, one word per line. Getting ready We need to start by creating a word list file. This could be a single column CSV file, or just a normal text file with one word per line. Let's create a file named wordlist that looks like this: nltk corpus corpora wordnet How to do it... Now we can instantiate a WordListCorpusReader that will produce a list of words from our file. It takes two arguments: the directory path containing the files, and a list of filenames. If you open the Python console in the same directory as the files, then '.' can be used as the directory path. Otherwise, you must use a directory path such as: 'nltk_data/corpora/ cookbook'. >>> from nltk.corpus.reader import WordListCorpusReader >>> reader = WordListCorpusReader('.', ['wordlist']) >>> reader.words() ['nltk', 'corpus', 'corpora', 'wordnet'] >>> reader.fileids() ['wordlist'] How it works... WordListCorpusReader inherits from CorpusReader, which is a common base class for all corpus readers. CorpusReader does all the work of identifying which files to read, while WordListCorpus reads the files and tokenizes each line to produce a list of words. Here's an inheritance diagram: When you call the words() function, it calls nltk.tokenize.line_tokenize() on the raw file data, which you can access using the raw() function. >>> reader.raw() 'nltkncorpusncorporanwordnetn' >>> from nltk.tokenize import line_tokenize >>> line_tokenize(reader.raw()) ['nltk', 'corpus', 'corpora', 'wordnet'] There's more... The stopwords corpus is a good example of a multi-file WordListCorpusReader. Names corpus Another word list corpus that comes with NLTK is the names corpus. It contains two files: female.txt and male.txt, each containing a list of a few thousand common first names organized by gender. >>> from nltk.corpus import names >>> names.fileids() ['female.txt', 'male.txt'] >>> len(names.words('female.txt')) 5001 >>> len(names.words('male.txt')) 2943 English words NLTK also comes with a large list of English words. There's one file with 850 basic words, and another list with over 200,000 known English words. 
>>> from nltk.corpus import words >>> words.fileids() ['en', 'en-basic'] >>> len(words.words('en-basic')) 850 >>> len(words.words('en')) 234936 Creating a part-of-speech tagged word corpus Part-of-speech tagging is the process of identifying the part-of-speech tag for a word. Most of the time, a tagger must first be trained on a training corpus. Let us take a look at how to create and use a training corpus of part-of-speech tagged words. Getting ready The simplest format for a tagged corpus is of the form "word/tag". Following is an excerpt from the brown corpus: The/at-tl expense/nn and/cc time/nn involved/vbn are/ber astronomical/ jj ./. Each word has a tag denoting its part-of-speech. For example, nn refers to a noun, while a tag that starts with vb is a verb. How to do it... If you were to put the previous excerpt into a file called brown.pos, you could then create a TaggedCorpusReader and do the following: >>> from nltk.corpus.reader import TaggedCorpusReader >>> reader = TaggedCorpusReader('.', r'.*.pos') >>> reader.words() ['The', 'expense', 'and', 'time', 'involved', 'are', ...] >>> reader.tagged_words() [('The', 'AT-TL'), ('expense', 'NN'), ('and', 'CC'), …] >>> reader.sents() [['The', 'expense', 'and', 'time', 'involved', 'are', 'astronomical', '.']] >>> reader.tagged_sents() [[('The', 'AT-TL'), ('expense', 'NN'), ('and', 'CC'), ('time', 'NN'), ('involved', 'VBN'), ('are', 'BER'), ('astronomical', 'JJ'), ('.', '.')]] >>> reader.paras() [[['The', 'expense', 'and', 'time', 'involved', 'are', 'astronomical', '.']]] >>> reader.tagged_paras() [[[('The', 'AT-TL'), ('expense', 'NN'), ('and', 'CC'), ('time', 'NN'), ('involved', 'VBN'), ('are', 'BER'), ('astronomical', 'JJ'), ('.', '.')]]] How it works... This time, instead of naming the file explicitly, we use a regular expression, r'.*.pos', to match all files whose name ends with .pos. We could have done the same thing as we did with the WordListCorpusReader, and pass ['brown.pos'] as the second argument, but this way you can see how to include multiple files in a corpus without naming each one explicitly. TaggedCorpusReader provides a number of methods for extracting text from a corpus. First, you can get a list of all words, or a list of tagged tokens. A tagged token is simply a tuple of (word, tag). Next, you can get a list of every sentence, and also every tagged sentence, where the sentence is itself a list of words or tagged tokens. Finally, you can get a list of paragraphs, where each paragraph is a list of sentences, and each sentence is a list of words or tagged tokens. Here's an inheritance diagram listing all the major methods: There's more... The functions demonstrated in the previous diagram all depend on tokenizers for splitting the text. TaggedCorpusReader tries to have good defaults, but you can customize them by passing in your own tokenizers at initialization time. Customizing the word tokenizer The default word tokenizer is an instance of nltk.tokenize.WhitespaceTokenizer. If you want to use a different tokenizer, you can pass that in as word_tokenizer. >>> from nltk.tokenize import SpaceTokenizer >>> reader = TaggedCorpusReader('.', r'.*.pos', word_ tokenizer=SpaceTokenizer()) >>> reader.words() ['The', 'expense', 'and', 'time', 'involved', 'are', ...] Customizing the sentence tokenizer The default sentence tokenizer is an instance of nltk.tokenize.RegexpTokenize with 'n' to identify the gaps. It assumes that each sentence is on a line all by itself, and individual sentences do not have line breaks. 
To customize this, you can pass in your own tokenizer as sent_tokenizer.

>>> from nltk.tokenize import LineTokenizer
>>> reader = TaggedCorpusReader('.', r'.*\.pos', sent_tokenizer=LineTokenizer())
>>> reader.sents()
[['The', 'expense', 'and', 'time', 'involved', 'are', 'astronomical', '.']]

Customizing the paragraph block reader
Paragraphs are assumed to be split by blank lines. This is done with the default para_block_reader, which is nltk.corpus.reader.util.read_blankline_block. There are a number of other block reader functions in nltk.corpus.reader.util, whose purpose is to read blocks of text from a stream. Their usage will be covered in more detail in the later recipe, Creating a custom corpus view, where we'll create a custom corpus reader.

Customizing the tag separator
If you don't want to use '/' as the word/tag separator, you can pass an alternative string to TaggedCorpusReader for sep. The default is sep='/', but if you want to split words and tags with '|', such as 'word|tag', then you should pass in sep='|'.

Simplifying tags with a tag mapping function
If you'd like to somehow transform the part-of-speech tags, you can pass in a tag_mapping_function at initialization, then call one of the tagged_* functions with simplify_tags=True. Here's an example where we lowercase each tag:

>>> reader = TaggedCorpusReader('.', r'.*\.pos', tag_mapping_function=lambda t: t.lower())
>>> reader.tagged_words(simplify_tags=True)
[('The', 'at-tl'), ('expense', 'nn'), ('and', 'cc'), ...]

Calling tagged_words() without simplify_tags=True would produce the same result as if you did not pass in a tag_mapping_function.

There are also a number of tag simplification functions defined in nltk.tag.simplify. These can be useful for reducing the number of different part-of-speech tags.

>>> from nltk.tag import simplify
>>> reader = TaggedCorpusReader('.', r'.*\.pos', tag_mapping_function=simplify.simplify_brown_tag)
>>> reader.tagged_words(simplify_tags=True)
[('The', 'DET'), ('expense', 'N'), ('and', 'CNJ'), ...]
>>> reader = TaggedCorpusReader('.', r'.*\.pos', tag_mapping_function=simplify.simplify_tag)
>>> reader.tagged_words(simplify_tags=True)
[('The', 'A'), ('expense', 'N'), ('and', 'C'), ...]
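A tagged corpus like this can feed directly into tagger training. The following is a minimal sketch, not from the book, showing how the tagged_sents() output of the reader created above could be used to train NLTK's UnigramTagger; the expected output assumes the single-sentence brown.pos file from the earlier example.

>>> from nltk.tag import UnigramTagger
>>> train_sents = reader.tagged_sents()
>>> tagger = UnigramTagger(train_sents)  # learns the most frequent tag for each word
>>> tagger.tag(['The', 'expense'])
[('The', 'AT-TL'), ('expense', 'NN')]

Words the tagger has never seen are tagged None, which is why unigram taggers are usually combined with backoff taggers when working with real corpora.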
3 simple steps to convert a mathematical formula to a model

Vedika Naik
13 Dec 2017
3 min read
[box type="note" align="" class="" width=""]The following is an excerpt from the book Scala for Machine Learning, Second Edition, written by Patrick R. Nicolas. The book will help you leverage Scala and Machine Learning to study and construct systems that can learn from data. [/box] This article aims to teach how a mathematical formula can be converted into a machine learning model in three easy steps: Stackable traits enable developers to follow a strict mathematical formalism while implementing a model in Scala. Scientists use a universally accepted template to solve mathematical problems: Declare the variables relevant to the problem. Define a model (equation, algorithm, formulas…) as the solution to the problem. Instantiate the variables and execute the model to solve the problem. Let's consider the example of the concept of kernel functions, a model that consists of the composition of two mathematical functions, and its potential implementation in Scala. Step 1 – variable declaration The implementation consists of wrapping (scope) the two functions into traits and defining these functions as abstract values. The mathematical formalism is as follows: The Scala implementation is represented here: type V = Vector[Double] trait F{ val f: V => V} trait G{ val g: V => Double } Step 2 – model definition The model is defined as the composition of the two functions. The stack of traits G, F describes the type of compatible functions that can be composed using the self-referenced constraint self: G with F: Formalism h = f o g The implementation is as follows: class H {self: G with F =>def apply(v:V): Double =g(f(v))} Step 3 – instantiation The model is executed once the variable f and g are instantiated. The formalism is as follows: The implementation is as follows: val h = new H with G with F { val f: V=>V = (v: V) =>v.map(exp(_)) val g: V => Double = (v: V) =>v.sum Note: Lazy value trigger In the preceding example, the value of h(v) = g(f(v)) can be automatically computed as soon as g and f are initialized, by declaring h as a lazy value. Clearly, Scala preserves the formalism of mathematical models, making it easier for scientists and developers to migrate their existing projects written in scientific oriented languages such as R. Note: Emulation of R Most data scientists use the language R to create models and apply learning strategies. They may consider Scala as an alternative to R in some cases, as Scala preserves the mathematical formalism used in models implemented in R. In this article we explained the concept preservation of mathematical formalism. This needs to be further extended to dynamic creation of workflows using traits. Design patterns, such as the cake pattern, can be used to do this. If you enjoyed this excerpt, be sure to check out the book Scala Machine Learning, Second Edition to gain solid foundational knowledge in machine learning with Scala.  
Privacy Experts discuss GDPR, its impact, and its future on Beth Kindig’s Tech Lightning Rounds Podcast

Savia Lobo
28 May 2019
9 min read
User data was being compromised even before the Cambridge Analytica scandal was brought to light. On May 25th, 2018, the GDPR came into force in the European Union for data protection and privacy, giving individuals much more power over their personal data and simplifying the regulatory environment for international businesses. GDPR recently completed one year, and since its inception it has helped drive better data privacy regulation. These privacy regulations divide companies into data processors and data controllers. Any company that has customers in the EU must comply, regardless of where the company is located.

In episode 6 of Tech Lightning Rounds, Beth Kindig of Intertrust speaks to experts from three companies who have implemented GDPR:

Robin Andruss, the Director of Privacy at Twilio, a leader in global communications that is uniquely positioned to handle data from text messaging sent inside its applications.
Tomas Sander of Intertrust, the company that invented digital rights management and has been advocating for privacy for nearly 30 years.
Katryna Dow, CEO of Meeco, a startup that introduces the concept of data control for digital life.

Robin Andruss on Twilio's stance on privacy
Twilio provides messaging, voice, and video inside mobile and web applications for nearly 40,000 companies including Uber, Lyft, Yelp, Airbnb, Salesforce and many more. "Twilio is one of the leaders in the communications platform as a service space, where we power APIs to help telecommunication services like SMS and texting, for example. A good example is when you order a Lyft or an Uber and you'll text with a Uber driver and you'll notice that's not really their phone number. So that's an example of one of our services", Andruss explains.

Twilio has adopted "binding corporate rules", a global framework around privacy. He says that anyone who has been in the privacy space for a long time knows it is actually very challenging to reach this standard. Organizations need to work with a law firm or consultancy to make sure they are meeting a high bar of privacy and actually have their privacy regulations and obligations agreed to and approved by their lead DPA, the Data Protection Authority in the EU, which in Twilio's case is the Irish DPC. "We treat everyone who uses Twilio services across the board the same, our corporate rules. One rule, we don't have a different one for the US or the EU. So I'd say that they are getting GDPR level of privacy standards when you use Twilio", Andruss said.

Talking about the California Consumer Privacy Act (CCPA), Andruss said that it is mostly targeted towards advertising companies and companies that might sell data about individuals and make money off of it, like Intelius or Spokeo or similar services. Beth asked Andruss how concerned the rest of us should be about data and what companies can do internally to improve privacy measures, to which he said, "just think about, really, what you're putting out there, and why, and this third party you're giving your information to when you are giving it away".

Twilio's "no-shenanigans" and "Wear your customers' shoes" approach to privacy
Twilio's "no-shenanigans" approach to privacy encourages employees to do the right thing for their end users and customers. Andruss explained this with an example: "You might be in a meeting, and you can say, 'Is that the right thing? Do we really wanna do that?
Is that the right thing to do for our customers, or is that shenanigany, does it not feel right?'" The "Wear your customers' shoes" approach means that when Twilio builds a product or thinks about something, it thinks about how to do the right thing for its customers. This builds customers' trust that the organization really cares about privacy and wants to do the right thing while they use Twilio's tools and services.

Tomas Sander on privacy pre-GDPR and post-GDPR
Tomas Sander started off by explaining the basics of GDPR, what it does, how it can help users, and so on. He also cleared a common doubt that most people have about the reach of the EU's GDPR. He said, "One of the main things that the GDPR has done is that it has an extraterritorial reach. So GDPR not only applies to European companies, but to companies worldwide if they provide goods and services to European citizens".

GDPR has "made privacy a much more important issue for many organizations". GDPR carries huge fines for non-compliance, and that has contributed to it being taken seriously by companies globally. Because of data breaches, "security has become a boardroom issue for many companies. Now, privacy has also become a boardroom issue", Sander adds. He said that GDPR has been extremely effective in setting the privacy debate worldwide. Although it's a regulation in Europe, it's been extremely effective through its global impact on organizations and on the thinking of policymakers about what they want to do about privacy in their countries.

Despite that positive impact, Sander said that data behemoths such as Google and Facebook are still collecting data from many, many different sources, aggregating it about users, and creating detailed profiles, usually for the purpose of selling advertising, so for profit. This is why the jury is still out. "And this practice of taking all this different data, from location data to smart home data, to their social media data and so on and using them for sophisticated user profiling, that practice hasn't recognizably changed yet", he added. Sander said he "recently heard data protection commissioners speak at a privacy conference in Washington, and they believe that we're going to see some of these investigations conclude this summer. And hopefully then there'll be some enforcement, and some of the commissioners certainly believe that there will be fines".

Sander's suggestion for users who are not much into tech is, "I think people should be deeply concerned about privacy." He said these companies can access your web browsing activities, your searches, location data, the data shared on social media, facial recognition from images, and also, these days, IoT and smart home data that give people intimate insights into what's happening in your home. With this data, the company can keep tabs on what you do and perhaps create a user profile. "A next step they could take is that they don't only observe what you do and predict what the next step is you're going to do, but they may also try to manipulate and influence what you do. And they would usually do that for profit motives, and that is certainly a major concern. So people may not even know, may not even realize, that they're being influenced". This is a major concern because it really questions "our individual freedom about… It really becomes about democracy".
Sander also talked about an incident that took place in Germany, where the far-right party "Alternative For Germany" ("Alternative für Deutschland") was able to use a Facebook feature created for advertisers to achieve the best result in a federal election for any far-right party in Germany since World War 2. The feature used here was "look-alike" audiences. Facebook helped this party analyze the characteristics of the 300,000 users who had liked "Alternative For Germany". From these users, it created a "look-alike" audience of another 300,000 users with similar characteristics to those who had already liked the party, and then specifically targeted ads to this group.

Katryna Dow on getting people digitally aware
Dow thinks "the biggest challenge right now is that people just don't understand what goes on under the surface". She explains how simply sharing a picture of a child playing in a park can impact the child's credit rating in the future. She says, "People don't understand the consequences of something that I do right now, that's digital, and what it might impact some time in the future". She also goes on to explain how to help people make a more informed choice around the services they want to use, or argue for better rights in terms of those services, so those consequences don't happen. Dow also discusses one of the principles of the GDPR, which is designing privacy into applications or websites as the foundation of the design, rather than adding privacy as an afterthought.

Beth asked if GDPR, which introduces some level of control, is effective, to which Dow replied, "It's early days. It's not working as intended right now." Dow further explained, "the biggest problem right now is the UX level is just not working. And organizations that have been smart in terms of creating enormous amounts of friction are using that to their advantage." "They're legally compliant, but they have created that compliance burden to be so overwhelming, that 'I agree' or just anything to get this screen out of the way is driving the behavior", Dow added. She says that a part of GDPR is privacy by design, but what we haven't seen is that surface to the UX level. "And I think right now, it's just so overwhelming for people to even work out, 'What's the choice?' What are they saying yes to? What are they saying no to? So I think, the underlying components are there from a legal framework. Now, how do we move that to what we know is the everyday use case, which is how you interact with those frameworks", Dow further added.

To listen to this podcast and know more about this in detail, visit Beth Kindig's official website.
Github Sponsors: Could corporate strategy eat FOSS culture for dinner?
Mozilla and Google Chrome refuse to support Gab's Dissenter extension for violating acceptable use policy
SnapLion: An internal tool Snapchat employees abused to spy on user data
What the US-China tech and AI arms race means for the world - Frederick Kempe at Davos 2019

Sugandha Lahoti
24 Jan 2019
6 min read
Atlantic Council CEO Frederick Kempe spoke at the World Economic Forum (WEF) in Davos, Switzerland. In his presentation, Future Frontiers of Technology Control, he talked about the cold war between the US and China and why the two countries need to cooperate rather than compete in the tech arms race. He began his presentation by posing a question set forth by former US National Security Advisor Stephen Hadley: "Can the incumbent US and insurgent China become strategic collaborators and strategic competitors in this tech space at the same time?"

Read also: The New AI Cold War Between China and the USA

Kempe's three framing arguments

Geopolitical Competition
The fusion of tech breakthroughs blurring the lines between the physical, digital, and biological spaces is reaching an inflection point, and it is already clear that they will usher in a revolution that will determine the shape of the global economy. It will also determine which nations and political constructs may assume the commanding heights of global politics in the coming decade.

Technological superiority
Over the course of history, societies that dominated economic innovation and progress have dominated in international relations, from military superiority to societal progress and prosperity. On balance, technological progress has contributed to higher standards of living in most parts of the world; however, the disproportionate benefit goes to first movers.

Commanding Heights
The technological arms race for supremacy in the fourth industrial revolution has essentially become a two-horse contest between the United States and China. We are in the early stages of this race, but how it unfolds and is conducted will do much to shape global human relations. The shift in 2018 in US-China relations from a period of strategic engagement to greater strategic competition has also significantly accelerated the tech arms race.

China vs the US: Why China has the edge
It was Vladimir Putin, President of the Russian Federation, who said that "The one who becomes the leader in Artificial Intelligence, will rule the world." In 2017, DeepMind's AlphaGo defeated a Chinese master in Go, a traditional Chinese game. Following this defeat, China launched an ambitious roadmap, called the Next Generation AI Plan. The goal was to become the global leader in AI by 2030 in theory, technology, and application. On current trajectories, in the four primary areas of AI over the next 5 years, China will emerge the winner of this new technology race.

Kempe also quotes Kai-Fu Lee, author of the book AI Superpowers, who argues that harnessing the power of AI today (the electricity of the 21st century) requires abundant data, hungry entrepreneurs, AI scientists, and an AI-friendly policy. He believes that China has the edge in all of these. AI has moved from out-of-the-box research, where the US has the expertise, to actual implementation, where China has the edge. Per Kai-Fu Lee, China already has the edge in entrepreneurship, data, and government support, and is rapidly catching up to the U.S. in expertise. The world has moved from the age of world-leading expertise, the US's strong suit, to the age of data, where China wins hands down. Economists call China the Saudi Arabia of data, and with that as the fuel for AI, it has an enormous advantage. The Chinese government, without privacy restrictions, can gain and use data in a manner that is out of reach of any democracy.
Kempe concludes that the nature of this technological arms contest may favor insurgent China rather than the incumbent US.

What are the societal implications of this tech cold war?
He also touched upon the societal implications of AI and the cold war between the US and China. A number of jobs will be lost by 2030. Quoting from Kai-Fu Lee's book, Kempe says that job displacement caused by artificial intelligence and advanced robotics could displace up to 54 million US workers, about 30% of the US labor force. It could also displace up to 100 million Chinese workers, about 12% of the Chinese labor force. What is the way forward, with these huge societal implications of a bilateral race underway? Kempe sees three possibilities.

A sloppy Status Quo
A status quo where China and the US will continue to cooperate but increasingly view each other with suspicion. They will manage their rising differences and distrust imperfectly, never bridging them entirely, but also not burning bridges, whether between researchers, corporations, or others.

Techno Cold War
China and the US turn the global tech contest into more of a zero-sum battle for global domination. They organize themselves in a manner that separates their tech sectors from each other and ultimately divides up the world.

Collaborative Future - the one we hope for
Nicholas Thompson and Ian Bremmer argued in a Wired interview that despite the two countries' societal differences, the US should wrap China in a tech embrace. The two countries should work together to establish international standards to ensure that the algorithms governing people's lives and livelihoods are transparent and accountable. They should recognize that while the geopolitics of technological change is significant, even more important will be the challenges AI poses to all societies across the world in terms of job automation and the social disruptions that may come with it. It may sound utopian to expect the US and China to cooperate in this manner, but this is what we should hope for. To do otherwise would be self-defeating and at the cost of others in the global community, which needs our best thinking to navigate the challenges of the fourth industrial revolution.

Kempe concludes his presentation with a quote by Henry Kissinger, former US Secretary of State and National Security Advisor: "We're in a position in which the peace and prosperity of the world depend on whether China and the US can find a method to work together, not always in agreement, but to handle our disagreements... This is the key problem of our time."

Note: All images in this article are taken from Frederick Kempe's presentation.
We must change how we think about AI, urge AI founding fathers
Does AI deserve to be so Overhyped?
Alarming ways governments are using surveillance tech to watch you
MongoDB’s CTO Eliot Horowitz on what’s new in MongoDB 4.2, Ops Manager, Atlas, and more

Bhagyashree R
13 Dec 2019
9 min read
At the MongoDB.local London event that happened in September this year, Eliot Horowitz, the CTO and Co-Founder of MongoDB, took to the stage to talk about the latest features in MongoDB 4.2. He also discussed the updates to Ops Manager and MongoDB Atlas, and new cloud services including integrated full-text search, the Realm development platform, and MongoDB Data Lake. MongoDB.local is a one-day educational conference that brings together people who develop MongoDB and its ecosystem, as well as fellow MongoDB users. This is where you can get a deeper knowledge of the latest in MongoDB, tools, and best practices directly from the MongoDB experts.

Further Learning: This article lists the various features that have landed in MongoDB 4.2. To get a practical understanding of administering database applications both on-premises and in the cloud, check out our book Mastering MongoDB 4.x - Second Edition by Alex Giamas.

Exciting features in MongoDB 4.2

Distributed transactions
MongoDB 4.0 came with support for multi-document transactions on replica sets. This support was extended in MongoDB 4.2 by introducing distributed transactions. These add support for multi-document transactions on sharded clusters and also include the existing support for multi-document transactions on replica sets. Distributed transactions have the same syntax and semantics as the replica set transactions. They are fully ACID compliant and have conversational syntax. Another important update is that there is now no limit to how big a transaction can be. "It is just a matter of how much hardware you have and what the hardware can handle," Horowitz adds.

Also, previously the sharding system did not allow changing the shard key, as it often meant moving a document from one shard to another. Starting with MongoDB 4.2, you are allowed to change the shard key, and that too very easily. Now, if you change the value of a shard key and a document is required to be moved from one shard to another, MongoDB will automatically wrap that update behind the scenes inside of a transaction. This is one step towards ensuring that there is no "difference between a sharded MongoDB cluster and a replica set," Horowitz shared. Another function that Horowitz talked about was global cluster locale reassignment. For instance, suppose you have geo zone sharding with some data residing in Europe and some other data in the US. When the users move, you can just change the value of their location field and that data will be automatically moved from Europe to the US using a transaction.

Retryable reads and writes
Retryable reads and writes enable the MongoDB drivers to automatically retry certain operations if they encounter network errors or if they are not able to find a healthy primary in the replica set or sharded cluster. Starting with MongoDB 4.2, this feature is enabled by default. One of the main goals of this feature is ensuring that whenever there is some change in the infrastructure, whether for planned maintenance or known crashes, the application code shouldn't care or be affected. Explaining through an example, he shared, "You have got a web page that does 20 different database operations.
Rather than having to reload the entire thing, rather than having to wrap the entire web page in some sort of loop the driver under the covers can just say this I am going to retry this operation.” He adds, “So if a write fails it will retry that write automatically and will have a contract with the server to guarantee that every write happens once and only once.” Much more expressive updates MongoDB’s query language is now much richer and expressive with the support for aggregations and other modern use-cases including geo-based search, graph search, and text search. You can do things like sums, handle arrays, and other math directly through an update statement. “Let’s imagine you’ve got a document and all you want to do is to set the value of A to the value of B+C in every document. Previously, you couldn’t do that and now you can do very simple arithmetic in MongoDB.” On-demand materialized views The MongoDB aggregation pipeline, a framework for data aggregation, consists of stages. Each stage is responsible for transforming a document as they pass through the pipeline. MongoDB 4.2 introduces a new stage called ‘$merge’ that allows you to create collections based on aggregation and update those created collections efficiently. The $out stage already allows creating collections based on an aggregation. It takes the results of an aggregation and puts it into a new collection. But the difference is that it replaces the collections entire contents with the new results. As it regenerates the entire collection every time, it ends up consuming a lot of CPU and IO. The new $merge feature can incorporate the pipeline results into an existing output collection rather than fully replacing the collection. This enables users to create on-demand materialized views, where the content of the output collection is perennially updated “maybe every minute, every hour, or maybe every day depending on the use case.” Wildcard indexes In MongoDB 4.2, we have wildcard indexes that let you index an entire document or a subset of a document. It is introduced to support queries against unknown or arbitrary fields. Horowitz explains, “Previously, you were required to either add an index for every attribute you care about or put these into an array...With wild card indexes, you can actually just say “hey index the entire document or index this entire subset of the document.” What will happen is we will actually index everything in there so you can just do any query that you want.” However, keep in mind that wildcard indexes are not really designed to replace workload-based index planning. It is suitable for cases when you have polymorphic patterns in your data. Examples of data containing polymorphic pattern include product catalogs, e-commerce, social data, and IoT applications. Modern operations Along with offering such great features, it is also important for a database to provide developers a great operational experience. It should have great availability, a powerful monitoring and alerting system, backup, self-service, and APIs. To manage MongoDB we have two options: MongoDB Ops Manager and MongoDB Atlas. MongoDB Ops Manager MongoDB Ops Manager is the “best way to run MongoDB on-premises.” Its backup system offers great features such as point-in-time restore and queryable snapshots. In previous versions, however, it was a complex system and in many cases expensive to run. Starting with MongoDB 4.2, it was completely overhauled to be much simpler. 
Now, there is no concept of “heads.” This release also introduces a new Kubernetes operator for Ops Manager. On-premise users are moving to private cloud and for that, they mainly rely on Kubernetes. This is why you now have the Kubernetes operator for Ops Manager. It will enable you to directly control the Ops Manager through your Kubernetes interfaces. MongoDB Atlas MongoDB Atlas is a fully-managed MongoDB as a service. It now has integration with Terraform, a tool used for building, changing, and versioning infrastructure. There is also a new feature called Atlas Auto Scaling for fully-automated capacity management. Once you enable the feature, Atlas will monitor resource utilization metrics in real-time and automatically scale up or down your VM. In terms of security, MongoDB Atlas is now ISO 27001 certified and PCI compliant. It also supports field-level encryption (FLE) beta. This enables applications to encrypt fields in documents before transmitting data to the server. This encryption happens on the client-side and is completely transparent to the developers. Another key update in this release is the introduction of MongoDB Atlas Full-Text Search (Beta). Atlas now has a rich-text search functionality against your fully managed MongoDB databases. Horowitz explains, “Today, you typically have to take in MongoDB and synchronize it to some other system (such as Elasticsearch) and under those systems is Apache Lucene.” The team decided to remove this “middleman” to let users go “straight from MongoDB to Lucene.” Horowitz also talked about MongoDB Atlas Data Lake that enables you to quickly query data in any format on Amazon S3 using the MongoDB Query Language (MQL). It lets you run regular MongoDB queries against data in Amazon S3. It supports any file format including JSON, BSON, CSV, TSV, Avro, and Parquet formats. MongoDB Realm In May this year, MongoDB acquired Realm, a database for mobile applications. Horowitz gave some insight into what future plans he has for Realm. “MongoDB is investing in a lot of the things that Realm users have been asking for a long time or taking a lot of the resources we have and making sure that we can accelerate the core realm roadmap as fast as possible.” Among the new features that RealmDB will get are new data types for unstructured data such as Dicts, Sets, Any/Mixed type for polymorphic data. It will have cascading deletes, inheritance, analytics and transformational queries, support for more platforms. Horowitz plans to integrate Realm more tightly with MongoDB and together they will be called MongoDB Realm.  It will be “the best way to build data-intensive applications anywhere.” This article walked you through the new features in MongoDB 4.2, Ops Manager, Atlas, and much more presented by Eliot Horowitz in his MongoDB.local talk. Check out our book Mastering MongoDB 4.x - Second Edition by Alex Giamas to become a successful MongoDB expert.  This book dives into niche areas of managing databases (such as modeling and querying databases) along with various administration techniques in MongoDB, and much more. MongoDB is partnering with Alibaba Homebrew removes MongoDB from core formulas MongoDB withdraws controversial Server Side Public License from the Open Source Initiative’s approval process
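As a rough illustration of two of the 4.2 features described above, expressive (pipeline-style) updates and wildcard indexes, here is a small PyMongo sketch; the connection string, database, and collection names are placeholders, and it assumes a MongoDB 4.2+ server with PyMongo 3.9 or later.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
coll = client["demo"]["products"]

# Expressive update: set a = b + c in every document using an aggregation pipeline
coll.update_many({}, [{"$set": {"a": {"$add": ["$b", "$c"]}}}])

# Wildcard index: index every field under the polymorphic "attributes" subdocument
coll.create_index([("attributes.$**", 1)])

Retryable writes are on by default in recent drivers, so a transient primary failover during the update_many call above would be retried automatically without any application-level retry code.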
PyTorch 1.0 preview release is production ready with torch.jit, c10d distributed library, C++ API

Aarthi Kumaraswamy
02 Oct 2018
4 min read
Back in May, the PyTorch team shared their roadmap for the PyTorch 1.0 release and highlighted that this most anticipated version will not only continue to provide stability and simplicity of use to its users, but will also make it production ready while offering a hassle-free migration experience. Today, Facebook announced the release of PyTorch 1.0 RC1. The official announcement states, "PyTorch 1.0 accelerates the workflow involved in taking breakthrough research in artificial intelligence to production deployment. With deeper cloud service support from Amazon, Google, and Microsoft, and tighter integration with technology providers ARM, Intel, IBM, NVIDIA, and Qualcomm, developers can more easily take advantage of PyTorch's ecosystem of compatible software, hardware, and developer tools. The more software and hardware that is compatible with PyTorch 1.0, the easier it will be for AI developers to quickly build, train, and deploy state-of-the-art deep learning models."

PyTorch is an open-source Python-based deep learning framework which provides powerful GPU acceleration. PyTorch is known for advanced indexing and functions, imperative style, integration support and API simplicity. This is one of the key reasons why developers prefer PyTorch for research and hackability. On the downside, it has struggled with adoption in production environments. The PyTorch team acknowledged this in their roadmap and has worked on improving this aspect significantly in PyTorch 1.0, not just by improving the library but also by enriching its ecosystem through partnerships with key software and hardware vendors.

"One of its biggest downsides has been production-support. What we mean by production-support is the countless things one has to do to models to run them efficiently at massive scale:
exporting to C++-only runtimes for use in larger projects
optimizing mobile systems on iPhone, Android, Qualcomm and other systems
using more efficient data layouts and performing kernel fusion to do faster inference (saving 10% of speed or memory at scale is a big win)
quantized inference (such as 8-bit inference)",
stated the PyTorch team in their roadmap post.

Below are some key highlights of this major milestone for PyTorch.

JIT
The JIT is a set of compiler tools for bridging the gap between research in PyTorch and production. It includes a language called Torch Script (a subset of Python), and two ways (Tracing mode and Script mode) in which existing code can be made compatible with the JIT. Torch Script code can be aggressively optimized and it can be serialized for later use in the new C++ API, which doesn't depend on Python at all.

torch.distributed new "C10D" library
The torch.distributed package and torch.nn.parallel.DistributedDataParallel module are backed by the new "C10D" library. The main highlights of the new library are:
C10D is performance driven and operates entirely asynchronously for all backends: Gloo, NCCL, and MPI.
Significant Distributed Data Parallel performance improvements, especially for slower networks like Ethernet-based hosts.
Adds async support for all distributed collective operations in the torch.distributed package.
Adds send and recv support in the Gloo backend.

C++ Frontend [API Unstable]
The C++ frontend is a pure C++ interface to the PyTorch backend that follows the API and architecture of the established Python frontend. It is intended to enable research in high performance, low latency and bare metal C++ applications.
It provides equivalents to torch.nn, torch.optim, torch.data and other components of the Python frontend. The C++ frontend is marked as "API Unstable" as part of PyTorch 1.0. This means it is ready to be used for building research applications, but still has some open construction sites that will stabilize over the next month or two. In other words, it is not ready for use in production yet.

N-dimensional empty tensors, a collection of new operators inspired by NumPy and SciPy, and new distributions such as Weibull, negative binomial, and multivariate log gamma have also been introduced. There have also been a lot of breaking changes, bug fixes, and other improvements made to PyTorch 1.0. For more details, read the official announcement and also the official release notes for PyTorch.

What is PyTorch and how does it work?
Build your first neural network with PyTorch [Tutorial]
Is Facebook-backed PyTorch better than Google's TensorFlow?
Can a production-ready Pytorch 1.0 give TensorFlow a tough time?
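To make the JIT description above concrete, here is a minimal sketch (not from the announcement) of the two ways existing code can be made JIT-compatible, tracing and scripting, and of serializing the result for the Python-free C++ runtime; the model and file name are made up for illustration.

import torch

class TinyNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(4, 2)

    def forward(self, x):
        return torch.relu(self.fc(x))

model = TinyNet()

# Tracing mode: run an example input through the model and record the operations
traced = torch.jit.trace(model, torch.randn(1, 4))

# Script mode: compile a function written in Torch Script (a Python subset) directly
@torch.jit.script
def add_relu(x, y):
    return torch.relu(x + y)

# Serialize the traced module; the resulting file can be loaded from the C++ API
traced.save("tiny_net.pt")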
Training RNNs for Time Series Forecasting

Aaron Lazar
22 Nov 2017
7 min read
[box type="note" align="" class="" width=""]This tutorial has been taken from the book Practical Time Series Analysis by Dr. PKS Prakash and Dr. Avishek Pal.[/box] RNNs are notoriously difficult to be trained. Vanilla RNNs suffer from vanishing and exploding gradients that give erratic results during training. As a result, RNNs have difficulty in learning long-range dependencies. For time series forecasting, going too many timesteps back in the past would be problematic. To address this problem, Long Short Term Memory (LSTM) and Gated Recurrent Unit (GRU), which are special types of RNNs, have been introduced. We will use LSTM and GRU to develop the time series forecasting models. Before that, let's review how RNNs are trained using Backpropagation Through Time (BPTT), a variant of the backpropagation algorithm. We will find out how vanishing and exploding gradients arise during BPTT. Let's consider the computational graph of the RNN that we have been using for the time series forecasting. The gradient computation is shown in the following figure: Figure: Back-Propagation Through Time for deep recurrent neural network with p timesteps For weights U, there is one path to compute the partial derivative, which is given as follows: However, due to the sequential structure of the RNN, there are multiple paths connecting the weights and the loss and the loss. Hence, the partial derivative is the sum of partial derivatives along the individual paths that start at the loss node and ends at every timestep node in the computational graph: The technique of computing the gradients for the weights by summing over paths connecting the loss node and every timestep node is backpropagation through time, which is a special case of the original backpropagation algorithm. The problem of vanishing gradients in long-range RNNs is due to the multiplicative terms in the BPTT gradient computations. Now let's examine a multiplicative term from one of the preceding equations. The gradients along the computation path connecting the loss node and ith timestep is This chain of gradient multiplication is ominously long to model long-range dependencies and this is where the problem of vanishing gradient arises. The activation function for the internal state [0,1) is either a tanh or sigmoid. The first derivative of tanh is which is bound in (0,¼]. In the sigmoid function, the first-order derivative is which is bound in si . Hence, the gradients W  are positive fractions. For long-range timesteps, multiplying these fractional gradients diminishes the final product to zero and there is no gradient flow from a long-range timestep. Due to the negligibly low values of the gradients, the weights do not update and hence the neurons are said to be saturated. It is noteworthy that U, ∂si/∂s(i-1), and ∂si/∂W are matrices and therefore the partial derivatives t and st are computed on matrices. The final output is computed through matrix multiplications and additions. The first derivative on matrices is called Jacobian. If any one element of the Jacobian matrix is a fraction, then for a long-range RNN, we would see a vanishing gradient. On the other hand, if an element of the Jacobian is greater than one, the training process suffers from exploding gradients. Solving the long-range dependency problem We have seen in the previous section that it is difficult for vanilla RNNs to effectively learn long-range dependencies due to vanishing and exploding gradients. 
To address this issue, the Long Short Term Memory network was developed in 1997 by Sepp Hochreiter and Jürgen Schmidhuber. The Gated Recurrent Unit was introduced in 2014 and gives a simpler version of LSTM. Let's review how LSTM and GRU solve the problem of learning long-range dependencies.

Long Short Term Memory (LSTM)
LSTM introduces additional computations in each timestep. However, it can still be treated as a black box unit that, for timestep t, returns the internal state st, and this is forwarded to the next timestep. However, internally, these vectors are computed differently. LSTM introduces three new gates: the input (it), forget (ft), and output (ot) gates. Every timestep also has an internal hidden state (gt) and an internal memory (ct). These new units are computed as follows, where ∘ denotes element-wise multiplication:

it = σ(Wi xt + Ui s(t-1))
ft = σ(Wf xt + Uf s(t-1))
ot = σ(Wo xt + Uo s(t-1))
gt = tanh(Wg xt + Ug s(t-1))
ct = c(t-1) ∘ ft + gt ∘ it
st = tanh(ct) ∘ ot

Now, let's understand the computations. The gates are generated through sigmoid activations that limit their values within [0, 1]. Hence, these act as gates by letting out only a fraction of the value when multiplied by another variable. The input gate it controls the fraction of the newly computed input to keep. The forget gate ft determines the effect of the previous timestep, and the output gate ot controls how much of the internal state to let out. The internal hidden state gt is calculated from the input to the current timestep and the output of the previous timestep. Note that this is the same as computing the internal state in a vanilla RNN. ct is the internal memory unit of the current timestep, and it considers the memory from the previous step, downsized by the forget gate, and the effect of the internal hidden state, mixed with the input gate. Finally, st would be passed to the next timestep and is computed from the current internal memory and the output gate. The input, forget, and output gates are used to selectively include the previous memory and the current hidden state that is computed in the same manner as in vanilla RNNs. The gating mechanism of LSTM allows memory transfer over long-range timesteps.

Gated Recurrent Units (GRU)
GRU is simpler than LSTM and has only two internal gates, namely, the update gate (zt) and the reset gate (rt). The computations of the update and reset gates are as follows:

zt = σ(Wz xt + Uz s(t-1))
rt = σ(Wr xt + Ur s(t-1))

The state st of the timestep t is computed using the input xt, the state s(t-1) from the previous timestep, the update gate, and the reset gate:

ht = tanh(Wh xt + Uh (s(t-1) ∘ rt))
st = (1 - zt) ∘ ht + zt ∘ s(t-1)

The update gate, being computed by a sigmoid function, determines how much of the previous step's memory is to be retained in the current timestep. The reset gate controls how to combine the previous memory with the current step's input.

Compared to LSTM, which has three gates, GRU has two. It does not have the output gate and the internal memory, both of which are present in LSTM. The update gate in GRU determines how to combine the previous memory with the current memory, and combines the functionality achieved by the input and forget gates of LSTM. The reset gate, which combines the effect of the previous memory and the current input, is directly applied to the previous memory. Despite a few differences in how memory is transmitted along the sequence, the gating mechanisms in both LSTM and GRU are meant to learn long-range dependencies in data.

So, which one do we use - LSTM or GRU?
Both LSTM and GRU are capable of handling memory over long RNNs. However, a common question is which one to use?
LSTM has long been preferred as the first choice for language models, which is evident from its extensive use in language translation, text generation, and sentiment classification. GRU has the distinct advantage of fewer trainable weights compared to LSTM. It has been applied to tasks where LSTM previously dominated. However, empirical studies show that neither approach outperforms the other in all tasks. Tuning model hyperparameters, such as the dimensionality of the hidden units, improves the predictions of both. A common rule of thumb is to use GRU when there is less training data, as it requires fewer trainable weights. LSTM proves to be effective for large datasets, such as the ones used to develop language translation models. This article is an excerpt from the book Practical Time Series Analysis. For more techniques, grab the book now!
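As a minimal, self-contained sketch of how an LSTM (or a GRU, by swapping the layer) can be trained for one-step-ahead time series forecasting, the following uses Keras on a synthetic sine wave; it is not the book's code, and the window size, layer width, and training settings are arbitrary choices for illustration.

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# Build sliding windows: predict the next value from the previous 20 values
series = np.sin(np.linspace(0, 100, 2000))
window = 20
X = np.array([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]
X = X.reshape(-1, window, 1)  # (samples, timesteps, features)

model = Sequential([
    LSTM(32, input_shape=(window, 1)),  # replace with GRU(32) to compare the two units
    Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, batch_size=64, verbose=0)

# One-step-ahead forecast from the last observed window
forecast = model.predict(X[-1:])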
SQL Server Analysis Services – Administering and Monitoring Analysis Services

Packt
20 Dec 2013
19 min read
(For more resources related to this topic, see here.) If your environment has only one or a handful of SSAS instances, they can be managed by the same database administrators managing SQL Server and other database platforms. In large enterprises, there could be hundreds of SSAS instances managed by dedicated SSAS administrators. Regardless of the environment, you should become familiar with the configuration options as well as troubleshooting methodologies. In large enterprises, you might also be required to automate these tasks using the Analysis Management Objects (AMO) code. Analysis Services is a great tool for building business intelligence solutions. However, much like any other software, it does have its fair share of challenges and limitations. Most frequently encountered enterprise business intelligence system goals include quick provision of relevant data to the business users and assuring excellent query performance. If your cubes serve a large, global community of users, you will quickly learn that SSAS is optimized to run a single query as fast as possible. Once users send a multitude of heavy queries in parallel, you can expect to see memory, CPU, and disk-related performance counters to quickly rise, with a corresponding increase in query execution duration which, in turn, worsens user experience. Although you could build aggregations to improve query performance, doing so will lengthen cube processing time, and thereby, delay the delivery of essential data to decision makers. It might also be tempting to consider using ROLAP storage mode in lieu of MOLAP so that processing times are shorter, but MOLAP queries usually outperform ROLAP due to heavy compression rates. Hence, figuring out the right storage mode and appropriate level of aggregations is a great balancing act. If you cannot afford using ROLAP, and query performance is paramount to successful cube implementation, you should consider scaling your solution. You have two options for scaling, given as follows: Scaling up: This option means purchasing servers with more memory, more CPU cores, and faster disk drives Scaling out: This option means purchasing several servers of approximately the same capacity and distributing the querying workload across multiple servers using a load balancing tool SSAS lends itself best to the second option—scaling out. Later in this article you will learn how to separate processing and querying activities and how to ensure that all servers in the querying pool have the same data. SSAS instance configuration options All Analysis Services configuration options are available in the msmdsrv.ini file found in the config folder under the SSAS installation directory. Instance administrators can also modify some, but not all configuration properties, using SQL Server Management Studio (SSMS). SSAS has a multitude of properties that are undocumented—this normally means that such properties haven't undergone thorough testing, even by the software's developers. Hence, if you don't know exactly what the configuration setting does, it's best to leave the setting at default value. Even if you want to test various properties on a sandbox server, make a copy of the configuration file prior to applying any changes. How to do it... To modify the SSAS instance settings using the configuration file, perform the following steps: Navigate to the config folder within your Analysis Services installation directory. By default, this will be C:\Program Files\Microsoft SQL Server\MSAS11.instance_name\OLAP\Config. 
Open the msmdsrv.ini file using Notepad or another text editor of your choice. The file is in the XML format, so every property is enclosed in opening and closing tags. Search for the property of interest, modify its value as desired, and save the changes. For example, in order to change the upper limit of the processing worker threads, you would look for the <ThreadPool><Process><MaxThreads> tag sequence and set the values as shown in the following excerpt from the configuration file: <Process>       <MinThreads>0</MinThreads>       <MaxThreads>250</MaxThreads>      <PriorityRatio>2</PriorityRatio>       <Concurrency>2</Concurrency>       <StackSizeKB>0</StackSizeKB>       <GroupAffinity/>     </Process> To change the configuration using SSMS, perform the following steps: Connect to the SSAS instance using the instance administrator account and choose Properties. If your account does not have sufficient permissions, you will get an error that only administrators can edit server properties. Change the desired properties by altering the Value column on the General page of the resulting dialog, as shown in the following screenshot: Advanced properties are hidden by default. You must check the Show Advanced (All) Properties box to see advanced properties. You will not see all the properties in SSMS even after checking this box. The only way to edit some properties is by editing msmdsrv.ini as previously discussed. Make a note of the Reset Default button in the bottom-right corner. This button comes in handy if you've forgotten what the configuration values were before you changed them and want to revert to the default settings. The default values are shown in the dialog box, which can provide guidance as to which properties have been altered. Some configuration settings require restarting the SSAS instance prior to being executed. If this is the case, the Restart column will have a value of Yes. Once you're happy with your changes, click on OK and restart the instance if necessary. You can restart SSAS using the Services.msc applet from the command line using the NET STOP / NET START commands, or directly in SSMS by choosing the Restart option after right-clicking on the instance. How it works... Discussing every SSAS property would make this article extremely lengthy; doing so is well beyond the scope of the book. Instead, in this section, I will summarize the most frequently used properties. Often, synchronization has to copy large partition datafiles and aggregation files. If the timeout value is exceeded, synchronization fails. Increase the value of the <Network><Listener><ServerSendTimeout> and <Network><Listener><ServerReceiveTimeout> properties to allow a longer time span for copying each file. By default, SSAS can use a lazy thread to rebuild missing indexes and aggregations after you process partition data. If the <OLAP><LazyProcessing><Enabled> property is set to 0, the lazy thread is not used for building missing indexes—you must use an explicit processing command instead. The <OLAP><LazyProcessing><MaxCPUUsage> property throttles the maximum CPU that could be used by the lazy thread. If efficient data delivery is your topmost priority, you can exploit the ProcessData option instead of ProcessFull. To build aggregations after the data is loaded, you must set the partition's ProcessingMode property to LazyAggregations. The SSAS formula engine is single threaded, so queries that perform heavy calculations will only use one CPU core, even on a multiCPU computer. 
The storage engine is multithreaded; hence, queries that read many partitions will require many CPU cycles. If you expect storage engine heavy queries, you should lower the CPU usage threshold for LazyAggregations. By default, Analysis Services records subcubes requested for every 10th query in the query log table. If you'd like to design aggregations based on query logs, you should change the <Log><QueryLog><QueryLogSampling> property value to 1 so that the SSAS logs subcube requests for every query. SSAS can use its own memory manager or the Windows memory manager. If your SSAS instance consistently becomes unresponsive, you could try using the Windows memory manager. Set <Memory><MemoryHeapType> to 2 and <Memory><HeapTypeForObjects> to 0. The Analysis Services memory manager values are 1 for both the properties. You must restart the SSAS service for the changes to these properties to take effect. The <Memory><PreAllocate> property specifies the percentage of total memory to be reserved at SSAS startup. SSAS normally allocates memory dynamically as it is required by queries and processing jobs. In some cases, you can achieve performance improvement by allocating a portion of the memory when the SSAS service starts. Setting this value will increase the time required to start the service. The memory will not be released back to the operating system until you stop the SSAS service. You must restart the SSAS service for changes to this property to take effect. The <Log><FlightRecorder><FileSizeMB>and <Log><FlightRecorder><LogDurationSec> properties control the size and age of the FlightRecorder trace file before it is recycled. You can supply your own trace definition file to include the trace events and columns you wish to monitor using the <Log><FlightRecorder><TraceDefinitionFile> property. If FlightRecorder collects useful trace events, it can be an invaluable troubleshooting tool. By default, the file is only allowed to grow to 10 MB or 60 minutes. Long processing jobs can take up much more space, and their duration could be much longer than 60 minutes. Hence, you should adjust the settings as necessary for your monitoring needs. You should also adjust the trace events and columns to be captured by FlightRecorder. You should consider adjusting the duration to cover three days (in case the issue you are researching happens over a weekend). The <Memory><LowMemoryLimit> property controls the point—amount of memory used by SSAS—at which the cleaner thread becomes actively engaged in reclaiming memory from existing jobs. Each SSAS command (query, processing, backup, synchronization, and so on) is associated with jobs that run on threads and use system resources. We can lower the value of this setting to run more jobs in parallel (though the performance of each job could suffer). Two properties control the maximum amount of memory that a SSAS instance could use. Once memory usage reaches the value specified by <Memory><TotalMemoryLimit>, the cleaner thread becomes particularly aggressive at reclaiming memory. The <Memory><HardMemoryLimit> property specifies the absolute memory limit—SSAS will not use memory above this limit. These properties are useful if you have SSAS and other applications installed on the same server computer. You should reserve some memory for other applications and the operating system as well. When HardMemoryLimit is reached, SSAS will disconnect the active sessions, advising that the operation was cancelled due to memory pressure. 
All memory settings are expressed in percentages if the values are less than or equal to 100. Values above 100 are interpreted as kilobytes. All memory configuration changes require restart of the SSAS service to take effect. In the prior releases of Analysis Services, you could only specify the minimum and maximum number of threads used for queries and processing jobs. With SSAS 2012, you can also specify the limits for the input/output job threads using the <ThreadPool><IOProcess> property. The <Process><IndexBuildThreshold> property governs the minimum number of rows within a partition for which SSAS will build indexes. The default value is 4096. SSAS decides which partitions it needs to scan for each query based on the partition index files. If the partition does not have indexes, it will be scanned for all the queries. Normally, SSAS can read small partitions without greatly affecting query performance. But if you have many small partitions, you should lower the threshold to ensure each partition has indexes. The <Process><BufferRecordLimit> and <Process><BufferMemoryLimit> properties specify the number of records for each memory buffer and the maximum percentage of memory that can be used by a memory buffer. Lower the value of these properties to process more partitions in parallel. You should monitor processing using the SQL Profiler to see if some partitions included in the processing batch are being processed while the others are in waiting. The <ExternalConnectionTimeout> and <ExternalCommandTimeout> properties control how long an SSAS command should wait for connecting to a relational database or how long SSAS should wait to execute the relational query before reporting timeout. Depending on the relational source, it might take longer than 60 seconds (that is, the default value) to connect. If you encounter processing errors without being able to connect to the relational source, you should increase the ExternalConnectionTimeout value. It could also take a long time to execute a query; by default, the processing query will timeout after one hour. Adjust the value as needed to prevent processing failures. The contents of the <AllowedBrowsingFolders> property define the drives and directories that are visible when creating databases, collecting backups, and so on. You can specify multiple items separated using the pipe (|) character. The <ForceCommitTimeout> property defines how long a processing job's commit operation should wait prior to cancelling any queries/jobs which may interfere with processing or synchronization. A long running query can block synchronization or processing from committing its transaction. You can adjust the value of this property from its default value of 30 seconds to ensure that processing and queries don't step on each other. The <Port> property specifies the port number for the SSAS instance. You can use the hostname followed by a colon (:) and a port number for connecting to the SSAS instance in lieu of the instance name. Be careful not to supply the port number used by another application; if you do so, the SSAS service won't start. The <ServerTimeout> property specifies the number of milliseconds after which a query will timeout. The default value is 1 hour, which could be too long for analytical queries. If the query runs for an hour, using up system resources, it could render the instance unusable by any other connection. You can also define a query timeout value in the client application's connection strings. 
There's more...

There are many other properties you can set to alter the SSAS instance behavior. For additional information on configuration properties, please refer to the product documentation at http://technet.microsoft.com/en-us/library/ms174556.aspx.

Creating and dropping databases

Only SSAS instance administrators are permitted to create, drop, restore, detach, attach, and synchronize databases. This recipe teaches administrators how to create and drop databases.

Getting ready

Launch SSMS and connect to your Analysis Services instance as an administrator. If you're not certain that you have administrative privileges on the instance, right-click on the SSAS instance and choose Properties. If you can view the instance's properties, you are an administrator; otherwise, you will get an error indicating that only instance administrators can view and alter properties.

How to do it...

To create a database, perform the following steps:

1. Right-click on the Databases folder and choose New Database. Doing so launches the New Database dialog shown in the following screenshot.
2. Specify a descriptive name for the database, for example, Analysis_Services_Administration. Note that the database name can contain spaces. Each object has a name as well as an identifier. The identifier value is set to the object's original name and cannot be changed without dropping and recreating the database; hence, it is important to come up with a descriptive name from the very beginning. You cannot create more than one database with the same name on the same SSAS instance.
3. Specify the storage location for the database. By default, the database will be stored under the \OLAP\DATA folder of your SSAS installation directory. The only compelling reason to change the default is if your data drive is running out of disk space and cannot support the new database's storage requirements.
4. Specify the impersonation setting for the database. You can also specify the impersonation property for each data source; alternatively, each data source can inherit the DataSourceImpersonationInfo property from the database-level setting. You have four choices, as follows:
   - Specific user name (must be a domain user) and password: This is the most secure option but requires updating the password whenever the user changes their password
   - Analysis Services service account
   - Credentials of the current user: This option is specifically for data mining
   - Default: This option is the same as using the service account option
5. Specify an optional description for the database.

As with the majority of other SSMS dialogs, you can script the XMLA command you are about to execute by clicking on the Script button.

To drop an existing database, perform the following steps:

1. Expand the Databases folder on the SSAS instance, right-click on the database, and choose Delete.
2. The Delete objects dialog allows you to ignore errors; however, this option is not applicable to databases. You can script the XMLA command if you wish to review it first. An alternative way of scripting the DELETE command is to right-click on the database and navigate to Script database as | Delete To | New query window.
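For reference, the XMLA that the Script button generates has roughly the following shape. This is a simplified, hand-written sketch: the database name is the example used above, and the command scripted by SSMS will typically contain additional elements, so treat it as illustrative rather than as exact SSMS output.

<!-- Create an empty database (simplified sketch) -->
<Create xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <ObjectDefinition>
    <Database>
      <ID>Analysis_Services_Administration</ID>
      <Name>Analysis_Services_Administration</Name>
      <Description>Sample database for administration recipes</Description>
    </Database>
  </ObjectDefinition>
</Create>

<!-- Drop the same database (simplified sketch) -->
<Delete xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Object>
    <DatabaseID>Analysis_Services_Administration</DatabaseID>
  </Object>
</Delete>

You can paste either command into an XMLA query window in SSMS and execute it against the instance.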
Monitoring SSAS instance using Activity Viewer

Unlike other database systems, Analysis Services has no system databases. However, administrators still need to check the activity on the server, ensure that cubes are available and can be queried, and confirm that there is no blocking.

You can use a tool named Analysis Services Activity Viewer 2008 to monitor SSAS versions 2008 and later, including SSAS 2012. This tool is owned and maintained by the SSAS community and can be downloaded from www.codeplex.com. Activity Viewer allows viewing active and dormant sessions, current XMLA and MDX queries, and locks, as well as CPU and I/O usage by each connection. Additionally, you can define rules to raise alerts when a particular condition is met.

How to do it...

To monitor an SSAS instance using Activity Viewer, perform the following steps:

1. Launch the application by double-clicking on ActivityViewer.exe.
2. Click on the Add New Connection button on the Overview tab.
3. Specify the hostname and instance name, or the hostname and port number, for the SSAS instance and then click on OK.
4. For each SSAS instance you connect to, Activity Viewer adds a new tab. Click on the tab for your SSAS instance. Here, you will see several pages, as shown in the following screenshot:
   - Alerts: This page shows any sessions that met a condition found on the Rules page.
   - Users: This page displays one row for each user as well as the number of sessions, total memory, CPU, and I/O usage.
   - Active Sessions: This page displays each session that is actively running an MDX, Data Mining Extensions (DMX), or XMLA query. This page allows you to cancel a specific session by clicking on the Cancel Session button.
   - Current Queries: This page displays the actual command's text, the number of kilobytes read and written by the command, and the amount of CPU time used by the command. This page allows you to cancel a specific query by clicking on the Cancel Query button.
   - Dormant Sessions: This page displays sessions that have a connection to the SSAS instance but are not currently running any queries. You can also disconnect a dormant session by clicking on the Cancel Session button.
   - CPU: This page allows you to review the CPU time used by each session as well as the last command executed on the session.
   - I/O: This page displays the number of reads and writes as well as the kilobytes read and written by each session.
   - Objects: This page shows the CPU time and number of reads affecting each dimension and partition. This page also shows the full path to the object's parent, which is useful if you have the same naming convention for partitions in multiple measure groups: not only do you see the partition name, but also the full path to the partition's measure group. This page also shows the number of aggregation hits for each partition. If you find that a partition is frequently queried and requires many reads, you should consider building aggregations for it.
   - Locks: This page displays the locks currently in place, whether already granted or waiting. Be sure to check the Lock Status column; a value of 0 indicates that the lock request is currently blocked.
   - Rules: This page allows you to define conditions that will result in an alert, for example, if a session is idle for over 30 minutes or if an MDX query takes over 30 minutes.

How it works...

Activity Viewer monitors Analysis Services using Dynamic Management Views (DMVs). In fact, capturing the queries executed by Activity Viewer using SQL Server Profiler is a good way of familiarizing yourself with SSAS DMVs.
For example, the Current Queries page checks the $system.DISCOVER_COMMANDS DMV for any actively executing commands by running the following query:

SELECT SESSION_SPID,
       COMMAND_CPU_TIME_MS,
       COMMAND_ELAPSED_TIME_MS,
       COMMAND_READ_KB,
       COMMAND_WRITE_KB,
       COMMAND_TEXT
FROM $system.DISCOVER_COMMANDS
WHERE COMMAND_ELAPSED_TIME_MS > 0
ORDER BY COMMAND_CPU_TIME_MS DESC

The Active Sessions page checks the $system.DISCOVER_SESSIONS DMV with the session status set to 1, using the following query:

SELECT SESSION_SPID,
       SESSION_USER_NAME,
       SESSION_START_TIME,
       SESSION_ELAPSED_TIME_MS,
       SESSION_CPU_TIME_MS,
       SESSION_ID
FROM $SYSTEM.DISCOVER_SESSIONS
WHERE SESSION_STATUS = 1
ORDER BY SESSION_USER_NAME DESC

The Dormant Sessions page runs a very similar query to that of the Active Sessions page, except that it checks for sessions with SESSION_STATUS = 0, that is, sessions that are currently not running any queries. The result set is also limited to the top 10 sessions based on idle time measured in milliseconds.

The Locks page examines all the columns of the $system.DISCOVER_LOCKS DMV to find all requested locks as well as the lock creation time, lock type, and lock status. As you have already learned, a lock status of 0 indicates that the request is blocked, whereas a lock status of 1 means that the request has been granted.

Analysis Services blocking can be caused by conflicting operations that attempt to query and modify objects. For example, a long-running query can block a processing or synchronization job from completing, because processing will change the data values. Similarly, a command altering the database structure will block queries. The database administrator or instance administrator can explicitly issue the LOCK XMLA command as well as the BEGIN TRANSACTION command; other operations request locks implicitly. The following table documents the most frequently encountered Analysis Services lock types:

Lock type identifier | Description      | Acquired for
2                    | Read lock        | Processing, to read metadata
4                    | Write lock       | Processing, to write data after it is read from relational sources
8                    | Commit shared    | During processing, restore, or synchronization commands
16                   | Commit exclusive | Committing the processing, restore, or synchronization transaction when existing files are replaced by new files
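If you want to check for blocking without Activity Viewer, you can query the same DMV directly from an MDX query window in SSMS. The query below is a minimal sketch that simply returns every lock request; the LOCK_STATUS column name is assumed from the Lock Status value that Activity Viewer displays, so verify the exact column list on your own instance.

-- List every lock request on the instance; rows with LOCK_STATUS = 0 are blocked
SELECT *
FROM $system.DISCOVER_LOCKS

Keep in mind that SSAS DMV queries support only a restricted subset of SQL (no joins, for example), so you may find it easier to filter or combine the results on the client side.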

How to Configure Metricbeat for Application and Server infrastructure

Pravin Dhandre
23 Feb 2018
8 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Pranav Shukla and Sharath Kumar M N titled Learning Elastic Stack 6.0. This book provides detailed understanding in how you can employ Elastic Stack in performing distributed analytics along with resolving various data processing challenges.[/box] In today’s tutorial, we will show the step-by-step configuration of Metricbeat, a Beats platform for monitoring server and application infrastructure. Configuring Metricbeat   The configurations related to Metricbeat are stored in a configuration file named metricbeat.yml, and it uses YAML syntax. The metricbeat.yml file contains the following: Module configuration General settings Output configuration Processor configuration Path configuration Dashboard configuration Logging configuration Let's explore some of these sections. Module configuration Metricbeat comes bundled with various modules to collect metrics from the system and applications such as Apache, MongoDB, Redis, MySQL, and so on. Metricbeat provides two ways of enabling modules and metricsets: Enabling module configs in the modules.d directory Enabling module configs in the metricbeat.yml file Enabling module configs in the modules.d directory The modules.d directory contains default configurations for all the modules available in Metricbeat. The configuration specific to a module is stored in a .yml file with the name of the file being the name of the module. For example, the configuration related to the MySQL module would be stored in the mysql.yml file. By default, excepting the system module, all other modules are disabled. To list the modules that are available in Metricbeat, execute the following command: Windows: D:packtmetricbeat-6.0.0-windows-x86_64>metricbeat.exe modules list Linux: [locationOfMetricBeat]$./metricbeat modules list The modules list command displays all the available modules and also lists which modules are currently enabled/disabled. As each module comes with the default configurations, make the appropriate changes in the module configuration file. The basic configuration for mongodb module will look as follows: - module: mongodb metricsets: ["dbstats", "status"] period: 10s hosts: ["localhost:27017"] username: user password: pass To enable it, execute the modules enable command, passing one or more module name. For example: Windows: D:packtmetricbeat-6.0.0-windows-x86_64>metricbeat.exe modules enable redis mongodb Linux: [locationOfMetricBeat]$./metricbeat modules enable redis mongodb Similar to disable modules, execute the modules disable command, passing one or more module names to it. For example: Windows: D:packtmetricbeat-6.0.0-windows-x86_64>metricbeat.exe modules disable redis mongodb Linux: [locationOfMetricBeat]$./metricbeat modules disable redis mongodb To enable dynamic config reloading, set reload.enabled to true and to specify the frequency to look for config file changes. Set the reload.period parameter under the metricbeat.config.modules property. For example: #metricbeat.yml metricbeat.config.modules: path: ${path.config}/modules.d/*.yml reload.enabled: true reload.period: 20s Enabling module config in the metricbeat.yml file If one is used to earlier versions of Metricbeat, one can enable the modules and metricsets in the metricbeat.yml file directly by adding entries to the metricbeat.modules list. Each entry in the list begins with a dash (-) and is followed by the settings for that module. 
Enabling module config in the metricbeat.yml file

If you are used to earlier versions of Metricbeat, you can enable the modules and metricsets in the metricbeat.yml file directly by adding entries to the metricbeat.modules list. Each entry in the list begins with a dash (-) and is followed by the settings for that module. For example:

metricbeat.modules:
#------------------ Memcached Module -----------------------------
- module: memcached
  metricsets: ["stats"]
  period: 10s
  hosts: ["localhost:11211"]
#------------------- MongoDB Module ------------------------------
- module: mongodb
  metricsets: ["dbstats", "status"]
  period: 5s

It is possible to specify a module multiple times and use a different period for one or more metricsets. For example:

#------- Couchbase Module -----------------------------
- module: couchbase
  metricsets: ["bucket"]
  period: 15s
  hosts: ["localhost:8091"]
- module: couchbase
  metricsets: ["cluster", "node"]
  period: 30s
  hosts: ["localhost:8091"]

General settings

This section contains configuration options and some general settings to control the behavior of Metricbeat. Some of the configuration options/settings are:

- name: The name of the shipper that publishes the network data. By default, the hostname is used for this field:

  name: "dc1-host1"

- tags: A list of tags that will be included in the tags field of every event Metricbeat ships. Tags make it easy to group servers by different logical properties and help when filtering events in Kibana and Logstash:

  tags: ["staging", "web-tier", "dc1"]

- max_procs: The maximum number of CPUs that can be executing simultaneously. The default is the number of logical CPUs available in the system:

  max_procs: 2

Output configuration

This section is used to configure the outputs where the events need to be shipped. Events can be sent to single or multiple outputs simultaneously. The allowed outputs are Elasticsearch, Logstash, Kafka, Redis, file, and console. Some of the outputs that can be configured are as follows:

elasticsearch: This output is used to send the events directly to Elasticsearch. A sample Elasticsearch output configuration is shown in the following code snippet:

output.elasticsearch:
  enabled: true
  hosts: ["localhost:9200"]

Using the enabled setting, you can enable or disable the output. hosts accepts one or more Elasticsearch nodes/servers. Multiple hosts can be defined for failover purposes. When multiple hosts are configured, the events are distributed to these nodes in round-robin order. If Elasticsearch is secured, the credentials can be passed using the username and password settings:

output.elasticsearch:
  enabled: true
  hosts: ["localhost:9200"]
  username: "elasticuser"
  password: "password"

To ship the events to an Elasticsearch ingest node pipeline so that they can be pre-processed before being stored in Elasticsearch, the pipeline information can be provided using the pipeline setting:

output.elasticsearch:
  enabled: true
  hosts: ["localhost:9200"]
  pipeline: "ngnix_log_pipeline"

The default index the data gets written to is of the format metricbeat-%{[beat.version]}-%{+yyyy.MM.dd}, which creates a new index every day. For example, if today is December 2, 2017, then all the events are placed in the metricbeat-6.0.0-2017.12.02 index. You can override the index name or the pattern using the index setting. In the following configuration snippet, a new index is created for every month:

output.elasticsearch:
  hosts: ["http://localhost:9200"]
  index: "metricbeat-%{[beat.version]}-%{+yyyy.MM}"
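Putting the module loading, general settings, and Elasticsearch output seen so far together, a minimal metricbeat.yml might look like the following sketch. The shipper name, tags, and host are placeholders assembled from the examples above, not values to copy verbatim.

# Load module configs from modules.d and reload them on change
metricbeat.config.modules:
  path: ${path.config}/modules.d/*.yml
  reload.enabled: true
  reload.period: 20s

# General settings
name: "dc1-host1"
tags: ["staging", "web-tier", "dc1"]

# Ship events directly to a local Elasticsearch node
output.elasticsearch:
  enabled: true
  hosts: ["localhost:9200"]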
Using the indices setting, you can conditionally place the events in the appropriate index matching a specified condition. In the following code snippet, if the message contains the DEBUG string, the event is placed in the debug-%{+yyyy.MM.dd} index; if the message contains the ERR string, it is placed in the error-%{+yyyy.MM.dd} index; and if the message contains neither of these texts, the event is pushed to the logs-%{+yyyy.MM.dd} index, as specified by the index parameter:

output.elasticsearch:
  hosts: ["http://localhost:9200"]
  index: "logs-%{+yyyy.MM.dd}"
  indices:
    - index: "debug-%{+yyyy.MM.dd}"
      when.contains:
        message: "DEBUG"
    - index: "error-%{+yyyy.MM.dd}"
      when.contains:
        message: "ERR"

When the index parameter is overridden, disable the templates and dashboards by adding the following settings:

setup.dashboards.enabled: false
setup.template.enabled: false

Alternatively, provide values for setup.template.name and setup.template.pattern in the metricbeat.yml configuration file; otherwise, Metricbeat will fail to run.

logstash: This output is used to send the events to Logstash. To use Logstash as the output, Logstash needs to be configured with the Beats input plugin to receive the incoming Beats events. A sample Logstash output configuration is as follows:

output.logstash:
  enabled: true
  hosts: ["localhost:5044"]

Using the enabled setting, you can enable or disable the output. hosts accepts one or more Logstash servers. Multiple hosts can be defined for failover purposes; if the configured host is unresponsive, the event will be sent to one of the other configured hosts. When multiple hosts are configured, the events are distributed in random order. To enable load balancing of events across the Logstash hosts, set the loadbalance flag to true:

output.logstash:
  hosts: ["localhost:5045", "localhost:5046"]
  loadbalance: true

console: This output is used to send the events to stdout. The events are written in JSON format. It is useful during debugging or testing. A sample console configuration is as follows:

output.console:
  enabled: true
  pretty: true

Logging

This section contains the options for configuring the Metricbeat logging output. The logging system can write logs to syslog or rotate log files. If logging is not explicitly configured, file output is used on Windows systems, and syslog output is used on Linux and OS X. A sample configuration is as follows:

logging.level: debug
logging.to_files: true
logging.files:
  path: C:\logs\metricbeat
  name: metricbeat.log
  keepfiles: 10

Some of the configuration options are:

- level: To specify the logging level.
- to_files: To write all logging output to files. The files are subject to file rotation. This is the default value.
- to_syslog: To write the logging output to syslog if this setting is set to true.
- files.path, files.name, and files.keepfiles: These are used to specify the directory the log files are written to, the name of the log file, and the number of most recently rotated log files to keep.

We successfully configured the Metricbeat library and set up the transmission of operational metrics to Elasticsearch, making it easy to monitor systems and services on servers. If you found this tutorial useful, do check out the book Learning Elastic Stack 6.0 to examine the fundamentals of the Elastic Stack in detail and start developing solutions for problems like logging, site search, app search, metrics, and more.


NIPS 2017 Special: 6 Key Challenges in Deep Learning for Robotics by Pieter Abbeel

Aaron Lazar
13 Dec 2017
10 min read
Pieter Abbeel is a professor at UC Berkeley and a former Research Scientist at OpenAI. His current research focuses on robotics and machine learning, with a particular focus on deep reinforcement learning, deep imitation learning, deep unsupervised learning, meta-learning, learning-to-learn, and AI safety. This article attempts to bring our readers Pieter's fantastic keynote speech at NIPS 2017. It talks about the implementation of deep reinforcement learning in robotics, the challenges that exist, and how these challenges can be overcome. Once you've been through this article, we're certain you'd be extremely interested in watching the entire video on the NIPS Facebook page. All images in this article come from his presentation slides and do not belong to us.

Robotics and machine learning have been growing by leaps and bounds, with several companies investing huge amounts to tie both these technologies together in the best way possible. However, several aspects are still far from solved when it comes to AI in robotics. Here are a few of them:

- Maximize signal extracted from real-world experience
- Faster/data-efficient reinforcement learning
- Long-horizon reasoning
- Taskability (imitation learning)
- Lifelong learning (continuous adaptation)
- Leverage simulation

Maximize signal extracted from real-world experience

We need more real-world data, so we need to extract as much signal from it as possible. The diagram below shows the different layers of machine learning that engineers perform. There are engineers who look at the entire cake and train the agent to learn both from the reward and from auxiliary signals, because using only reinforcement learning doesn't give you a lot of signal. Is there, then, a possibility of having a reward signal in RL that ties more RL into the system?

There's something known as Hindsight Experience Replay. The idea is to get a reward signal from any experience by assuming the goal equals whatever happened, and not just from success as in usual RL. For this, we need to assume that whatever the agent does is a success. We use Q-learning, and instead of a standard Q function, we use multiple goals, even though they were not really the goal when the agent was acting. Here, a replay buffer collects experience, Q-learning is then applied, and a hindsight replay is performed to infuse a new reward for everything the agent has done. For various robotic tasks like pushing, sliding, and picking and placing objects, this does very well.

Faster Reinforcement Learning

When we're talking about faster RL, we're talking about much more data-efficient RL. Here is a diagram that demonstrates standard RL: an agent makes a robot perform an action in a particular environment or situation in order to achieve a reward, and the goal is to maximize the reward. Unlike supervised learning, there is no supervision as to whether the actions taken by the agent are right or wrong. That brings in a few additional challenges in RL, which are:

- Credit assignment: This is a major problem and is where you get the signal from in RL
- Stability: Because of the feedback loop, the system could destabilize and destroy itself
- Exploration: Doing things you've never done before, when the only way to learn is based on what you've done before

Despite this, there have been great improvements in reinforcement learning in the past few years, enabling AI systems to play games like Go and Dota. It has also been used by NASA in building robots for planetary exploration.
But the question still exists: "How good is learning?" In the game of Pong, a human takes roughly 2 hours to learn what a Deep Q-Network (DQN) learns in 40 hours! A more careful study reveals that after 15 minutes, humans tend to outperform a DDQN that has trained for 115 hours. This is a tremendous gap in terms of learning efficiency. So, how do we overcome the challenge?

Several fully generalized algorithms such as Trust Region Policy Optimization (TRPO), DQN, Asynchronous Advantage Actor-Critic (A3C), and Rainbow are available, meaning that they can be applied to any kind of environment. However, only a very small subset of environments is actually encountered in the real world. Can we develop fast RL algorithms that take advantage of this situation?

RL agents can be reused to train various policies. The RL algorithm is developed to train the policy to adapt to a particular environment A; this can then be replicated for environment B, and so on. Humans develop the RL algorithm and then rely on it to train the policy. Despite this, none of the algorithms are as good as human learners. Do we have an alternative, then? Indeed, yes! Why not let the system learn not just the policy but the algorithm as well, or in other words, the entire agent?

Enter Meta-Reinforcement Learning

In meta-RL, the learning algorithm itself is being learned. You could relate this to meta-programming, where one program is trained to write another. This process helps a system learn the world better so it can pick up a new situation more quickly. So how does this work? The system is faced with many environments, so that it learns the algorithm and then outputs a faster RL agent. When faced with a new environment, it quickly adapts to it.

For evaluating the actual performance, the multi-armed bandits problem can be considered. Here's the setting: each bandit has its own distribution over payouts, and in each episode you can choose one bandit. A good RL agent should be able to explore a sufficient number of bandits and exploit the best ones. We need to come up with an algorithm that prefers bandits with a higher probability of payoff over those with a lower probability. Several asymptotically optimal algorithms, such as the Gittins index, UCB1, and Thompson sampling, have already been created to solve this problem. Here's a comparison of some of them with the meta-RL algorithm. The result is quite impressive: the meta-RL algorithm is competitive with the Gittins index.

In a case where the task is to run in a target direction while attaining the maximum speed, the agent, when dropped into an environment, is able to master the task almost instantly. However, meta-learning succeeds only two-thirds of the time. It doesn't succeed the rest of the time due to two main reasons:

- Overfitting: You usually tend to overfit to the current situation rather than fitting generically to situations
- Underfitting: This is when you don't get enough signal to get any rewards

The solution is to put a different structure underneath the system. Instead of using an RNN, we use a WaveNet-like architecture or the Simple Neural Attentive Meta-Learner (SNAIL). SNAIL performs a bit better than RL2 on the same bandits problem.

Longer Horizon Reasoning

We need to learn to reason over longer horizons than canonical algorithms do. For this, we need hierarchy. For example, suppose a robot has to perform 10 tasks in a day; that would mean it has 10 time steps per day. Each of these 10 tasks would have subtasks under them.
Let's assume that the subtasks would make it a total of 1,000 time steps. To perform these tasks, the robot would need footstep planning, which would amount to 100,000 time steps. Footsteps in turn require commands to be sent to motors, which would make it 100,000,000 time steps. This is a very long horizon. We can formulate this as a meta-learning problem: the agent has to solve a distribution of related long-horizon tasks, with the goal of learning new tasks in the distribution quickly. If that is our objective, hierarchy would fall out naturally.

Taskability (Imitation Learning)

There are several things we want from robots. We need to be able to tell them what to do, and we can do this by giving them examples. This is called imitation learning, and it can be successfully applied to a variety of use cases. The idea is to collect many demonstrations, train something from those demonstrations, and then deploy the learned policy. The problem with this is that every time there is a new task, you start from scratch. The solution to this problem is experience across several demonstrations, as in the case of humans. However, instead of running the agent through several demos of the new task, it is trained completely on one, then shown a frame of a second demo, which it uses to predict what the outcome would be. This is known as one-shot imitation learning, a form of supervised learning wherein many demonstrations are used to train the system to handle any new environment it is put into.

Lifelong learning (Continuous Adaptation)

What we usually do in ML can be divided into two broad steps:

1. Run machine learning
2. Deploy it, which is the canonical way

In this case, all the learning happens ahead of time, before deployment. However, in real-world cases, what you learn from past data might not work in the future. There is a need to learn during deployment, which is the spirit of lifelong learning. This brings us to continuous adaptation: can we train an agent to be good at non-stationary environments? We need to find out whether, at meta-training time, the agent is able to adapt to a new or changing task. We can try changing the dynamics, since it's hard to do ML training in the real world. At the same time, we can also use competitive environments, which means your agent is in an environment with other agents that are trying to beat it. The only way to succeed is to continuously adapt more quickly than the others.

Leverage Simulation

Simulation is very helpful, and it's not that expensive. It's fast and scalable, and it lets you label more easily. However, the challenge is how to get useful things out of the simulator. One approach is to build realistic simulators, which is quite expensive. Another way is to use a close-enough simulator that leverages real-world data through domain confusion or adaptation; this allows the system to learn from a small amount of real-world data and is quite successful. Further, another approach is domain randomization, which is also working well in the real world. If the model sees enough simulated variations, the real world might appear to be just the next simulator. This has worked in the context of using simulator data to train a quadcopter to avoid collisions. Moreover, whether pre-trained on ImageNet or trained purely in simulation, performance was similar after around 8,000 examples. To conclude, the beauty of meta-learning is that it enables the discovery of algorithms that are data-driven, as opposed to those created from pure human ingenuity.
This requires more computing power, but several companies, such as Nvidia and Intel, are working hard to overcome this challenge, which will help push meta-learning far enough to be implemented in robotics. While we figure out the technical challenges mentioned above for incorporating AI in robotics, other significant challenges that we must focus on in parallel include safe learning and value alignment, among others.