
How-To Tutorials - Data


Top 10 deep learning frameworks

Amey Varangaonkar
25 May 2017
9 min read
Deep learning frameworks are powering the artificial intelligence revolution. Without them, it would be almost impossible for data scientists to deliver the level of sophistication in their deep learning algorithms that advances in computing and processing power have made possible. Put simply, deep learning frameworks make it easier to build deep learning algorithms of considerable complexity. This follows a wider trend that you can see in other fields of programming and software engineering: open source communities are continually developing new tools that simplify difficult tasks and minimize arduous ones. The deep learning framework you choose is ultimately down to what you're trying to do and how you already work. But to get you started, here is a list of 10 of the best and most popular deep learning frameworks being used today. What are the best deep learning frameworks? TensorFlow One of the most popular deep learning libraries out there, TensorFlow, was developed by the Google Brain team and open-sourced in 2015. Positioned as a 'second-generation machine learning system', TensorFlow is a Python-based library capable of running on multiple CPUs and GPUs. It is available on all platforms, desktop and mobile. It also has support for other languages such as C++ and R, and can be used directly to create deep learning models or through wrapper libraries (e.g. Keras) on top of it. In November 2017, TensorFlow announced a developer preview of TensorFlow Lite, a lightweight machine learning solution for mobile and embedded devices. The machine learning paradigm is continuously evolving, and the focus is now slowly shifting towards developing machine learning models that run on mobile and portable devices in order to make applications smarter and more intelligent. Learn how to build a neural network with TensorFlow. If you're just starting out with deep learning, TensorFlow is THE go-to framework. It's Python-based, backed by Google, has very good documentation, and there are tons of tutorials and videos available on the internet to guide you. You can check out Packt's TensorFlow catalog here. Keras Although TensorFlow is a very good deep learning library, creating models using only TensorFlow can be a challenge, as it is a pretty low-level library and can be quite complex for a beginner to use. To tackle this challenge, Keras was built as a simplified interface for building efficient neural networks in just a few lines of code, and it can be configured to work on top of TensorFlow. Written in Python, Keras is very lightweight, easy to use, and pretty straightforward to learn. For these reasons, TensorFlow has incorporated Keras as part of its core API. Despite being a relatively new library, Keras has very good documentation in place. If you want to know more about how Keras solves your deep learning problems, this interview with our best-selling author Sujit Pal should help you. Read now: Why you should use Keras for deep learning [box type="success" align="" class="" width=""]If you have some knowledge of Python programming and want to get started with deep learning, this is one library you definitely want to check out![/box] Caffe Built with expression, speed, and modularity in mind, Caffe is one of the first deep learning libraries, developed mainly by the Berkeley Vision and Learning Center (BVLC). It is a C++ library which also has a Python interface and finds its primary application in modeling Convolutional Neural Networks.
One of the major benefits of using this library is that you can get a number of pre-trained networks directly from the Caffe Model Zoo, available for immediate use. If you're interested in modeling CNNs or solving your image processing problems, you might want to consider this library. Following in the footsteps of Caffe, Facebook also recently open-sourced Caffe2, a new lightweight, modular deep learning framework which offers greater flexibility for building high-performance deep learning models. Torch Torch is a Lua-based deep learning framework that has been used and developed by big players such as Facebook, Twitter and Google. It makes use of C/C++ libraries as well as CUDA for GPU processing. Torch was built with the aim of achieving maximum flexibility and making the process of building your models extremely simple. More recently, the Python implementation of Torch, called PyTorch, has found popularity and is gaining rapid adoption. PyTorch PyTorch is a Python package for building deep neural networks and performing complex tensor computations. While Torch uses Lua, PyTorch leverages the rising popularity of Python to allow anyone with some basic Python programming knowledge to get started with deep learning. PyTorch improves upon Torch's architectural style and does away with containers, which makes the entire deep modeling process easier and more transparent to you. Still wondering how PyTorch and Torch are different from each other? Make sure you check out this interesting post on Quora. Deeplearning4j DeepLearning4j (or DL4J) is a popular deep learning framework developed in Java, and it supports other JVM languages as well. It is very slick and widely used as a commercial, industry-focused distributed deep learning platform. The advantage of using DL4J is that you can bring together the power of the whole Java ecosystem to perform efficient deep learning, as it can be implemented on top of popular Big Data tools such as Apache Hadoop and Apache Spark. [box type="success" align="" class="" width=""]If Java is your programming language of choice, then you should definitely check out this framework. It is clean, enterprise-ready, and highly effective. If you're planning to deploy your deep learning models to production, this tool can certainly be of great worth![/box]
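To give a flavor of what working with DL4J looks like, here is a minimal, illustrative sketch of a small feed-forward classifier, not taken from any particular project; it assumes a recent DL4J release with an ND4J backend on the classpath, and the layer sizes, updater, and hyperparameters are placeholders chosen only for the example.

import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.conf.layers.DenseLayer;
import org.deeplearning4j.nn.conf.layers.OutputLayer;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.nd4j.linalg.activations.Activation;
import org.nd4j.linalg.learning.config.Adam;
import org.nd4j.linalg.lossfunctions.LossFunctions;

// Illustrative network: 784 inputs -> 128 hidden units -> 10 output classes
MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
        .updater(new Adam(1e-3))                      // optimizer and learning rate
        .list()
        .layer(new DenseLayer.Builder()
                .nIn(784).nOut(128)
                .activation(Activation.RELU).build())
        .layer(new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
                .nIn(128).nOut(10)
                .activation(Activation.SOFTMAX).build())
        .build();

MultiLayerNetwork model = new MultiLayerNetwork(conf);
model.init();
// model.fit(trainIterator); // trainIterator would be a DataSetIterator over your data

The configuration object fully describes the network, so the same model definition can be trained locally or distributed on Spark via DL4J's Spark integration.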
MXNet MXNet is one of the deep learning frameworks with the widest language support, covering languages such as R, Python, C++ and Julia. This is helpful because if you know any of these languages, you won't need to step out of your comfort zone at all to train your deep learning models. Its backend is written in C++ and CUDA, and it is able to manage its own memory like Theano. MXNet is also popular because it scales very well and is able to work with multiple GPUs and computers, which makes it very useful for enterprises. This is also one of the reasons why Amazon made MXNet its reference library for deep learning. In November, AWS announced the availability of ONNX-MXNet, an open source Python package to import ONNX (Open Neural Network Exchange) deep learning models into Apache MXNet. Read why MXNet is a versatile deep learning framework here. Microsoft Cognitive Toolkit Microsoft Cognitive Toolkit, previously known by its acronym CNTK, is an open-source deep learning toolkit for training deep learning models. It is highly optimized and has support for languages such as Python and C++. Known for its efficient resource utilization, you can easily implement efficient Reinforcement Learning models or Generative Adversarial Networks (GANs) using the Cognitive Toolkit. It is designed to achieve high scalability and performance, and is known to provide high performance gains compared to other toolkits like Theano and TensorFlow when running on multiple machines. Here is a fun comparison of TensorFlow versus CNTK, if you would like to know more. deeplearn.js Gone are the days when you required serious hardware to run your complex machine learning models. With deeplearn.js, you can now train neural network models right in your browser! Originally developed by the Google Brain team, deeplearn.js is an open-source, JavaScript-based deep learning library which runs on both WebGL 1.0 and WebGL 2.0. deeplearn.js is being used today for a variety of purposes - from education and research to training high-performance deep learning models. You can also run your pre-trained models in the browser using this library. BigDL BigDL is a distributed deep learning library for Apache Spark and is designed to scale very well. With the help of BigDL, you can run your deep learning applications directly on Spark or Hadoop clusters by writing them as Spark programs. It has rich deep learning support and uses Intel's Math Kernel Library (MKL) to ensure high performance. Using BigDL, you can also load your pre-trained Torch or Caffe models into Spark. If you want to add deep learning functionality to a massive set of data stored on your cluster, this is a very good library to use. [box type="shadow" align="" class="" width=""]Editor's Note: We have removed Theano and Lasagne from the original list due to the Theano retirement announcement. RIP Theano! Before TensorFlow, Caffe or PyTorch came to be, Theano was the most widely used library for deep learning. While it was a low-level library supporting CPU as well as GPU computations, you could wrap it with libraries like Keras to simplify the deep learning process. With the release of version 1.0, it was announced that future development and support for Theano would be stopped. There would be minimal maintenance to keep it working for the next year, after which even the support activities on the library would be suspended completely. "Supporting Theano is no longer the best way we can enable the emergence and application of novel research ideas", said Prof. Yoshua Bengio, one of the main developers of Theano. Thank you Theano, you will be missed! Goodbye Lasagne Lasagne is a high-level deep learning library that runs on top of Theano. It has been around for quite some time now and was developed with the aim of abstracting away the complexities of Theano and providing a friendlier interface for users to build and train neural networks. It requires Python and shares many similarities with Keras, which we just saw above. However, if we are to find differences between the two, Keras is faster and has better documentation in place.[/box] There are many other deep learning libraries and frameworks available for use today – DSSTNE, Apache Singa, and Veles are just a few worth an honorable mention. Which deep learning framework will best suit your needs? Ultimately, it depends on a number of factors. If you want to get started with deep learning, your safest bet would be to use a Python-based framework like TensorFlow, which is quite popular.
For seasoned professionals, the efficiency of the trained model, ease of use, speed and resource utilization are all important considerations for choosing the best deep learning framework.


Introduction to Titanic Datasets

Packt
09 May 2017
11 min read
In this article by Alexis Perrier, author of the book Effective Amazon Machine Learning, we see that artificial intelligence and big data have become a ubiquitous part of our everyday lives; cloud-based machine learning services are part of a rising billion-dollar industry. Among the several such services currently available on the market, Amazon Machine Learning stands out for its simplicity. Amazon Machine Learning was launched in April 2015 with the clear goal of lowering the barrier to predictive analytics by offering a service accessible to companies without the need for highly skilled technical resources. (For more resources related to this topic, see here.) Working with datasets You cannot do predictive analytics without a dataset. Although we are surrounded by data, finding datasets that are adapted to predictive analytics is not always straightforward. In this section, we present some resources that are freely available. The Titanic dataset is a classic introductory dataset for predictive analytics. Finding open datasets There is a multitude of dataset repositories available online, from local and global public institutions to non-profits and data-focused start-ups. Here's a small list of open dataset resources that are well suited for predictive analytics. This is by far not an exhaustive list. This thread on Quora points to many other interesting data sources: https://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public. You can also ask for specific datasets on Reddit at https://www.reddit.com/r/datasets/. The UCI Machine Learning Repository is a collection of datasets maintained by UC Irvine since 1987, hosting over 300 datasets related to classification, clustering, regression, and other ML tasks. Mldata.org from the University of Berlin, the Stanford Large Network Dataset Collection, and other major universities also offer great collections of open datasets. Kdnuggets.com has an extensive list of open datasets at http://www.kdnuggets.com/datasets. Data.gov and other US government agencies; data.UN.org and other UN agencies. AWS offers open datasets via partners at https://aws.amazon.com/government-education/open-data/. The following startups are data centered and give open access to rich data repositories: Quandl and Quantopian for financial datasets; Datahub.io, Enigma.com, and Data.world are dataset-sharing sites; Datamarket.com is great for time series datasets; Kaggle.com, the data science competition website, hosts over 100 very interesting datasets. AWS public datasets: AWS hosts a variety of public datasets, such as the Million Song Dataset, the mapping of the Human Genome, and the US Census data, as well as many others in Astronomy, Biology, Math, Economics, and so on. These datasets are mostly available via EBS snapshots, although some are directly accessible on S3. The datasets are large, from a few gigabytes to several terabytes, and are not meant to be downloaded to your local machine; they are only meant to be accessed via an EC2 instance (take a look at http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-public-data-sets.html for further details). AWS public datasets are accessible at https://aws.amazon.com/public-datasets/. Introducing the Titanic dataset We will use the classic Titanic dataset. The data consists of demographic and traveling information for 1,309 of the Titanic passengers, and the goal is to predict the survival of these passengers.
The full Titanic dataset is available from the Department of Biostatistics at the Vanderbilt University School of Medicine (http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.csv) in several formats. The Encyclopedia Titanica website (https://www.encyclopedia-titanica.org/) is the reference website regarding the Titanic. It contains all the facts, history, and data surrounding the Titanic, including a full list of passengers and crew members. The Titanic dataset is also the subject of the introductory competition on Kaggle.com (https://www.kaggle.com/c/titanic, requires opening an account with Kaggle). You can also find a csv version in the GitHub repository at https://github.com/alexperrier/packt-aml/blob/master/ch4. The Titanic data contains a mix of textual, Boolean, continuous, and categorical variables. It exhibits interesting characteristics such as missing values, outliers, and text variables ripe for text mining--a rich dataset that will allow us to demonstrate data transformations. Here's a brief summary of the 14 attributes: pclass: Passenger class (1 = 1st; 2 = 2nd; 3 = 3rd) survival: A Boolean indicating whether the passenger survived or not (0 = No; 1 = Yes); this is our target name: A field rich in information as it contains title and family names sex: male/female age: Age, a significant portion of values are missing sibsp: Number of siblings/spouses aboard parch: Number of parents/children aboard ticket: Ticket number fare: Passenger fare (British Pounds) cabin: Does the location of the cabin influence chances of survival? embarked: Port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton) boat: Lifeboat, many missing values body: Body Identification Number home.dest: Home/destination Take a look at http://campus.lakeforest.edu/frank/FILES/MLFfiles/Bio150/Titanic/TitanicMETA.pdf for more details on these variables. We have 1,309 records and 14 attributes, three of which we will discard. The home.dest attribute has too few existing values, the boat attribute is only present for passengers who survived, and the body attribute is only present for passengers who did not survive. We will discard these three columns later on using the data schema. Preparing the data Now that we have the initial raw dataset, we are going to shuffle it, split it into a training and a held-out subset, and load it to an S3 bucket. Splitting the data In order to build and select the best model, we need to split the dataset into three parts: training, validation, and test, with the usual ratios being 60%, 20%, and 20%. The training and validation sets are used to build several models and select the best one, while the test or held-out set is used for the final performance evaluation on previously unseen data. Since Amazon ML does the job of splitting the dataset used for model training and model evaluation into training and validation subsets, we only need to split our initial dataset into two parts: the global training/evaluation subset (80%) for model building and selection, and the held-out subset (20%) for predictions and final model performance evaluation. Shuffle before you split: If you download the original data from the Vanderbilt University website, you will notice that it is ordered by pclass, the class of the passenger, and by alphabetical order of the name column. The first 323 rows correspond to 1st class passengers, followed by 2nd class (277) and 3rd class (709) passengers. It is important to shuffle the data before you split it so that all the different variables have similar distributions in the training and held-out subsets. You can shuffle the data directly in the spreadsheet by creating a new column, generating a random number for each row, and then ordering by that column.
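If you prefer to script the shuffle and split rather than doing it in a spreadsheet, here is a minimal Java sketch of the same procedure; the file names, the 80/20 ratio, and the fixed random seed are illustrative assumptions, not values prescribed by the book.

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class ShuffleSplit {
    public static void main(String[] args) throws Exception {
        // Read the raw csv (assumed to be in the working directory)
        List<String> lines = new ArrayList<>(Files.readAllLines(Paths.get("titanic.csv")));
        String header = lines.remove(0);              // keep the header row aside
        Collections.shuffle(lines, new Random(42));   // fixed seed for reproducibility

        int trainSize = (int) (lines.size() * 0.8);   // 80% train/evaluation, 20% held-out
        List<String> train = new ArrayList<>(lines.subList(0, trainSize));
        List<String> heldOut = new ArrayList<>(lines.subList(trainSize, lines.size()));
        train.add(0, header);
        heldOut.add(0, header);

        Files.write(Paths.get("titanic_train.csv"), train);
        Files.write(Paths.get("titanic_heldout.csv"), heldOut);
    }
}

With 1,309 data rows, an 80/20 split gives roughly the 1047/263 row counts mentioned below.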
On GitHub: You will find an already shuffled titanic.csv file at https://github.com/alexperrier/packt-aml/blob/master/ch4/titanic.csv. In addition to shuffling the data, we have removed punctuation in the name column: commas, quotes, and parentheses, which can add confusion when parsing a csv file. We end up with two files: titanic_train.csv with 1047 rows and titanic_heldout.csv with 263 rows. These files are also available in the GitHub repo (https://github.com/alexperrier/packt-aml/blob/master/ch4). The next step is to upload these files to S3 so that Amazon ML can access them. Loading data on S3 AWS S3 is one of the main AWS services dedicated to hosting files and managing their access. Files in S3 can be public and open to the internet, or have access restricted to specific users, roles, or services. S3 is also used extensively by AWS for operations such as storing log files or results (predictions, scripts, queries, and so on). Files in S3 are organized around the notion of buckets. Buckets are placeholders with unique names, similar to domain names for websites. A file in S3 will have a unique locator URI: s3://bucket_name/{path_of_folders}/filename. The bucket name is unique across S3. In this section, we will create a bucket for our data, upload the titanic training file, and open its access to Amazon ML. Go to https://console.aws.amazon.com/s3/home, and open an S3 account if you don't have one yet. S3 pricing: S3 charges for the total volume of files you host and for the volume of file transfers, and prices depend on the region where the files are hosted. At the time of writing, for less than 1TB, AWS S3 charges $0.03/GB per month in the US East region. All S3 prices are available at https://aws.amazon.com/s3/pricing/. See also http://calculator.s3.amazonaws.com/index.html for the AWS cost calculator. Creating a bucket Once you have created your S3 account, the next step is to create a bucket for your files. Click on the Create bucket button: Choose a name and a region; since bucket names are unique across S3, you must choose a name for your bucket that has not already been taken. We chose the name aml.packt for our bucket, and we will use this bucket throughout. Regarding the region, you should always select a region that is the closest to the person or application accessing the files, in order to reduce latency and prices. Set Versioning, Logging, and Tags; versioning will keep a copy of every version of your files, which prevents accidental deletions. Since versioning and logging induce extra costs, we chose to disable them. Set permissions. Review and save. Loading the data To upload the data, simply click on the Upload button and select the titanic_train.csv file we created earlier on. You should, at this point, have the training dataset uploaded to your AWS S3 bucket. We added a /data folder in our aml.packt bucket to compartmentalize our objects. It will be useful later on when the bucket will also contain folders created by S3. At this point, only the owner of the bucket (you) is able to access and modify its contents. We need to grant the Amazon ML service permissions to read the data and add other files to the bucket.
When creating the Amazon ML datasource, we will be prompted to grant these permissions in the Amazon ML console. We can also modify the bucket's policy upfront. Granting permissions We need to edit the policy of the aml.packt bucket. To do so, we have to perform the following steps: Click into your bucket. Select the Permissions tab. In the drop down, select Bucket Policy as shown in the following screenshot. This will open an editor: Paste in the following JSON. Make sure to replace {YOUR_BUCKET_NAME} with the name of your bucket and save: { "Version": "2012-10-17", "Statement": [ { "Sid": "AmazonML_s3:ListBucket", "Effect": "Allow", "Principal": { "Service": "machinelearning.amazonaws.com" }, "Action": "s3:ListBucket", "Resource": "arn:aws:s3:::{YOUR_BUCKET_NAME}", "Condition": { "StringLike": { "s3:prefix": "*" } } }, { "Sid": "AmazonML_s3:GetObject", "Effect": "Allow", "Principal": { "Service": "machinelearning.amazonaws.com" }, "Action": "s3:GetObject", "Resource": "arn:aws:s3:::{YOUR_BUCKET_NAME}/*" }, { "Sid": "AmazonML_s3:PutObject", "Effect": "Allow", "Principal": { "Service": "machinelearning.amazonaws.com" }, "Action": "s3:PutObject", "Resource": "arn:aws:s3:::{YOUR_BUCKET_NAME}/*" } ] } Further details on this policy are available at http://docs.aws.amazon.com/machine-learning/latest/dg/granting-amazon-ml-permissions-to-read-your-data-from-amazon-s3.html. Once again, this step is optional, since Amazon ML will prompt you for access to the bucket when you create the datasource. Formatting the data Amazon ML works on comma separated values files (.csv)--a very simple format where each row is an observation and each column is a variable or attribute. There are, however, a few conditions that should be met: The data must be encoded in plain text using a character set such as ASCII, Unicode, or EBCDIC. All values must be separated by commas; if a value contains a comma, it should be enclosed in double quotes. Each observation (row) must be smaller than 100 KB. There are also conditions regarding the end-of-line characters that separate rows. Special care must be taken when using Excel on OS X (Mac), as explained on this page: http://docs.aws.amazon.com/machine-learning/latest/dg/understanding-the-data-format-for-amazon-ml.html What about other data file formats? Unfortunately, Amazon ML datasources are only compatible with csv files and Redshift databases, and do not accept formats such as JSON, TSV, or XML. However, other services such as Athena, a serverless database service, do accept a wider range of formats. Summary In this article we learnt how to find and work with datasets, using Amazon Web Services and the Titanic dataset. We also learnt how to prepare the data and load it on Amazon S3. Resources for Article: Further resources on this subject: Processing Massive Datasets with Parallel Streams – the MapReduce Model [article] Processing Next-generation Sequencing Datasets Using Python [article] Combining Vector and Raster Datasets [article]


Supervised Learning: Classification and Regression

Packt
06 Apr 2017
30 min read
In this article by Alexey Grigorev, author of the book Mastering Java for Data Science, we will look at how to do pre-processing of data in Java and how to do Exploratory Data Analysis inside and outside Java. Now that we have covered the foundation, we are ready to start creating Machine Learning models. First, we start with Supervised Learning. In the supervised setting we have some information attached to each observation – called labels – and we want to learn from it, and predict it for observations without labels. There are two types of labels: the first are discrete and finite, such as True/False or Buy/Sell, and the second are continuous, such as salary or temperature. These types correspond to two types of Supervised Learning: Classification and Regression. We will talk about them in this article. This article covers: Classification problems Regression problems Evaluation metrics for each type Overview of available implementations in Java (For more resources related to this topic, see here.) Classification In Machine Learning, the classification problem deals with discrete targets with a finite set of possible values. Binary Classification is the most common type of classification problem: the target variable can have only two possible values, such as True/False, Relevant/Not Relevant, Duplicate/Not Duplicate, Cat/Dog and so on. Sometimes the target variable can have more than two outcomes, for example: colors, category of an item, model of a car, and so on, and we call this Multi-Class Classification. Typically each observation can only have one label, but in some settings an observation can be assigned several values. Typically, Multi-Class Classification can be converted to a set of Binary Classification problems, which is why we will mostly concentrate on Binary Classification. Binary Classification Models There are many models for solving the binary classification problem and it is not possible to cover all of them. We will briefly cover the ones that are most often used in practice. They include: Logistic Regression Support Vector Machines Decision Trees Neural Networks We assume that you are already familiar with these methods. Deep familiarity is not required, but for more information you can check the following books: An Introduction to Statistical Learning by G. James, D. Witten, T. Hastie and R. Tibshirani Python Machine Learning by S. Raschka When it comes to libraries, we will cover the following: Smile, JSAT, LIBSVM, LIBLINEAR, and Encog. Smile Smile (Statistical Machine Intelligence and Learning Engine) is a library with a large set of classification and other machine learning algorithms. You can see the full list here: https://github.com/haifengl/smile. The library is available on Maven Central and the latest version at the moment of writing is 1.1.0. To include it in your project, add the following dependency: <dependency> <groupId>com.github.haifengl</groupId> <artifactId>smile-core</artifactId> <version>1.1.0</version> </dependency> It is being actively developed; new features and bug fixes are added quite often, but not released as frequently. We recommend using the latest available version of Smile, and to get it, you will need to build it from the sources. To do it: Install sbt – a tool for building Scala projects.
You can follow this instruction: http://www.scala-sbt.org/release/docs/Manual-Installation.html Use git to clone the project from https://github.com/haifengl/smile To build and publish the library to local maven repository, run the following command: sbt core/publishM2 The Smile library consists of several sub-modules, such as smile-core, smile-nlp, and smile-plot and so on. So, after building it, add the following dependency to your pom: <dependency> <groupId>com.github.haifengl</groupId> <artifactId>smile-core</artifactId> <version>1.2.0</version> </dependency> The models from Smile expect the data to be in a form of two-dimensional arrays of doubles, and the label information is one dimensional array of integers. For binary models, the values should be 0 or 1. Some models in Smile can handle Multi-Class Classification problems, so it is possible to have more labels. The models are built using the Builder pattern: you create a special class, set some parameters and at the end it returns the object it builds. In Smile this builder class is typically called Trainer, and all models should have a trainer for them. For example, consider training a Random Forest model: double[] X = ...// training data int[] y = ...// 0 and 1 labels RandomForest model = new RandomForest.Trainer() .setNumTrees(100) .setNodeSize(4) .setSamplingRates(0.7) .setSplitRule(SplitRule.ENTROPY) .setNumRandomFeatures(3) .train(X, y); The RandomForest.Trainer class takes in a set of parameters and the training data, and at the end produces the trained Random Forest model. The implementation of Random Forest from Smile has the following parameters: numTrees: number of trees to train in the model. nodeSize: the minimum number of items in the leaf nodes. samplingRate: the ratio of training data used to grow each tree. splitRule: the impurity measure used for selecting the best split. numRandomFeatures: the number of features the model randomly chooses for selecting the best split. Similarly, a logistic regression is trained as follows: LogisticRegression lr = new LogisticRegression.Trainer() .setRegularizationFactor(lambda) .train(X, y); Once we have a model, we can use it for predicting the label of previously unseen items. For that we use the predict method: double[] row = // data int prediction = model.predict(row); This code outputs the most probable class for the given item. However, often we are more interested not in the label itself, but in the probability of having the label. If a model implements the SoftClassifier interface, then it is possible to get these probabilities like this: double[] probs = new double[2]; model.predict(row, probs); After running this code, the probs array will contain the probabilities. JSAT JSAT (Java Statistical Analysis Tool) is another Java library which contains a lot of implementations of common Machine Learning algorithms. You can check the full list of implemented models at https://github.com/EdwardRaff/JSAT/wiki/Algorithms. To include JSAT to a Java project, add this to pom: <dependency> <groupId>com.edwardraff</groupId> <artifactId>JSAT</artifactId> <version>0.0.5</version> </dependency> Unlike Smile, which just takes arrays of doubles, JSAT requires a special wrapper class for data instances. If we have an array, it is converted to the JSAT representation like this: double[][] X = ... // data int[] y = ... 
// labels // change to more than two classes for multi-class classification CategoricalData binary = new CategoricalData(2); List<DataPointPair<Integer>> data = new ArrayList<>(X.length); for (int i = 0; i < X.length; i++) { int target = y[i]; DataPoint row = new DataPoint(new DenseVector(X[i])); data.add(new DataPointPair<Integer>(row, target)); } ClassificationDataSet dataset = new ClassificationDataSet(data, binary); Once we have prepared the dataset, we can train a model. Let us consider the Random Forest classifier again: RandomForest model = new RandomForest(); model.setFeatureSamples(4); model.setMaxForestSize(150); model.trainC(dataset); First, we set some parameters for the model, and then, at the end, we call the trainC method (which means "train a classifier"). In the JSAT implementation, Random Forest has fewer options for tuning: only the number of features to select and the number of trees to grow. There are several implementations of Logistic Regression. The usual Logistic Regression model does not have any parameters, and it is trained like this: LogisticRegression model = new LogisticRegression(); model.trainC(dataset); If we want to have a regularized model, then we need to use the LogisticRegressionDCD class (DCD stands for "Dual Coordinate Descent" - this is the optimization method used to train the logistic regression). We train it like this: LogisticRegressionDCD model = new LogisticRegressionDCD(); model.setMaxIterations(maxIterations); model.setC(C); model.trainC(fold.toJsatDataset()); In this code, C is the regularization parameter, and smaller values of C correspond to a stronger regularization effect. Finally, for outputting the probabilities, we can do the following: double[] row = // data DenseVector vector = new DenseVector(row); DataPoint point = new DataPoint(vector); CategoricalResults out = model.classify(point); double probability = out.getProb(1); The class CategoricalResults contains a lot of information, including probabilities for each class and the most likely label. LIBSVM and LIBLINEAR Next we consider two similar libraries: LIBSVM and LIBLINEAR. LIBSVM (https://www.csie.ntu.edu.tw/~cjlin/libsvm/) is a library with an implementation of Support Vector Machine models, which includes support vector classifiers. LIBLINEAR (https://www.csie.ntu.edu.tw/~cjlin/liblinear/) is a library for fast linear classification algorithms such as Linear SVM and Logistic Regression. Both these libraries come from the same research group and have very similar interfaces. LIBSVM is implemented in C++ and has an officially supported Java version. It is available on Maven Central: <dependency> <groupId>tw.edu.ntu.csie</groupId> <artifactId>libsvm</artifactId> <version>3.17</version> </dependency> Note that the Java version of LIBSVM is not updated as often as the C++ version. Nevertheless, the version above is stable and should not contain bugs, but it might be slower than its C++ version. To use SVM models from LIBSVM, you first need to specify the parameters. For this, you create an svm_parameter object. Inside, you can specify many parameters, including: the kernel type (RBF, POLY or LINEAR), the regularization parameter C, and probability, which you can set to 1 to be able to get probabilities; the svm_type should be set to C_SVC: this tells LIBSVM that the model should be a classifier.
Here is an example of how you can configure an SVM classifier with the Linear kernel which can output probabilities: svm_parameter param = new svm_parameter(); param.svm_type = svm_parameter.C_SVC; param.kernel_type = svm_parameter.LINEAR; param.probability = 1; param.C = C; // default parameters param.cache_size = 100; param.eps = 1e-3; param.p = 0.1; param.shrinking = 1; The polynomial kernel is specified by the following formula: K(x, y) = (gamma * x'y + coef0)^degree. It has three additional parameters: gamma, coef0 and degree; and also C – the regularization parameter. You can specify it like this: svm_parameter param = new svm_parameter(); param.svm_type = svm_parameter.C_SVC; param.kernel_type = svm_parameter.POLY; param.C = C; param.degree = degree; param.gamma = 1; param.coef0 = 1; param.probability = 1; // plus defaults from the above Finally, the Gaussian kernel (or RBF) has the following formula: K(x, y) = exp(-gamma * ||x - y||^2). So there is one parameter, gamma, which controls the width of the Gaussians. We can specify the model with the RBF kernel like this: svm_parameter param = new svm_parameter(); param.svm_type = svm_parameter.C_SVC; param.kernel_type = svm_parameter.RBF; param.C = C; param.gamma = gamma; param.probability = 1; // plus defaults from the above Once we have created the configuration object, we need to convert the data into the right format. The LIBSVM command line application reads files in the SVMLight format, so the library also expects a sparse data representation. For a single array, the conversion is as follows: double[] dataRow = // single row vector svm_node[] svmRow = new svm_node[dataRow.length]; for (int j = 0; j < dataRow.length; j++) { svm_node node = new svm_node(); node.index = j; node.value = dataRow[j]; svmRow[j] = node; } For a matrix, we do this for every row: double[][] X = ... // data int n = X.length; svm_node[][] nodes = new svm_node[n][]; for (int i = 0; i < n; i++) { nodes[i] = wrapAsSvmNode(X[i]); } where wrapAsSvmNode is a function that wraps a vector into an array of svm_node objects. Now we can put the data and the labels together into an svm_problem object: double[] y = ... // labels svm_problem prob = new svm_problem(); prob.l = n; prob.x = nodes; prob.y = y; Now, using the data and the parameters, we can train the SVM model: svm_model model = svm.svm_train(prob, param); Once the model is trained, we can use it for classifying unseen data. Getting probabilities is done this way: double[][] X = // test data int n = X.length; double[] results = new double[n]; double[] probs = new double[2]; for (int i = 0; i < n; i++) { svm_node[] row = wrapAsSvmNode(X[i]); svm.svm_predict_probability(model, row, probs); results[i] = probs[1]; } Since we set param.probability = 1, we can use the svm.svm_predict_probability method to predict probabilities. Like in Smile, the method takes an array of doubles and writes the results to it, and then we can read the probabilities from it. Finally, while training, LIBSVM outputs a lot of things on the console. If we are not interested in this output, we can disable it with the following code snippet: svm.svm_set_print_string_function(s -> {}); Just add this at the beginning of your code and you will not see this output anymore. The next library is LIBLINEAR, which provides very fast and high-performing linear classifiers such as SVM with Linear Kernel and Logistic Regression. It can easily scale to tens and hundreds of millions of data points. Unlike LIBSVM, there is no official Java version of LIBLINEAR, but there is an unofficial Java port available at http://liblinear.bwaldvogel.de/.
To use it, include the following: <dependency> <groupId>de.bwaldvogel</groupId> <artifactId>liblinear</artifactId> <version>1.95</version> </dependency> The interface is very similar to LIBSVM. First, you define the parameters: SolverType solverType = SolverType.L1R_LR; double C = 0.001; double eps = 0.0001; Parameter param = new Parameter(solverType, C, eps); We have to specify three parameters here: solverType: defines the model which will be used; C: the amount of regularization; the smaller C is, the stronger the regularization; epsilon: the tolerance for stopping the training process; a reasonable default is 0.0001. For classification there are the following solvers: Logistic Regression: L1R_LR or L2R_LR SVM: L1R_L2LOSS_SVC or L2R_L2LOSS_SVC According to the official FAQ (which can be found here: https://www.csie.ntu.edu.tw/~cjlin/liblinear/FAQ.html) you should: Prefer SVM to Logistic Regression as it trains faster and usually gives higher accuracy. Try L2 regularization first unless you need a sparse solution – in this case use L1. The default solver is the L2-regularized linear support vector classifier for the dual problem. If it is slow, try solving the primal problem instead. Then you define the dataset. Like previously, let's first see how to wrap a row: double[] row = // data int m = row.length; Feature[] result = new Feature[m]; for (int i = 0; i < m; i++) { result[i] = new FeatureNode(i + 1, row[i]); } Please note that we add 1 to the index – index 0 is the bias term, so the actual features should start from 1. We can put this into a function wrapRow and then wrap the entire dataset as follows: double[][] X = // data int n = X.length; Feature[][] matrix = new Feature[n][]; for (int i = 0; i < n; i++) { matrix[i] = wrapRow(X[i]); } Now we can create a Problem object with the data and labels: double[] y = // labels Problem problem = new Problem(); problem.x = wrapMatrix(X); problem.y = y; problem.n = X[0].length + 1; problem.l = X.length; Note that here we also need to provide the dimensionality of the data, and it's the number of features plus one. We need to add one because it includes the bias term. Now we are ready to train the model: Model model = LibLinear.train(fold, param); When the model is trained, we can use it to classify unseen data. In the following example we will output probabilities: double[] dataRow = // data double[] probs = new double[2]; Feature[] row = wrapRow(dataRow); Linear.predictProbability(model, row, probs); double result = probs[1]; The code above works fine for the Logistic Regression model, but it will not work for SVM: SVM cannot output probabilities, so the code above will throw an error for solvers like L1R_L2LOSS_SVC. What you can do instead is get the raw output: double[] values = new double[1]; Feature[] row = wrapRow(dataRow); Linear.predictValues(model, row, values); double result = values[0]; In this case the result will not contain a probability, but some real value. When this value is greater than zero, the model predicts that the class is positive. If we would like to map this value to the [0, 1] range, we can use the sigmoid function for that: public static double[] sigmoid(double[] scores) { double[] result = new double[scores.length]; for (int i = 0; i < result.length; i++) { result[i] = 1 / (1 + Math.exp(-scores[i])); } return result; } Finally, like LIBSVM, LIBLINEAR also outputs a lot of things to standard output.
If you do not wish to see it, you can mute it with the following code: PrintStream devNull = new PrintStream(new NullOutputStream()); Linear.setDebugOutput(devNull); Here, we use NullOutputStream from Apache IO which does nothing, so the screen stays clean. When to use LIBSVM and when to use LIBLINEAR? For large datasets often it is not possible to use any kernel methods. In this case you should prefer LIBLINEAR. Additionally, LIBLINEAR is especially good for Text Processing purposes such as Document Classification. Encog Finally, we consider a library for training Neural Networks: Encog. It is available on Maven Central and can be added with the following snippet: <dependency> <groupId>org.encog</groupId> <artifactId>encog-core</artifactId> <version>3.3.0</version> </dependency> With this library you first need to specify the network architecture: BasicNetwork network = new BasicNetwork(); network.addLayer(new BasicLayer(new ActivationSigmoid(), true, noInputNeurons)); network.addLayer(new BasicLayer(new ActivationSigmoid(), true, 30)); network.addLayer(new BasicLayer(new ActivationSigmoid(), true, 1)); network.getStructure().finalizeStructure(); network.reset(); Here we create a network with one input layer, one inner layer with 30 neurons and one output layer with 1 neuron. In each layer we use sigmoid as the activation function and add the bias input (the true parameter). The last line randomly initializes the weights in the network. For both input and output Encog expects two-dimensional double arrays. In case of binary classification we typically have a one dimensional array, so we need to convert it: double[][] X = // data double[] y = // labels double[][] y2d = new double[y.length][]; for (int i = 0; i < y.length; i++) { y2d[i] = new double[] { y[i] }; } Once the data is converted, we wrap it into special wrapper class: MLDataSet dataset = new BasicMLDataSet(X, y2d); Then this dataset can be used for training: MLTrain trainer = new ResilientPropagation(network, dataset); double lambda = 0.01; trainer.addStrategy(new RegularizationStrategy(lambda)); int noEpochs = 101; for (int i = 0; i < noEpochs; i++) { trainer.iteration(); } There are a lot of other Machine Learning libraries available in Java. For example Weka, H2O, JavaML and others. It is not possible to cover all of them, but you can also try them and see if you like them more than the ones we have covered. Evaluation We have covered many Machine Learning libraries, and many of them implement the same algorithms like Random Forest or Logistic Regression. Also, each individual model can have many different parameters: a Logistic Regression has the regularization coefficient, an SVM is configured by setting the kernel and its parameters. How do we select the best single model out of so many possible variants? For that we first define some evaluation metric and then select the model which achieves the best possible performance with respect to this metric. For binary classification we can select one of the following metrics: Accuracy and Error Precision, Recall and F1 AUC (AU ROC) Result Evaluation K-Fold Cross Validation Training, Validation and Testing. Accuracy Accuracy tells us for how many examples the model predicted the correct label. Calculating it is trivial: int n = actual.length; double[] proba = // predictions; double[] prediction = Arrays.stream(proba).map(p -> p > threshold ? 
1.0 : 0.0).toArray(); int correct = 0; for (int i = 0; i < n; i++) { if (actual[i] == prediction[i]) { correct++; } } double accuracy = 1.0 * correct / n; Accuracy is the simplest evaluation metric and everybody understands it. Precision, Recall and F1 In some cases accuracy is not the best measure of model performance. For example, suppose we have an unbalanced dataset: only 1% of the examples are positive. Then a model which always predicts negative is right in 99% of cases, and hence will have an accuracy of 0.99. But this model is not useful. There are alternatives to accuracy that can overcome this problem. Precision and Recall are among these metrics: they both look at the fraction of positive items that the model correctly recognized. They can be calculated using the Confusion Matrix: a table which summarizes the performance of a binary classifier in terms of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). Precision is the fraction of correctly predicted positive items among all items the model predicted positive. In terms of the confusion matrix, Precision is TP / (TP + FP). Recall is the fraction of correctly predicted positive items among items that are actually positive. With values from the confusion matrix, Recall is TP / (TP + FN). It is often hard to decide whether one should optimize Precision or Recall. But there is another metric which combines both Precision and Recall into one number, and it is called the F1 score. For calculating Precision and Recall, we first need to calculate the values for the cells of the confusion matrix: int tp = 0, tn = 0, fp = 0, fn = 0; for (int i = 0; i < actual.length; i++) { if (actual[i] == 1.0 && proba[i] > threshold) { tp++; } else if (actual[i] == 0.0 && proba[i] <= threshold) { tn++; } else if (actual[i] == 0.0 && proba[i] > threshold) { fp++; } else if (actual[i] == 1.0 && proba[i] <= threshold) { fn++; } } Then we can use the values to calculate Precision and Recall: double precision = 1.0 * tp / (tp + fp); double recall = 1.0 * tp / (tp + fn); Finally, F1 can be calculated using the following formula: double f1 = 2 * precision * recall / (precision + recall); ROC and AU ROC (AUC) The metrics above are good for binary classifiers which produce hard output: they only tell if the class should be assigned a positive or a negative label. If instead our model outputs some score such that the higher the value of the score, the more likely the item is to be positive, then the binary classifier is called a ranking classifier. Most of the models can output probabilities of belonging to a certain class, and we can use them to rank examples such that the positive ones are likely to come first. The ROC Curve visually tells us how well a ranking classifier separates positive examples from negative ones. The way a ROC curve is built is the following: We sort the observations by their score and then, starting from the origin, we go up if the observation is positive and right if it is negative. This way, in the ideal case, we first always go up, and then always go right – and this will result in the best possible ROC curve. In this case we can say that the separation between positive and negative examples is perfect. If the separation is not perfect, but still OK, the curve will go up for positive examples, but sometimes will turn right when a misclassification occurs. Finally, a bad classifier will not be able to tell positive and negative examples apart and the curve would alternate between going up and right.
The diagonal line on the plot represents the baseline – the performance that a random classifier would achieve. The further away the curve is from the baseline, the better. Unfortunately, there is no easy-to-use implementation of ROC curves available in Java. So the algorithm for drawing a ROC curve is the following: Let POS be the number of positive labels, and NEG be the number of negative labels Order the data by score, decreasing Start from (0, 0) For each example in the sorted order, if the example is positive, move 1 / POS up in the graph; otherwise, move 1 / NEG right in the graph. This is a simplified algorithm and assumes that the scores are distinct. If the scores aren't distinct, and there are different actual labels for the same score, some adjustment needs to be made. It is implemented in the class RocCurve, which you will find in the source code. You can use it as follows: RocCurve.plot(actual, prediction); Calling it will create a ROC curve plot for your predictions. The area under the curve says how good the separation is. If the separation is very good, then the area will be close to one. But if the classifier cannot distinguish between positive and negative examples, the curve will go around the random baseline curve, and the area will be close to 0.5. Area Under the Curve is often abbreviated as AUC, or, sometimes, AU ROC – to emphasize that the Curve is a ROC Curve. AUC has a very nice interpretation: the value of AUC corresponds to the probability that a randomly selected positive example is scored higher than a randomly selected negative example. Naturally, if this probability is high, our classifier does a good job separating positive and negative examples. This makes AUC a go-to evaluation metric for many cases, especially when the dataset is unbalanced – in the sense that there are a lot more examples of one class than another. Luckily, there are implementations of AUC in Java. For example, it is implemented in Smile. You can use it like this: double[] predicted = ... // int[] truth = ... // double auc = AUC.measure(truth, predicted);
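If you prefer not to depend on a library for the number itself, the step-walking procedure described above translates directly into a few lines of plain Java. Here is a minimal sketch that assumes there are no tied scores (ties would need the averaging adjustment mentioned earlier):

import java.util.Arrays;
import java.util.Comparator;

public static double auc(int[] actual, double[] score) {
    int n = actual.length;
    Integer[] order = new Integer[n];
    for (int i = 0; i < n; i++) {
        order[i] = i;
    }
    // walk through the examples by decreasing score
    Arrays.sort(order, Comparator.comparingDouble((Integer i) -> score[i]).reversed());

    long pos = Arrays.stream(actual).filter(label -> label == 1).count();
    long neg = n - pos;

    double height = 0.0; // current height of the curve: fraction of positives seen so far
    double area = 0.0;
    for (int idx : order) {
        if (actual[idx] == 1) {
            height += 1.0 / pos;        // positive example: move up
        } else {
            area += height * 1.0 / neg; // negative example: move right and add a rectangle
        }
    }
    return area;
}

As a quick sanity check, for actual = {1, 0} with scores {0.9, 0.1} this returns 1.0 (perfect ranking), while swapping the scores returns 0.0.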
Result Validation When learning from data there is always the danger of overfitting. Overfitting occurs when the model starts learning the noise in the data instead of detecting useful patterns. It is always important to check if a model overfits – otherwise it will not be useful when applied to unseen data. The typical and most practical way of checking whether a model overfits or not is to emulate "unseen data" – that is, take a part of the available labeled data and do not use it for training. This technique is called "hold out": we hold out a part of the data and use it only for evaluation. Often we shuffle the original dataset before splitting. In many cases we make the simplifying assumption that the order of the data is not important – that is, one observation has no influence on another. In this case shuffling the data prior to splitting will remove effects that the order of items might have. On the other hand, if the data is Time Series data, then shuffling it is not a good idea, because there is some dependence between observations. So, let us implement the hold-out split. We assume that the data we have is already represented by X – a two-dimensional array of doubles with the features – and y – a one-dimensional array of labels. First, let us create a helper class for holding the data: public class Dataset { private final double[][] X; private final double[] y; // constructor and getters are omitted } Splitting our dataset should produce two datasets, so let us create a class for that as well: public class Split { private final Dataset train; private final Dataset test; // constructor and getters are omitted } Now suppose we want to split the data into two parts: train and test. We also want to specify the size of the train set; we will do it using a testRatio parameter: the percentage of items that should go to the test set. So the first thing we do is generate an array with indexes and then split it according to testRatio: int[] indexes = IntStream.range(0, dataset.length()).toArray(); int trainSize = (int) (indexes.length * (1 - testRatio)); int[] trainIndex = Arrays.copyOfRange(indexes, 0, trainSize); int[] testIndex = Arrays.copyOfRange(indexes, trainSize, indexes.length); We can also shuffle the indexes if we want: Random rnd = new Random(seed); for (int i = indexes.length - 1; i > 0; i--) { int index = rnd.nextInt(i + 1); int tmp = indexes[index]; indexes[index] = indexes[i]; indexes[i] = tmp; } Then we can select the instances for the training set as follows: int trainSize = trainIndex.length; double[][] trainX = new double[trainSize][]; double[] trainY = new double[trainSize]; for (int i = 0; i < trainSize; i++) { int idx = trainIndex[i]; trainX[i] = X[idx]; trainY[i] = y[idx]; } And then finally wrap it into our Dataset class: Dataset train = new Dataset(trainX, trainY); If we repeat the same for the test set, we can put both train and test sets into a Split object: Split split = new Split(train, test); And now we can use the train fold for training and the test fold for testing the models. If we put all the code above into a function of the Dataset class, for example, trainTestSplit, we can use it as follows: Split split = dataset.trainTestSplit(0.2); Dataset train = split.getTrain(); // train the model using train.getX() and train.getY() Dataset test = split.getTest(); // test the model using test.getX(); test.getY(); K-Fold Cross Validation Holding out only one part of the data may not always be the best option. What we can do instead is split it into K parts and then test the models on only 1/Kth of the data. This is called K-Fold Cross-Validation: it not only gives the performance estimation, but also the possible spread of the error. Typically we are interested in models which give good and consistent performance. K-Fold Cross-Validation helps us to select such models. The way we prepare the data for K-Fold Cross-Validation is the following: First, split the data into K parts Then, for each of these parts: Take one part as the validation set Take the remaining K-1 parts as the training set If we translate this into Java, the first step will look like this: int[] indexes = IntStream.range(0, dataset.length()).toArray(); int[][] foldIndexes = new int[k][]; int step = indexes.length / k; int beginIndex = 0; for (int i = 0; i < k - 1; i++) { foldIndexes[i] = Arrays.copyOfRange(indexes, beginIndex, beginIndex + step); beginIndex = beginIndex + step; } foldIndexes[k - 1] = Arrays.copyOfRange(indexes, beginIndex, indexes.length); This creates an array of indexes for each fold. You can also shuffle the indexes array as previously.
Now we can create splits from each fold: List<Split> result = new ArrayList<>(); for (int i = 0; i < k; i++) { int[] testIdx = folds[i]; int[] trainIdx = combineTrainFolds(folds, indexes.length, i); result.add(Split.fromIndexes(dataset, trainIdx, testIdx)); } In the code above we have two additional methods: combineTrainFolds takes K-1 arrays of indexes and combines them into one, and Split.fromIndexes creates a split given the train and test indexes. We have already covered the second function when we created a simple hold-out test set. And the first function, combineTrainFolds, is implemented like this: private static int[] combineTrainFolds(int[][] folds, int totalSize, int excludeIndex) { int size = totalSize - folds[excludeIndex].length; int result[] = new int[size]; int start = 0; for (int i = 0; i < folds.length; i++) { if (i == excludeIndex) { continue; } int[] fold = folds[i]; System.arraycopy(fold, 0, result, start, fold.length); start = start + fold.length; } return result; } Again, we can put the code above into a function of the Dataset class and call it as follows: List<Split> folds = train.kfold(3); Now that we have a list of Split objects, we can create a special function for performing Cross-Validation: public static DescriptiveStatistics crossValidate(List<Split> folds, Function<Dataset, Model> trainer) { double[] aucs = folds.parallelStream().mapToDouble(fold -> { Dataset foldTrain = fold.getTrain(); Dataset foldValidation = fold.getTest(); Model model = trainer.apply(foldTrain); return auc(model, foldValidation); }).toArray(); return new DescriptiveStatistics(aucs); } This function takes a list of folds and a callback which creates a model. Then, after the model is trained, we calculate AUC for it. Additionally, we take advantage of Java's ability to parallelize loops and train models on each fold at the same time. Finally, we put the AUCs calculated on each fold into a DescriptiveStatistics object, which can later on be used to return the mean and the standard deviation of the AUCs. As you probably remember, the DescriptiveStatistics class comes from the Apache Commons Math library. Let us consider an example. Suppose we want to use Logistic Regression from LIBLINEAR and select the best value for the regularization parameter C. We can use the function above this way: double[] Cs = { 0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0 }; for (double C : Cs) { DescriptiveStatistics summary = crossValidate(folds, fold -> { Parameter param = new Parameter(SolverType.L1R_LR, C, 0.0001); return LibLinear.train(fold, param); }); double mean = summary.getMean(); double std = summary.getStandardDeviation(); System.out.printf("L1 logreg C=%7.3f, auc=%.4f ± %.4f%n", C, mean, std); } Here, LibLinear.train is a helper method which takes a Dataset object and a Parameter object and then trains a LIBLINEAR model. This will print the AUC for all provided values of C, so you can see which one is the best, and pick the one with the highest mean AUC. Training, Validation and Testing When doing the Cross-Validation there's still a danger of overfitting. Since we try a lot of different experiments on the same validation set, we might accidentally pick the model which just happened to do well on the validation set – but it may later on fail to generalize to unseen data. The solution to this problem is to hold out a test set at the very beginning and not touch it at all until we have selected what we think is the best model; we use it only for evaluating the final model.
So how do we select the best model? What we can do is perform Cross-Validation on the remaining train data. It can be either a hold-out set or K-Fold Cross-Validation. In general, you should prefer K-Fold Cross-Validation because it also gives you the spread of performance, and you may use it for model selection as well. The typical workflow should be the following:

(0) Select some metric for validation, e.g. accuracy or AUC
(1) Split all the data into train and test sets
(2) Split the training data further and hold out a validation dataset or split it into K folds
(3) Use the validation data for model selection and parameter optimization
(4) Select the best model according to the validation set and evaluate it against the held-out test set

It is important to avoid looking at the test set too often. It should be used only occasionally, for the final evaluation, to make sure the selected model does not overfit. If the validation scheme is set up properly, the validation score should correspond to the final test score. If this happens, we can be sure that the model does not overfit and is able to generalize to unseen data. Using the classes and the code we created previously, this translates to the following Java code:

Dataset data = new Dataset(X, y);
Split split = data.trainTestSplit(0.2);
Dataset train = split.getTrain();
List<Split> folds = train.kfold(3);
// now use crossValidate(folds, ...) to select the best model
Dataset test = split.getTest();
// do the final evaluation of the best model on test

With this information we are ready to do a project on Binary Classification.

Summary

In this article we spoke about supervised machine learning and about two common supervised problems: Classification and Regression. We also covered the libraries in which the commonly used algorithms are implemented, and saw how to evaluate the performance of these algorithms. There is another family of Machine Learning algorithms that do not require the label information: these methods are called Unsupervised Learning.

Resources for Article:

Further resources on this subject:

Supervised Machine Learning [article]
Clustering and Other Unsupervised Learning Methods [article]
Data Clustering [article]

Using the Raspberry Pi Camera Module

Packt
06 Apr 2017
6 min read
This article by Shervin Emami, co-author of the book, Mastering OpenCV 3 - Second Edition, explains how to use the Raspberry Pi Camera Module for your Cartoonifier and Skin changer applications. While using a USB webcam on Raspberry Pi has the convenience of supporting identical behavior & code on desktop as on embedded device, you might consider using one of the official Raspberry Pi Camera Modules (referred to as the RPi Cams). They have some advantages and disadvantages over USB webcams: (For more resources related to this topic, see here.) RPi Cams uses the special MIPI CSI camera format, designed for smartphone cameras to use less power, smaller physical size, faster bandwidth, higher resolutions, higher framerates, and reduced latency, compared to USB. Most USB2.0 webcams can only deliver 640x480 or 1280x720 30 FPS video, since USB2.0 is too slow for anything higher (except for some expensive USB webcams that perform onboard video compression) and USB3.0 is still too expensive. Whereas smartphone cameras (including the RPi Cams) can often deliver 1920x1080 30 FPS or even Ultra HD / 4K resolutions. The RPi Cam v1 can in fact deliver up to 2592x1944 15 FPS or 1920x1080 30 FPS video even on a $5 Raspberry Pi Zero, thanks to the use of MIPI CSI for the camera and a compatible video processing ISP & GPU hardware inside the Raspberry Pi. The RPi Cam also supports 640x480 in 90 FPS mode (such as for slow-motion capture), and this is quite useful for real-time computer vision so you can see very small movements in each frame, rather than large movements that are harder to analyze. However the RPi Cam is a plain circuit board that is highly sensitive to electrical interference, static electricity or physical damage (simply touching the small orange flat cable with your finger can cause video interference or even permanently damage your camera!). The big flat white cable is far less sensitive but it is still very sensitive to electrical noise or physical damage. The RPi Cam comes with a very short 15cm cable. It's possible to buy third-party cables on eBay with lengths between 5cm to 1m, but cables 50cm or longer are less reliable, whereas USB webcams can use 2m to 5m cables and can be plugged into USB hubs or active extension cables for longer distances. There are currently several different RPi Cam models, notably the NoIR version that doesn't have an internal Infrared filter, therefore a NoIR camera can easily see in the dark (if you have an invisible Infrared light source), or see Infrared lasers or signals far clearer than regular cameras that include an Infrared filter inside them. There are also 2 different versions of RPi Cam (shown above): RPi Cam v1.3 and RPi Cam v2.1, where the v2.1 uses a wider angle lens with a Sony 8 Mega-Pixel sensor instead of a 5 Mega-Pixel OmniVision sensor, and has better support for motion in low lighting conditions, and adds support for 3240x2464 15 FPS video and potentially upto 120 FPS video at 720p. However, USB webcams come in thousands of different shapes & versions, making it easy to find specialized webcams such as waterproof or industrial-grade webcams, rather than requiring you to create your own custom housing for an RPi Cam. IP Cameras are also another option for a camera interface that can allow 1080p or higher resolution videos with Raspberry Pi, and IP cameras support not just very long cables, but potentially even work anywhere in the world using Internet. 
But IP cameras aren't quite as easy to interface with OpenCV as USB webcams or the RPi Cam. In the past, RPi Cams and the official drivers weren't directly compatible with OpenCV, you often used custom drivers and modified your code in order to grab frames from RPi Cams, but it's now possible to access an RPi Cam in OpenCV the exact same way as a USB Webcam! Thanks to recent improvements in the v4l2 drivers, once you load the v4l2 driver the RPi Cam will appear as file /dev/video0 or /dev/video1 like a regular USB webcam. So traditional OpenCV webcam code such as cv::VideoCapture(0) will be able to use it just like a webcam. Installing the Raspberry Pi Camera Module driver First let's temporarily load the v4l2 driver for the RPi Cam to make sure our camera is plugged in correctly: sudo modprobe bcm2835-v4l2 If the command failed (that is, it printed an error message to the console, or it froze, or the command returned a number beside 0), then perhaps your camera is not plugged in correctly. Shutdown then unplug power from your RPi and try attaching the flat white cable again, looking at photos on the Web to make sure it's plugged in the correct way around. If it is the correct way around, it's possible the cable wasn't fully inserted before you closed the locking tab on the RPi. Also check if you forgot to click Enable Camera when configuring your Raspberry Pi earlier, using the sudo raspi-config command. If the command worked (that is, the command returned 0 and no error was printed to the console), then we can make sure the v4l2 driver for the RPi Cam is always loaded on bootup, by adding it to the bottom of the /etc/modules file: sudo nano /etc/modules # Load the Raspberry Pi Camera Module v4l2 driver on bootup: bcm2835-v4l2 After you save the file and reboot your RPi, you should be able to run ls /dev/video* to see a list of cameras available on your RPi. If the RPi Cam is the only camera plugged into your board, you should see it as the default camera (/dev/video0), or if you also have a USB webcam plugged in then it will be either /dev/video0 or /dev/video1. Let's test the RPi Cam using the starter_video sample program: cd ~/opencv-3.*/samples/cpp DISPLAY=:0 ./starter_video 0 If it's showing the wrong camera, try DISPLAY=:0 ./starter_video 1. Now that we know the RPi Cam is working in OpenCV, let's try Cartoonifier! cd ~/Cartoonifier DISPLAY=:0 ./Cartoonifier 0 (or DISPLAY=:0 ./Cartoonifier 1 for the other camera). Resources for Article: Further resources on this subject: Video Surveillance, Background Modeling [article] Performance by Design [article] Getting started with Android Development [article]

Learning Cassandra

Packt
06 Apr 2017
17 min read
In this article by Sandeep Yarabarla, the author of the book Learning Apache Cassandra - Second Edition, we will built products using relational databases like MySQL and PostgreSQL, and perhaps experimented with NoSQL databases including a document store like MongoDB or a key-value store like Redis. While each of these tools has its strengths, you will now consider whether a distributed database like Cassandra might be the best choice for the task at hand. In this article, we'll begin with the need for NoSQL databases to satisfy the conundrum of ever growing data. We will see why NoSQL databases are becoming the de facto choice for big data and real-time web applications. We will also talk about the major reasons to choose Cassandra from among the many database options available to you. Having established that Cassandra is a great choice, we'll go through the nuts and bolts of getting a local Cassandra installation up and running (For more resources related to this topic, see here.) What is Big Data Big Data is a relatively new term which has been gathering steam over the past few years. Big Data is a term used for data sets that are relatively large to be stored in a traditional database system or processed by traditional data processing pipelines. This data could be structured, semi-structured or unstructured data. The data sets that belong to this category usually scale to terabytes or petabytes of data. Big Data usually involves one or more of the following: Velocity: Data moves at an unprecedented speed and must be dealt with it in a timely manner. Volume: Organizations collect data from a variety of sources, including business transactions, social media and information from sensor or machine-to-machine data. This could involve terabytes to petabytes of data. In the past, storing it would've been a problem – but new technologies have eased the burden. Variety: Data comes in all sorts of formats ranging from structured data to be stored in traditional databases to unstructured data (blobs) such as images, audio files and text files. These are known as the 3 Vs of Big Data. In addition to these, we tend to associate another term with Big Data. Complexity: Today's data comes from multiple sources, which makes it difficult to link, match, cleanse and transform data across systems. However, it's necessary to connect and correlate relationships, hierarchies and multiple data linkages or your data can quickly spiral out of control. It must be able to traverse multiple data centers, cloud and geographical zones. Challenges of modern applications Before we delve into the shortcomings of relational systems to handle Big Data, let's take a look at some of the challenges faced by modern web facing and Big Data applications. Later, this will give an insight into how NoSQL data stores or Cassandra in particular helps solve these issues. One of the most important challenges faced by a web facing application is it should be able to handle a large number of concurrent users. Think of a search engine like google which handles millions of concurrent users at any given point of time or a large online retailer. The response from these applications should be swift even as the number of users keeps on growing. Modern applications need to be able to handle large amounts of data which can scale to several petabytes of data and beyond. 
Consider a large social network with a few hundred million users: Think of the amount of data generated in tracking and managing those users Think of how this data can be used for analytics Business critical applications should continue running without much impact even when there is a system failure or multiple system failures (server failure, network failure, and so on.). The applications should be able to handle failures gracefully without any data loss or interruptions. These applications should be able to scale across multiple data centers and geographical regions to support customers from different regions around the world with minimum delay. Modern applications should be implementing fully distributed architectures and should be capable of scaling out horizontally to support any data size or any number of concurrent users. Why not relational Relational database systems (RDBMS) have been the primary data store for enterprise applications for 20 years. Lately, NoSQL databases have been picking a lot of steam and businesses are slowly seeing a shift towards non-relational databases. There are a few reasons why relational databases don't seem like a good fit for modern web applications. Relational databases are not designed for clustered solutions. There are some solutions that shard data across servers but these are fragile, complex and generally don't work well. Sharding solutions implemented by RDBMS MySQL's product MySQL cluster provides clustering support which adds many capabilities of non-relational systems. It is actually a NoSQL solution that integrates with MySQL relational database. It partitions the data onto multiple nodes and the data can be accessed via different APIs. Oracle provides a clustering solution, Oracle RAC which involves multiple nodes running an oracle process accessing the same database files. This creates a single point of failure as well as resource limitations in accessing the database itself. They are not a good fit current hardware and architectures. Relational databases are usually scaled up by using larger machines with more powerful hardware and maybe clustering and replication among a small number of nodes. Their core architecture is not a good fit for commodity hardware and thus doesn't work with scale out architectures. Scale out vs Scale up architecture Scaling out means adding more nodes to a system such as adding more servers to a distributed database or filesystem. This is also known as horizontal scaling. Scaling up means adding more resources to a single node within the system such as adding more CPU, memory or disks to a server. This is also known as vertical scaling. How to handle Big Data Now that we are convinced the relational model is not a good fit for Big Data, let's try to figure out ways to handle Big Data. These are the solutions that paved the way for various NoSQL databases. Clustering: The data should be spread across different nodes in a cluster. The data should be replicated across multiple nodes in order to sustain node failures. This helps spread the data across the cluster and different nodes contain different subsets of data. This improves performance and provides fault tolerance. A node is an instance of database software running on a server. Multiple instances of the same database could be running on the same server. Flexible schema: Schemas should be flexible unlike the relational model and should evolve with data. 
Relax consistency: We should embrace the concept of eventual consistency which means data will eventually be propagated to all the nodes in the cluster (in case of replication). Eventual consistency allows data replication across nodes with minimum overhead. This allows for fast writes with the need for distributed locking. De-normalization of data: De-normalize data to optimize queries. This has to be done at the cost of writing and maintaining multiple copies of the same data. What is Cassandra and why Cassandra Cassandra is a fully distributed, masterless database, offering superior scalability and fault tolerance to traditional single master databases. Compared with other popular distributed databases like Riak, HBase, and Voldemort, Cassandra offers a uniquely robust and expressive interface for modeling and querying data. What follows is an overview of several desirable database capabilities, with accompanying discussions of what Cassandra has to offer in each category. Horizontal scalability Horizontal scalability refers to the ability to expand the storage and processing capacity of a database by adding more servers to a database cluster. A traditional single-master database's storage capacity is limited by the capacity of the server that hosts the master instance. If the data set outgrows this capacity, and a more powerful server isn't available, the data set must be sharded among multiple independent database instances that know nothing of each other. Your application bears responsibility for knowing to which instance a given piece of data belongs. Cassandra, on the other hand, is deployed as a cluster of instances that are all aware of each other. From the client application's standpoint, the cluster is a single entity; the application need not know, nor care, which machine a piece of data belongs to. Instead, data can be read or written to any instance in the cluster, referred to as a node; this node will forward the request to the instance where the data actually belongs. The result is that Cassandra deployments have an almost limitless capacity to store and process data. When additional capacity is required, more machines can simply be added to the cluster. When new machines join the cluster, Cassandra takes careof rebalancing the existing data so that each node in the expanded cluster has a roughly equal share. Also, the performance of a Cassandra cluster is directly proportional to the number of nodes within the cluster. As you keep on adding instances, the read and write throughput will keep increasing proportionally. Cassandra is one of the several popular distributed databases inspired by the Dynamo architecture, originally published in a paper by Amazon. Other widely used implementations of Dynamo include Riak and Voldemort. You can read the original paper at http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf High availability The simplest database deployments are run as a single instance on a single server. This sort of configuration is highly vulnerable to interruption: if the server is affected by a hardware failure or network connection outage, the application's ability to read and write data is completely lost until the server is restored. If the failure is catastrophic, the data on that server might be lost completely. Master-slave architectures improve this picture a bit. The master instance receives all write operations, and then these operations are replicated to follower instances. 
The application can read data from the master or any of the follower instances, so a single host becoming unavailable will not prevent the application from continuing to read data. A failure of the master, however, will still prevent the application from performing any write operations, so while this configuration provides high read availability, it doesn't completely provide high availability. Cassandra, on the other hand, has no single point of failure for reading or writing data. Each piece of data is replicated to multiple nodes, but none of these nodes holds the authoritative master copy. All the nodes in a Cassandra cluster are peers without a master node. If a machine becomes unavailable, Cassandra will continue writing data to the other nodes that share data with that machine, and will queue the operations and update the failed node when it rejoins the cluster. This means in a typical configuration, multiple nodes must fail simultaneously for there to be any application-visible interruption in Cassandra's availability. How many copies? When you create a keyspace - Cassandra's version of a database, you specify how many copies of each piece of data should be stored; this is called the replication factor. A replication factor of 3 is a common choice for most use cases. Write optimization Traditional relational and document databases are optimized for read performance. Writing data to a relational database will typically involve making in-place updates to complicated data structures on disk, in order to maintain a data structure that can be read efficiently and flexibly. Updating these data structures is a very expensive operation from a standpoint of disk I/O, which is often the limiting factor for database performance. Since writes are more expensive than reads, you'll typically avoid any unnecessary updates to a relational database, even at the expense of extra read operations. Cassandra, on the other hand, is highly optimized for write throughput, and in fact never modifies data on disk; it only appends to existing files or creates new ones. This is much easier on disk I/O and means that Cassandra can provide astonishingly high write throughput. Since both writing data to Cassandra, and storing data in Cassandra, are inexpensive, denormalization carries little cost and is a good way to ensure that data can be efficiently read in various access scenarios. Because Cassandra is optimized for write volume; you shouldn't shy away from writing data to the database. In fact, it's most efficient to write without reading whenever possible, even if doing so might result in redundant updates. Just because Cassandra is optimized for writes doesn't make it bad at reads; in fact, a well-designed Cassandra database can handle very heavy read loads with no problem. Structured records The first three database features we looked at are commonly found in distributed data stores. However, databases like Riak and Voldemort are purely key-value stores; these databases have no knowledge of the internal structure of a record that's stored at a particular key. This means useful functions like updating only part of a record, reading only certain fields from a record, or retrieving records that contain a particular value in a given field are not possible. 
Relational databases like PostgreSQL, document stores like MongoDB, and, to a limited extent, newer key-value stores like Redis do have a concept of the internal structure of their records, and most application developers are accustomed to taking advantage of the possibilities this allows. None of these databases, however, offer the advantages of a masterless distributed architecture. In Cassandra, records are structured much in the same way as they are in a relational database—using tables, rows, and columns. Thus, applications using Cassandra can enjoy all the benefits of masterless distributed storage while also getting all the advanced data modeling and access features associated with structured records. Secondary indexes A secondary index, commonly referred to as an index in the context of a relational database, is a structure allowing efficient lookup of records by some attribute other than their primary key. This is a widely useful capability: for instance, when developing a blog application, you would want to be able to easily retrieve all of the posts written by a particular author. Cassandra supports secondary indexes; while Cassandra's version is not as versatile as indexes in a typical relational database, it's a powerful feature in the right circumstances. Materialized views Data modelling principles in Cassandra compel us to denormalize data as much as possible. Prior to Cassandra 3.0, the only way to query on a non-primary key column was to create a secondary index and query on it. However, secondary indexes have a performance tradeoff if it contains high cardinality data. Often, high cardinality secondary indexes have to scan data on all the nodes and aggregate them to return the query results. This defeats the purpose of having a distributed system. To avoid secondary indexes and client-side denormalization, Cassandra introduced the feature of Materialized views which does server-side denormalization. You can create views for a base table and Cassandra ensures eventual consistency between the base and view. This lets us do very fast lookups on each view following the normal Cassandra read path. Materialized views maintain a correspondence of one CQL row each in the base and the view, so we need to ensure that each CQL row which is required for the views will be reflected in the base table's primary keys. Although, materialized view allows for fast lookups on non-primary key indexes; this comes at a performance hit to writes. Efficient result ordering It's quite common to want to retrieve a record set ordered by a particular field; for instance, a photo sharing service will want to retrieve the most recent photographs in descending order of creation. Since sorting data on the fly is a fundamentally expensive operation, databases must keep information about record ordering persisted on disk in order to efficiently return results in order. In a relational database, this is one of the jobs of a secondary index. In Cassandra, secondary indexes can't be used for result ordering, but tables can be structured such that rows are always kept sorted by a given column or columns, called clustering columns. Sorting by arbitrary columns at read time is not possible, but the capacity to efficiently order records in any way, and to retrieve ranges of records based on this ordering, is an unusually powerful capability for a distributed database. 
Immediate consistency When we write a piece of data to a database, it is our hope that that data is immediately available to any other process that may wish to read it. From another point of view, when we read some data from a database, we would like to be guaranteed that the data we retrieve is the most recently updated version. This guarantee is calledimmediate consistency, and it's a property of most common single-master databases like MySQL and PostgreSQL. Distributed systems like Cassandra typically do not provide an immediate consistency guarantee. Instead, developers must be willing to accept eventual consistency, which means when data is updated, the system will reflect that update at some point in the future. Developers are willing to give up immediate consistency precisely because it is a direct tradeoff with high availability. In the case of Cassandra, that tradeoff is made explicit through tunable consistency. Each time you design a write or read path for data, you have the option of immediate consistency with less resilient availability, or eventual consistency with extremely resilient availability. Discretely writable collections While it's useful for records to be internally structured into discrete fields, a given property of a record isn't always a single value like a string or an integer. One simple way to handle fields that contain collections of values is to serialize them using a format like JSON, and then save the serialized collection into a text field. However, in order to update collections stored in this way, the serialized data must be read from the database, decoded, modified, and then written back to the database in its entirety. If two clients try to perform this kind of modification to the same record concurrently, one of the updates will be overwritten by the other. For this reason, many databases offer built-in collection structures that can be discretely updated: values can be added to, and removed from collections, without reading and rewriting the entire collection. Neither the client nor Cassandra itself needs to read the current state of the collection in order to update it, meaning collection updates are also blazingly efficient. Relational joins In real-world applications, different pieces of data relate to each other in a variety of ways. Relational databases allow us to perform queries that make these relationships explicit, for instance, to retrieve a set of events whose location is in the state of New York (this is assuming events and locations are different record types). Cassandra, however, is not a relational database, and does not support anything like joins. Instead, applications using Cassandra typically denormalize data and make clever use of clustering in order to perform the sorts of data access that would use a join in a relational database. For datasets that aren't already denormalized, applications can also perform client-side joins, which mimic the behavior of a relational database by performing multiple queries and joining the results at the application level. Client-side joins are less efficient than reading data that has been denormalized in advance, but offer more flexibility. Summary In this article, you explored the reasons to choose Cassandra from among the many databases available, and having determined that Cassandra is a great choice, you installed it on your development machine. 
You had your first taste of the Cassandra Query Language when you issued your first few commands through the CQL shell in order to create a keyspace, table, and insert and read data. You're now poised to begin working with Cassandra in earnest. Resources for Article: Further resources on this subject: Setting up a Kubernetes Cluster [article] Dealing with Legacy Code [article] Evolving the data model [article]

Convolutional Neural Networks with Reinforcement Learning

Packt
06 Apr 2017
9 min read
In this article by Antonio Gulli and Sujit Pal, the authors of the book Deep Learning with Keras, we will learn about reinforcement learning, or more specifically deep reinforcement learning, that is, the application of deep neural networks to reinforcement learning. We will also see how convolutional neural networks leverage spatial information, which makes them very well suited for classifying images.

(For more resources related to this topic, see here.)

Deep convolutional neural network

A Deep Convolutional Neural Network (DCNN) consists of many neural network layers. Two different types of layers, convolutional and pooling, are typically alternated. The depth of each filter increases from left to right in the network. The last stage is typically made of one or more fully connected layers, as shown here:

There are three key intuitions behind ConvNets:

Local receptive fields
Shared weights
Pooling

Let's review them together.

Local receptive fields

If we want to preserve the spatial information, then it is convenient to represent each image with a matrix of pixels. A simple way to encode the local structure is to connect a submatrix of adjacent input neurons to one single hidden neuron belonging to the next layer. That single hidden neuron represents one local receptive field. Note that this operation is named convolution, and it gives its name to this type of network. Of course, we can encode more information by having overlapping submatrices. For instance, let's suppose that the size of each single submatrix is 5 x 5 and that those submatrices are used with MNIST images of 28 x 28 pixels; then we will be able to generate 24 x 24 local receptive field neurons in the next hidden layer, since it is possible to slide the submatrix by only 23 positions before touching the borders of the image. In Keras, the number of pixels by which the submatrix is slid is called the stride length, and this is a hyper-parameter which can be fine-tuned during the construction of our nets. This mapping from one layer to the next is called a feature map, and of course we can have multiple feature maps which learn independently in each hidden layer. For instance, we can start with 28 x 28 input neurons for processing MNIST images, and then have k feature maps of 24 x 24 neurons each (again with 5 x 5 submatrices) in the next hidden layer.

Shared weights and bias

Let's suppose that we want to move away from the raw pixel representation by gaining the ability to detect the same feature independently of the location where it is placed in the input image. A simple intuition is to use the same set of weights and bias for all the neurons in the hidden layer. In this way, each layer will learn a set of position-independent latent features derived from the image. Assuming that the input image has shape (256, 256) on 3 channels with tf (TensorFlow) ordering, it is represented as (256, 256, 3). Note that with th (Theano) mode the channels dimension (the depth) is at index 1, while in tf mode it is at index 3. In Keras, if we want to add a convolutional layer with output dimensionality 32 and an extension of each filter of 3 x 3, we will write:

model = Sequential()
model.add(Convolution2D(32, 3, 3, input_shape=(256, 256, 3)))

This means that we are applying a 3 x 3 convolution on a 256 x 256 image with 3 input channels (or input filters), resulting in 32 output channels (or output filters).
An example of convolution is provided in the following diagram: Pooling layers Let's suppose that we want to summarize the output of a feature map. Again, we can use the spatial contiguity of the output produced from a single feature map and aggregate the values of a submatrix into one single output value synthetically describing the meaning associated with that physical region. Max pooling One easy and common choice is the so-called max pooling operator which simply outputs the maximum activation as observed in the region. In Keras, if we want to define a max pooling layer of size 2 x 2 we will write: model.add(MaxPooling2D(pool_size = (2, 2))) An example of max pooling operation is given in the following diagram: Average pooling Another choice is the average pooling which simply aggregates a region into the average values of the activations observed in that region. Keras implements a large number of pooling layers and a complete list is available online. In short, all the pooling operations are nothing more than a summary operation on a given region. Reinforcement learning Our objective is to build a neural network to play the game of catch. Each game starts with a ball being dropped from a random position from the top of the screen. The objective is to move a paddle at the bottom of the screen using the left and right arrow keys to catch the ball by the time it reaches the bottom. As games go, this is quite simple. At any point in time, the state of this game is given by the (x, y) coordinates of the ball and paddle. Most arcade games tend to have many more moving parts, so a general solution is to provide the entire current game screen image as the state. The following diagram shows four consecutive screenshots of our catch game: Astute readers might note that our problem could be modeled as a classification problem, where the input to the network are the game screen images and the output is one of three actions - move left, stay, or move right. However, this would require us to provide the network with training examples, possibly from recordings of games played by experts. An alternative and simpler approach might be to build a network and have it play the game repeatedly, giving it feedback based on whether it succeeds in catching the ball or not. This approach is also more intuitive and is closer to the way humans and animals learn. The most common way to represent such a problem is through a Markov Decision Process (MDP). Our game is the environment within which the agent is trying to learn. The state of the environment at time step t is given by st (and contains the location of the ball and paddle). The agent can perform certain actions at (such as moving the paddle left or right). These actions can sometimes result in a reward rt, which can be positive or negative (such as an increase or decrease in the score). Actions change the environment and can lead to a new state st+1, where the agent can perform another action at+1, and so on. The set of states, actions and rewards, together with the rules for transitioning from one state to the other, make up a Markov decision process. A single game is one episode of this process, and is represented by a finite sequence of states, actions and rewards: Since this is a Markov decision process, the probability of state st+1 depends only on current state st and action at. Maximizing future rewards As an agent, our objective is to maximize the total reward from each game. 
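In the standard Markov decision process notation, such an episode can be written as:

s_0, a_0, r_1, s_1, a_1, r_2, …, s_{n-1}, a_{n-1}, r_n, s_n

where s_i are the states, a_i the actions, r_i the rewards received after each action, and s_n the terminal state that ends the game.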
The total reward can be represented as follows:

R = r_1 + r_2 + … + r_n

In order to maximize the total reward, the agent should try to maximize the total reward from any time point t in the game. The total reward at time step t is given by R_t and is represented as:

R_t = r_t + r_{t+1} + r_{t+2} + … + r_n

However, it is harder to predict the value of the rewards the further we go into the future. In order to take this into consideration, our agent should try to maximize the total discounted future reward at time t instead. This is done by discounting the reward at each future time step by a factor γ over the previous time step. If γ is 0, then our network does not consider future rewards at all, and if γ is 1, then our network is completely deterministic. A good value for γ is around 0.9. Factoring the equation allows us to express the total discounted future reward at a given time step recursively as the sum of the current reward and the total discounted future reward at the next time step:

R_t = r_t + γ r_{t+1} + γ² r_{t+2} + … = r_t + γ R_{t+1}

Q-learning

Deep reinforcement learning utilizes a model-free reinforcement learning technique called Q-learning. Q-learning can be used to find an optimal action for any given state in a finite Markov decision process. Q-learning tries to maximize the value of the Q-function, which represents the maximum discounted future reward when we perform action a in state s:

Q(s_t, a_t) = max R_{t+1}

Once we know the Q-function, the optimal action a at a state s is the one with the highest Q-value. We can then define a policy π(s) that gives us the optimal action at any state:

π(s) = argmax_a Q(s, a)

We can define the Q-function for a transition point (s_t, a_t, r_t, s_{t+1}) in terms of the Q-function at the next point (s_{t+1}, a_{t+1}, r_{t+1}, s_{t+2}), similar to how we did with the total discounted future reward:

Q(s_t, a_t) = r_t + γ max_{a'} Q(s_{t+1}, a')

This equation is known as the Bellman equation. The Q-function can be approximated using the Bellman equation. You can think of the Q-function as a lookup table (called a Q-table) where the states (denoted by s) are rows and actions (denoted by a) are columns, and the elements (denoted by Q(s, a)) are the rewards that you get if you are in the state given by the row and take the action given by the column. The best action to take at any state is the one with the highest reward. We start by randomly initializing the Q-table, then carry out random actions and observe the rewards to update the Q-table iteratively according to the following algorithm:

initialize Q-table Q
observe initial state s
repeat
    select and carry out action a
    observe reward r and move to new state s'
    Q(s, a) = Q(s, a) + α(r + γ max_{a'} Q(s', a') - Q(s, a))
    s = s'
until game over

You will realize that the algorithm is basically doing stochastic gradient descent on the Bellman equation, backpropagating the reward through the state space (or episode) and averaging over many trials (or epochs). Here, α is the learning rate that determines how much of the difference between the previous Q-value and the discounted new maximum Q-value should be incorporated.

Summary

We have seen the application of deep neural networks to reinforcement learning. We have also seen convolutional neural networks and how they are well suited for classifying images.

Resources for Article:

Further resources on this subject:

Deep learning and regression analysis [article]
Training neural networks efficiently using Keras [article]
Implementing Artificial Neural Networks with TensorFlow [article]

Synchronization – An Approach to Delivering Successful Machine Learning Projects

Packt
06 Apr 2017
15 min read
“In the midst of chaos, there is also opportunity”                                                                                                                - Sun Tzu In this article, by Cory Lesmeister, the author of the book Mastering Machine Learning with R - Second Edition, Cory provides insights on ensuring the success and value of your machine learning endeavors. (For more resources related to this topic, see here.) Framing the problem Raise your hand if any of the following has happened or is currently happening to you: You’ve been part of a project team that failed to deliver anything of business value You attend numerous meetings, but they don’t seem productive; maybe they are even complete time wasters Different teams are not sharing information with each other; thus, you are struggling to understand what everyone else is doing, and they have no idea what you are doing or why you are doing it An unknown stakeholder, feeling threatened by your project, comes from out of nowhere and disparages you and/or your work The Executive Committee congratulates your team on their great effort, but decides not to implement it, or even worse, tells you to go back and do it all over again, only this time solve the real problem OK, you can put your hand down now. If you didn’t raise your hand, please send me your contact information because you are about as rare as a unicorn. All organizations, regardless of their size, struggle with integrating different functions, current operations, and other projects. In short, the real-world is filled with chaos. It doesn’t matter how many advanced degrees people have, how experienced they are, how much money is thrown at the problem, what technology is used, how brilliant and powerful the machine learning algorithm is, problems such as those listed above will happen. The bottom line is that implementing machine learning projects in the business world is complicated and prone to failure. However, out of this chaos you have the opportunity to influence your organization by integrating disparate people and teams, fostering a collaborative environment that can adapt to unforeseen changes. But, be warned, this is not easy. If it was easy everyone would be doing it. However, it works and, it works well. By it, I’m talking about the methodology I developed about a dozen years ago, a method I refer to as the “Synchronization Process”. If we ask ourselves, “what are the challenges to implementation”, it seems to me that the following blog post, clearly and succinctly sums it up: https://www.capgemini.com/blog/capping-it-off/2012/04/four-key-challenges-for-business-analytics It enumerates four challenges: Strategic alignment Agility Commitment Information maturity This blog addresses business analytics, but it can be extended to machine learning projects. One could even say machine learning is becoming the analytics tool of choice in many organizations. As such, I will make the case below that the Synchronization Process can effectively deal with the first three challenges. Not only that, the process can provide additional benefits. By overcoming the challenges, you can deliver an effective project, by delivering an effective project you can increase actionable insights and by increasing actionable insights, you will improve decision-making, and that is where the real business value resides. 
Defining the process “In preparing for battle, I have always found that plans are useless, but planning is indispensable.”                                                                                       - Dwight D. Eisenhower I adopted the term synchronization from the US Army’s operations manual, FM 3-0 where it is described as a battlefield tenet and force multiplier. The manual defines synchronization as, “…arranging activities in time, space and purpose to mass maximum relative combat power at a decisive place and time”. If we overlay this military definition onto the context of a competitive marketplace, we come up with a definition I find more relevant. For our purpose, synchronization is defined as, “arranging business functions and/or tasks in time and purpose to produce the proper amount of focus on a critical event or events”. These definitions put synchronization in the context of an “endstate” based on a plan and a vision. However, it is the process of seeking to achieve that endstate that the true benefits come to fruition. So, we can look at synchronization as not only an endstate, but also as a process. The military’s solution to synchronizing operations before implementing a plan is the wargame. Like the military, businesses and corporations have utilized wargaming to facilitate decision-making and create integration of different business functions. Following the synchronization process techniques explained below, you can take the concept of business wargaming to a new level. I will discuss and provide specific ideas, steps, and deliverables that you can implement immediately. Before we begin that discussion, I want to cover the benefits that the process will deliver. Exploring the benefits of the process When I created this methodology about a dozen years ago, I was part of a market research team struggling to commit our limited resources to numerous projects, all of which were someone’s top priority, in a highly uncertain environment. Or, as I like to refer to it, just another day at the office. I knew from my military experience that I had the tools and techniques to successfully tackle these challenges. It worked then and has been working for me ever since. I have found that it delivers the following benefits to an organization: Integration of business partners and stakeholders Timely and accurate measurement of performance and effectiveness Anticipation of and planning for possible events Adaptation to unforeseen threats Exploitation of unforeseen opportunities Improvement in teamwork Fostering a collaborative environment Improving focus and prioritization In market research, and I believe it applies to all analytical endeavors, including machine learning, we talked about focusing on three specific questions about what to measure: What are we measuring? When do we measure it? How will we measure it? We found that successfully answering those questions facilitated improved decision-making by informing leadership what STOP doing, what to START doing and what to KEEP doing. I have found myself in many meetings going nowhere when I would ask a question like, “what are you looking to stop doing?” Ask leadership what they want to stop, start, or continue to do and you will get to the core of the problem. Then, your job will be to configure the business decision as the measurement/analytical problem. The Synchronization Process can bring this all together in a coherent fashion. 
I’ve been asked often about what triggers in my mind that a project requires going through the Synchronization Process. Here are some of the questions you should consider, and if you answer “yes” to any of them, it may be a good idea to implement the process: Are resources constrained to the point that several projects will suffer poor results or not be done at all? Do you face multiple, conflicting priorities? Could the external environment change and dramatically influence project(s)? Are numerous stakeholders involved or influenced by a project’s result? Is the project complex and facing a high-level of uncertainty? Does the project involve new technology? Does the project face the actual or potential for organizational change? You may be thinking, “Hey, we have a project manager for all this?” OK, how is that working out? Let me be crystal clear here, this is not just project management! This is about improving decision-making! A Gannt Chart or task management software won’t do that. You must be the agent of change. With that, let’s turn our attention to the process itself. Exploring the process Any team can take the methods elaborated on below and incorporate them to their specific situation with their specific business partners.  If executed properly, one can expect the initial investment in time and effort to provide substantial payoff within weeks of initiating the process. There are just four steps to incorporate with each having several tasks for you and your team members to complete. The four steps are as follows: Project kick-off Project analysis Synchronization exercise Project execution Let’s cover each of these in detail. I will provide what I like to refer to as a “Quad Chart” for each process step along with appropriate commentary. Project kick-off I recommend you lead the kick-off meeting to ensure all team members understand and agree to the upcoming process steps. You should place emphasis on the importance of completing the pre-work and understanding of key definitions, particularly around facts and critical assumptions. The operational definitions are as follows: Facts: Data or information that will likely have an impact on the project Critical assumptions: Valid and necessary suppositions in the absence of facts that, if proven false, would adversely impact planning or execution It is an excellent practice to link facts and assumptions. Here is an example of how that would work: It is a FACT that the Information Technology is beta-testing cloud-based solutions. We must ASSUME for planning purposes, that we can operate machine learning solutions on the cloud by the fourth quarter of this year. See, we’ve linked a fact and an assumption together and if this cloud-based solution is not available, let’s say it would negatively impact our ability to scale-up our machine learning solutions. If so, then you may want to have a contingency plan of some sort already thought through and prepared for implementation. Don’t worry if you haven’t thought of all possible assumptions or if you end up with a list of dozens. The synchronization exercise will help in identifying and prioritizing them. In my experience, identifying and tracking 10 critical assumptions at the project level is adequate. The following is the quad chart for this process step: Figure 1: Project kick-off quad chart Notice what is likely a new term, “Synchronization Matrix”. That is merely the tool used by the team to capture notes during the Synchronization Exercise. 
What you are doing is capturing time and events on the X-axis, and functions and terms on the Y-axis. Of course, this is highly customizable based on the specific circumstances and we will discuss more about it in process step number 3, that is Synchronization exercise, but here is an abbreviated example: Figure 2: Synchronization matrix example You can see in the matrix that I’ve included a row to capture critical assumptions. I can’t understate how important it is to articulate, capture, and track them. In fact, this is probably my favorite quote on the subject: … flawed assumptions are the most common cause of flawed execution. Harvard Business Review, The High Performance Organization, July-August 2005 OK, I think I’ve made my point, so let’s look at the next process step. Project analysis At this step, the participants prepare by analyzing the situation, collecting data, and making judgements as necessary. The goal is for each participant of the Synchronization Exercise to come to that meeting fully prepared. A good technique is to provide project participants with a worksheet template for them to use to complete the pre-work. A team can complete this step either individually, collectively or both. Here is the quad chart for the process step: Figure 3: Project analysis quad chart Let me expand on a couple of points. The idea of a team member creating information requirements is quite important. These are often tied back to your critical assumptions. Take the example above of the assumption around fielding a cloud-based capability. Can you think of some information requirements that might have as a potential end-user? Furthermore, can you prioritize them? OK, having done that, can you think of a plan to acquire that information and confirm or deny the underlying critical assumption? Notice also how that ties together with decision points you or others may have to make and how they may trigger contingency plans. This may sound rather basic and simplistic, but unless people are asked to think like this, articulate their requirements, share the information don’t expect anything to change anytime soon. It will be business as usual and let me ask again, “how is that working out for you?”. There is opportunity in all that chaos, so embrace it, and in the next step you will see the magic happen. Synchronization exercise The focus and discipline of the participants determine the success of this process step. This is a wargame-type exercise where team members portray their plan over time. Now, everyone gets to see how their plan relates to or even inhibits someone else’s plan and vice versa. I’ve done this step several different ways, including building the matrix on software, but the method that has consistently produced the best results is to build the matrix on large paper and put it along a conference room wall. Then, have the participants, one at a time, use post-it notes to portray their key events.  For example, the marketing manager gets up to the wall and posts “Marketing Campaign One” in the first time phase, “Marketing Campaign Two” in the final time phase, along with “Propensity Models” in the information requirements block. Iterating by participant and by time/event leads to coordination and cooperation like nothing you’ve ever seen. Another method to facilitate the success of the meeting is to have a disinterested and objective third party “referee” the meeting. This will help to ensure that any issues are captured or resolved and the process products updated accordingly. 
After the exercise, team members can incorporate the findings to their individual plans. This is an example quad chart on the process step: Figure 4: Synchronization exercise quad chart I really like the idea of execution and performance metrics. Here is how to think about them: Execution metrics—are we doing things right? Performance metrics—are we doing the right things? As you see, execution is about plan implementation, while performance metrics are about determining if the plan is making a difference (yes, I know that can be quite a dangerous thing to measure). Finally, we come to the fourth step where everything comes together during the execution of the project plan. Project execution This is a continual step in the process where a team can utilize the synchronization products to maintain situational understanding of the itself, key stakeholders, and the competitive environment. It can determine and how plans are progressing and quickly react to opportunities and threats as necessary. I recommend you update and communicate changes to the documentation on a regular basis. When I was in pharmaceutical forecasting, it was imperative that I end the business week by updating the matrices on SharePoint, which were available to all pertinent team members. The following is the quad chart for this process step: Figure 5: Project execution quad chart Keeping up with the documentation is a quick and simple process for the most part, and by doing so you will keep people aligned and cooperating. Be aware that like everything else that is new in the world, initial exuberance and enthusiasm will start to wane after several weeks. That is fine as long as you keep the documentation alive and maintain systematic communication. You will soon find that behavior is changing without anyone even taking heed, which is probably the best way to actually change behavior. A couple of words of warning. Don’t expect everyone to embrace the process wholeheartedly, which is to say that office politics may create a few obstacles. Often, an individual or even an entire business function will withhold information as “information is power”, and by sharing information they may feel they are losing power. Another issue may rise where some people feel it is needlessly complex or unnecessary. A solution to these problems is to scale back the number of core team members and utilize stakeholder analysis and a communication plan to bring they naysayers slowly into the fold. Change is never easy, but necessary nonetheless. Summary In this article, I’ve covered, at a high-level, a successful and proven process to deliver machine learning projects that will drive business value. I developed it from my numerous years of planning and evaluating military operations, including a one-year stint as a strategic advisor to the Iraqi Oil Police, adapting it to the needs of any organization. Utilizing the Synchronization Process will help any team avoid the common pitfalls of projects and improve efficiency and decision-making. It will help you become an agent of change and create influence in an organization without positional power. Resources for Article: Further resources on this subject: Machine Learning with R [article] Machine Learning Using Spark MLlib [article] Welcome to Machine Learning Using the .NET Framework [article]
WebLogic Server

Packt
15 Mar 2017
24 min read
In this article by Adrian Ward, Christian Screen, and Haroun Khan, the authors of the book Oracle Business Intelligence Enterprise Edition 12c - Second Edition, talk in a little more detail about the enterprise application server that is at the core of Oracle Fusion Middleware, WebLogic. Oracle WebLogic Server is a scalable, enterprise-ready Java Platform Enterprise Edition (Java EE) application server. Its infrastructure supports the deployment of many types of distributed applications. It is also an ideal foundation for building service-oriented architecture (SOA) applications. You can already see why BEA was a perfect acquisition for Oracle years ago. Or, more to the point, a perfect core for Fusion Middleware. (For more resources related to this topic, see here.) The WebLogic Server is a robust application in itself. In Oracle BI 12c, the WebLogic Server is crucial to the overall implementation, not just during installation but throughout the Oracle BI 12c lifecycle, which now takes advantage of the WebLogic Management Framework. Learning the management components of WebLogic Server that ultimately control the Oracle BI components is critical to the success of an implementation. These management areas within the WebLogic Server are referred to as the WebLogic Administration Server, WebLogic Managed Server(s), and the WebLogic Node Manager. A Few WebLogic Server Nuances Before we move on to a description of each of those areas within WebLogic, it is also important to understand that the WebLogic Server software that is used for the installation of the Oracle BI product suite carries a limited license. Although the software itself is the full enterprise version and carries full functionality, the license that ships with Oracle BI 12c is not a full enterprise license for WebLogic Server, so your organization cannot use it to spin off other siloed JEE deployments on other non-OBIEE servers. Two nuances are worth calling out. Clustered from the installation: the WebLogic Server license provided with out-of-the-box Oracle BI 12c does not allow for horizontal scale-out; an enterprise WebLogic Server license needs to be obtained for this advanced functionality. Contains an embedded Web/HTTP server, not Oracle HTTP Server (OHS): WebLogic Server does not ship with a separate HTTP server; the Oracle BI Enterprise Deployment Guide (available on oracle.com) discusses separating the Application Tier from the Web/HTTP tier, suggesting Oracle HTTP Server. These items are simply a few nuances of the product suite in relation to Oracle BI 12c. Most software products contain a short list such as this one. However, once you understand the nuances, it will be easier to ensure that you have a successful implementation. It also allows your team to be as prepared in advance as possible. Be sure to consult your Oracle sales representative to assist with licensing concerns. Despite these nuances, we highly recommend that, in order to learn more about the installation features, configuration options, administration, and maintenance of WebLogic, you research it not only in relation to Oracle BI, but also in its standalone form. That is to say, there is much more information at large on the topic of WebLogic Server itself than on WebLogic Server as it relates to Oracle BI. Understanding this approach to self-education or web searching should provide you with more efficient results. WebLogic Domain The highest unit of management for controlling a WebLogic Server installation is called a domain.
A domain is a logically related group of WebLogic Server resources that you manage as a unit. A domain always includes, and is centrally managed by, one Administration Server. Additional WebLogic Server instances, which are controlled by the Administration Server for the domain, are called Managed Servers. The configuration for all the servers in the domain is stored in the configuration repository, the config.xml file, which resides on the machine hosting the Administration Server. Upon installing and configuring Oracle BI 12c, the domain, bi, is established within the WebLogic Server. This is the recommended domain name for each Oracle BI 12c implementation and should not be modified. The domain path for the bi domain may appear as ORACLE_HOME/user_projects/domains/bi. This directory for the bi domain is also referred to as the DOMAIN_HOME or BI_DOMAIN folder. WebLogic Administration Server The WebLogic Server is an enterprise software suite that manages a myriad of application server components, mainly focusing on Java technology. It also comprises many ancillary components, which enable the software to scale well and make it a good choice for distributed and high-availability environments. Clearly, it is good enough to be at the core of Oracle Fusion Middleware. One of the most crucial components of WebLogic Server is the WebLogic Administration Server. When installing the WebLogic Server software, the Administration Server is automatically installed with it. It is the Administration Server that not only controls all subsequent WebLogic Server instances, called Managed Servers, but also controls such aspects as authentication-provider security (for example, LDAP) and other application-server-related configurations. WebLogic Server installs on the operating system and ultimately runs as a service on that machine. The WebLogic Server can be managed in several ways. The two main methods are via the Graphical User Interface (GUI) web application called the WebLogic Administration Console, or via the command line using the WebLogic Scripting Tool (WLST). You access the Administration Console from any networked machine using a web-based client (that is, a web browser) that can communicate with the Administration Server through the network and/or firewall. The WebLogic Administration Server and the WebLogic Server are basically synonymous. If the WebLogic Server is not running, the WebLogic Administration Console will be unavailable as well. WebLogic Managed Server Web applications, Enterprise Java Beans (EJB), and other resources are deployed onto one or more Managed Servers in a WebLogic domain. A managed server is an instance of a WebLogic Server in a WebLogic Server domain. Each WebLogic Server domain has at least one instance, which acts as the Administration Server just discussed. One administration server per domain must exist, but one or more managed servers may exist in the WebLogic Server domain. In a production deployment, Oracle BI is deployed into its own managed server. The Oracle BI installer installs two WebLogic server instances, the Admin Server and a managed server, bi_server1. Oracle BI is deployed into the managed server bi_server1, and is configured by default to resolve to port 19502; the Admin Server resolves to port 19500. Historically, this has been port 9704 for the Oracle BI managed server, and port 7001 for the Admin Server.
When administering the WebLogic Server via the Administration Console, the WebLogic Administration Server instance appears in the same list of servers, which also includes any managed servers. As a best practice, the WebLogic Administration Server should be used for configuration and management of the WebLogic Server only, and should not contain any additionally deployed applications, EJBs, and so on. One thing to note is that the Enterprise Manager Fusion Control is actually a JEE application deployed to the Administration Server instance, which is why its web client is accessible under the same port as the Admin Server. It is not necessarily a native application deployment to the core WebLogic Server, but gets deployed and configured automatically during the Oracle BI installation and configuration process. In the deployments page within the Administration Console, you will find a deployment named em. WebLogic Node Manager The general idea behind Node Manager is that it takes on somewhat of a middle-man role. That is to say, the Node Manager provides a communication tunnel between the WebLogic Administration Server and any Managed Servers configured within the WebLogic domain. When the WebLogic Server environment is contained on a single physical server, it may be difficult to recognize the need for a Node Manager. It is very necessary, however, and Node Manager lifecycle management will have to be part of any start-up and shutdown scripts you ultimately write for Oracle BI. Node Manager’s real power comes into play when Oracle BI is scaled out horizontally on one or more physical servers. Each scaled-out deployment of WebLogic Server will contain a Node Manager. If the Node Manager is not running on the server on which the Managed Server is deployed, then the core Administration Server will not be able to issue start or stop commands to that server. As such, if the Node Manager is down, communication with the overall cluster will be affected. The following figure shows how machines A, B, and C are physically separated, each containing a Node Manager. You can see that the Administration Server communicates with the Node Managers, and not with the Managed Servers directly: System tools controlled by WebLogic We briefly discussed the WebLogic Administration Console, which controls the administrative configuration of the WebLogic Server domain. This includes the components managed within it, such as security, deployed applications, and so on. The other management tool, which provides control of the deployed Oracle BI application's ancillary deployments, libraries, and several other configurations, is called Enterprise Manager Fusion Middleware Control. This seems to be a long name for a single web-based tool. As such, the name is often shortened to “Fusion Control” or “Enterprise Manager.” Reference to either abbreviated title in the context of Oracle BI should ensure fellow Oracle BI teammates understand what you mean. Security It would be difficult to discuss the overall architecture of Oracle BI without at least giving some mention to how the basics of security, authentication, and authorization are applied. By default, installing Oracle WebLogic Server provides a default Lightweight Directory Access Protocol (LDAP) server, referred to as the WebLogic Server Embedded LDAP server. This is a standards-compliant LDAP system, which acts as the default authentication method for out-of-the-box Oracle BI.
Integration of secondary LDAP providers, such as Oracle Internet Directory (OID) or Microsoft Active Directory (MSAD), is crucial to leveraging most organizations' identity-management systems. The combination of multiple authentication providers is possible; in fact, it is commonplace. For example, a configuration may allow users that exist in both the Embedded LDAP server and MSAD to authenticate and have access to Oracle BI. Potentially, another set of users may be stored in a relational database repository, or a set of relational database tables may control the authorization that users have in relation to the Oracle BI system. WebLogic Server provides configuration opportunities for each of these scenarios. Oracle BI security incorporates the Fusion Middleware security model, Oracle Platform Security Services (OPSS). This has a positive influence over managing all aspects of Oracle BI, as it provides a very granular level of authorization and a large number of authentication and authorization-integration mechanisms. OPSS also introduces to Oracle BI the concept of managing privileges by application role instead of directly by user or group. It abides by open standards to integrate with security mechanisms that are growing in popularity, such as the Security Assertion Markup Language (SAML) 2.0. Other well-known single-sign-on mechanisms such as SiteMinder and Oracle Access Manager already have pre-configured integration points within Oracle BI Fusion Control. Oracle BI 12c and Oracle BI 11g security is managed differently from the legacy Oracle BI 10g versions. Oracle BI 12c no longer has backward compatibility with legacy Oracle BI 10g security, and the focus should be on following the new security configuration best practices of Oracle BI 12c: An Oracle BI best practice is to manage security by Application Roles. Understanding the differences between the Identity Store, Credential Store, and Policy Store is critical for advanced security configuration and maintenance. As of Oracle BI 12c, the OPSS metadata is now stored in a relational repository, which is installed as part of the RCU-schemas installation process that takes place prior to executing the Oracle BI 12c installation on the application server. Managing by Application Roles In Oracle BI 11g, the default security model is the Oracle Fusion Middleware security model, which has a very broad scope. A universal Information Technology security-administration best practice is to set permissions or privileges for a specific point of access on a group, and not on individual users. The same idea applies here, except that there is another enterprise level of user, and even group, aggregation, called an Application Role. Application Roles can contain other application roles, groups, or individual users. Access privileges to a certain object, such as a folder, web page, or column, should always be assigned to an application role. Application roles for Oracle BI can be managed in the Oracle Enterprise Manager Fusion Middleware Control interface. They can also be scripted using the WLST command-line interface. Security Providers Fusion Middleware security can seem complex at first, but knowing the correct terminology and understanding how the most important components communicate with each other and with the application at large is extremely important for security management.
Oracle BI uses three main repositories for accessing authentication and authorization information, all of which are explained in the following sections. Identity Store The Identity Store is the authentication provider, which may also provide authorization metadata. A simple mnemonic here is that this store tells Oracle BI how to “identify” any users attempting to access the system. An example of creating an Identity Store would be to configure an LDAP system such as Oracle Internet Directory or Microsoft Active Directory to reference users within an organization. These LDAP configurations are referred to as Authentication Providers. Credential Store The credential store is ultimately for advanced Oracle configurations. You may touch upon this when establishing an enterprise Oracle BI deployment, but not much thereafter, unless you are integrating the Oracle BI Action Framework or something equally complex. Ultimately, the credential store does exactly what its name implies – it stores credentials. Specifically, it is used to store the credentials of other applications, which the core application (that is, Oracle BI) may access at a later time without having to re-enter said credentials. An example of this would be integrating Oracle BI with the Oracle Enterprise Performance Management (EPM) suite. In this example, let's pretend there is an internal requirement at Company XYZ for users to access an Oracle BI dashboard. Upon viewing said dashboard, if a report with discrepancies is viewed, the user requires the ability to click on a link, which opens an Oracle EPM Financial Report containing more details about the concern. If not all users accessing the Oracle BI dashboard have credentials to access the Oracle EPM environment directly, how could they open and view the report without being prompted for credentials? The answer is that the credential store would be configured with the credentials of a central user having access to the Oracle EPM environment. This central user's credentials (encrypted, of course) are passed along with the dashboard viewer's request and presto, access! Policy Store The policy store is unique to Fusion Middleware security and leverages a security standard referred to as XACML, which ultimately provides granular access and privilege control for an enterprise application. This is one of the reasons why managing by Application Roles becomes so important. It is the individual Application Roles that are assigned the policies defining access to information within Oracle BI. Stated another way, application privileges, such as the ability to administer the Oracle BI RPD, are assigned to a particular application role, and these associations are defined in the policy store. The following figure shows how each area of security management is controlled: These three types of security providers within Oracle Fusion Middleware are integral to the Oracle BI architecture. Further recommended research on this topic would be to look at Oracle Fusion Middleware Security, OPSS, and the Application Development Framework (ADF). System Requirements The first thing to recognize with infrastructure requirements prior to deploying Oracle BI 12c is that its memory and processor requirements have increased since previous versions. The Java application server, WebLogic Server, installs with the full version of its software (though under a limited/restricted license, as already discussed). A multitude of additional Java libraries and applications are also deployed.
Be prepared for a recommended minimum of 8 to 16 GB of Random Access Memory (RAM) for an enterprise deployment, and a minimum of 6 to 8 GB of RAM for a workstation deployment. Client tools Oracle BI 12c has a separate client tools installation that requires Microsoft Windows XP or a more recent version of the Windows Operating System (OS). The Oracle BI 12c client tools provide the majority of client-to-server management capabilities required for normal day-to-day maintenance of the Oracle BI repository and related artefacts. The client-tools installation is usually reserved for Oracle BI developers who architect and maintain the Oracle BI metadata repository, better known as the RPD, a name which stems from its binary file extension (.rpd). The Oracle BI 12c client-tools installation provides each workstation with the Administration tool, Job Manager, and all command-line Application Programming Interface (API) executables. In Oracle BI 12c, a 64-bit Windows OS is a requirement for installing the Oracle BI Development Client tools. It has been observed that, with some initial releases of the Oracle BI 12c client tools, ODBC DSN connectivity does not work in Windows Server 2012. Therefore, utilizing Windows Server 2012 as a development environment will be ineffective if attempting to open the Administration Tool and connect to the OBIEE Server in online mode. Multi-User Development Environment One of the key features when developing with Oracle BI is the ability for multiple metadata developers to develop simultaneously. Although the use of the term “simultaneously” can vary among technical communities, the use of concurrent development within the Oracle BI suite requires Oracle BI's Multi-User Development Environment (MUD) configuration, or some other process developed by third-party Oracle partners. The MUD configuration itself is fairly straightforward and ultimately relies on the Oracle BI administrator’s ability to divide metadata modeling responsibilities into projects. Projects – which are usually defined and delineated by logical fact table definitions – can be assigned to one or more metadata developers. In previous versions of Oracle BI, a metadata developer could install the entire Oracle BI product suite on an up-to-date laptop or commodity desktop workstation and successfully develop, test, and deploy an Oracle BI metadata model. The system requirements of Oracle BI 12c require a significant amount of processor and RAM capacity in order to perform development efforts on a standard-issue workstation or laptop. If an organization currently leverages the Oracle BI multi-user development environment, or plans to with the current release, this raises a couple of questions: How do we get our developers the best environment suitable for developing our metadata? Do we need to procure new hardware? Microsoft Windows is a requirement for the Oracle BI client tools. However, the Oracle BI client tools do not include the server component of the Oracle BI environment. They only allow for connecting from the developer's workstation to the Oracle BI server instance. In a multi-user development environment, this poses a serious problem, as only one metadata repository (RPD) can exist on any one Oracle BI server instance at any given time.
If two developers are working from their respective workstations at the same time and wish to see their latest modifications published in a rapid application development (RAD) cycle, this type of iterative effort fails, as one developer's published changes will overwrite the other’s in real time. To resolve the issue, there are two recommended solutions. The first is an obvious localized solution. This solution merely upgrades the Oracle BI developers’ workstations or laptops to comply with the minimum requirements for installing the full Oracle BI environment on said machines. This upgrade should be both memory- (RAM) and processor- (MHz) centric. 16 GB+ RAM and a dual-core processor are recommended. A 64-bit operating system kernel is required. Without an upgraded workstation from which to work, Oracle BI metadata developers will sit at a disadvantage for general iterative metadata development, and will especially be disenfranchised if interfacing within a multi-user development environment. The second solution is one that takes advantage of virtual machine (VM) capacity within the organization. Virtual machines have become a staple within most Information Technology departments, as they are versatile and allow for speedy provisioning of server environments. For this scenario, it is recommended to create a virtual-machine template of an Oracle BI environment from which to duplicate and “stand up” individual virtual machine images for each metadata developer on the Oracle BI development team. This effectively provides each metadata developer with their own Oracle BI development environment server, which contains the fully deployed Oracle BI server environment. Each developer then has the ability to develop and test iteratively by connecting to their assigned virtual server, without fear that their efforts will conflict with another developer's. The following figure illustrates how an Oracle BI MUD environment can leverage either upgraded developer-workstation hardware or VM images to facilitate development: This article does not cover the installation, configuration, or best practices for developing in a MUD environment. However, the Oracle BI development team deserves a lot of credit for documenting these processes in unprecedented detail. The Oracle BI 11g MUD documentation provides a case study, which conveys best practices for managing a complex Oracle BI development lifecycle. When ready to deploy a MUD environment, it is highly recommended to peruse this documentation first. The information in this section merely seeks to convey best practice in setting up a developer’s workstation when using MUD. Certifications matrix Oracle BI 12c largely complies with the overall Fusion Middleware infrastructure. This common foundation allows for a centralized model to communicate with operating systems, web servers, and other ancillary components that are compliant. Oracle does a good job of updating a certification matrix for each Fusion Middleware application suite per respective product release. The certification matrix for Oracle BI 12c can be found on the Oracle website at the following locations: http://www.oracle.com/technetwork/middleware/fusion-middleware/documentation/fmw-1221certmatrix-2739738.xlsx and http://www.oracle.com/technetwork/middleware/ias/downloads/fusion-certification-100350.html. The certification matrix document is usually provided in Microsoft Excel format and should be referenced before any project or deployment of Oracle BI begins.
This will ensure that infrastructure components such as the selected operating system, web server, web browsers, LDAP server, and so on, will actually work when integrated with the product suite. Scaling out Oracle BI 12c There are several reasons why an organization may wish to expand its Oracle BI footprint. These can range anywhere from requiring a highly available environment to achieving high levels of concurrent usage over time. The number of total end users, the number of concurrent end users, the volume of queries, the size of the underlying data warehouse, and cross-network latency are even more factors to consider. Scaling out an environment has the potential to solve performance issues and stabilize the environment. When scoping out the infrastructure for an Oracle BI deployment, there are several crucial decisions to be made. These decisions can be greatly assisted by preparing properly, using Oracle's recommended guides for clustering and deploying Oracle BI on an enterprise scale. Pre-Configuration Run-Down Configuring the Oracle BI product suite, specifically when it involves scaling out or setting up high availability (HA), takes preparation. Proactively taking steps to understand what it takes to correctly establish or pre-configure the infrastructure required to support any level of fault tolerance and high availability is critical. Even if the decision to scale out from the initial Oracle BI deployment hasn't been made, proper planning is recommended if the potential exists. Shared Storage We would be remiss not to highlight one of the most important concepts of scaling out Oracle BI, specifically for high availability: shared storage. The idea of shared storage is that, in a fault-tolerant environment, there are binary files and other configuration metadata that need to be shared across the nodes. If these common elements were not shared and one node were to fail, there would be a potential loss of data. Most important is that, in a highly available Oracle BI environment, there can be only one WebLogic Administration Server running for that environment at any one time. An HA configuration makes one Administration Server active while the other is passive. If the appropriate pre-configuration steps for shared storage (as well as other items in the high-availability guide) are not properly completed, one should not expect accurate results from the environment. OBIEE 12c requires you to modify the Singleton Data Directory (SDD) for your Oracle BI configuration, found at ORACLE_HOME/user_projects/domains/bi/data, so that the files within that path are moved to a shared storage location that is mounted to the scaled-out servers on which an HA configuration will be implemented. To change this, one needs to modify the ORACLE_HOME/user_projects/domains/bi/config/fmwconfig/bienv/core/bi-environment.xml file to set the path of the bi:singleton-data-directory element to the full path of the shared mounted file location that contains a copy of the bidata folder, which will be referenced by one or more scaled-out HA Oracle BI 12c servers.
For example, change the XML file element: <bi:singleton-data-directory>/oraclehome/user_projects/domains/bi/bidata/</bi:singleton-data-directory> to reflect a shared NAS or SAN mount whose folder names and structure are in line with the IT team’s standard naming conventions, where the /bidata folder is the folder from the main Oracle BI 12c instance that gets copied to the shared directory: <bi:singleton-data-directory>/mount02/obiee_shared_settings/bidata/</bi:singleton-data-directory> Clustering A major benefit of Oracle BI's ability to leverage WebLogic Server as the Java application server tier is that, per the default installation, Oracle BI gets established in a clustered architecture. There is no additional configuration necessary to set this architecture in motion. Clearly, installing Oracle BI on a single server only provides a single server with which to interface; however, upon doing so, Oracle BI is installed into a single-node clustered-application-server environment. Additional clustered nodes of Oracle BI can then be configured to establish and expand the cluster, either horizontally or vertically. Vertical vs Horizontal With respect to the enterprise architecture and infrastructure of the Oracle BI environment, a clustered environment can be expanded in one of two ways: horizontally (scale-out) and vertically (scale-up). A horizontal expansion is the typical expansion type when clustering. It is represented by installing and configuring the application on a separate physical server, with reference to the main server application. A vertical expansion is usually represented by expanding the application on the same physical server on which the main server application resides. A horizontally expanded system can then, additionally, be vertically expanded. There are benefits to both scaling options. The decision to scale the system one way or the other is usually predicated on the cost of additional physical servers, server limitations, components such as memory or processors, or an increase in usage activity by end users. Some considerations that may be used to assess which approach is best for your specific implementation are as follows: load-balancing capabilities and the need for an Active-Active versus Active-Passive architecture; the need for failover or high availability; the cost of processor and memory enhancements versus the cost of new servers; the anticipated increase in concurrent user queries; and any realized decrease in performance due to an increase in user activity. Oracle BI Server (System Component) Cluster Controller When discussing scaling out the Oracle BI Server cluster, it is a common mistake to confuse WebLogic Server application clustering with the Oracle BI Server Cluster Controller. Currently, Oracle BI can only have a single metadata repository (RPD) reference associated with an Oracle BI Server deployment instance at any single point in time. Because of this, the Oracle BI Server engine leverages a failover concept to ensure that some level of high availability exists when the environment is scaled. In an Oracle BI scaled-out clustered environment, a secondary node, which has an instance of Oracle BI installed, will contain a secondary Oracle BI Server engine. From the main Oracle BI Managed Server containing the primary Oracle BI Server instance, the secondary Oracle BI Server instance gets established as the failover server engine using the Oracle BI Server Cluster Controller. This configuration takes place in the Enterprise Manager Fusion Control console.
Based on this configuration, the scaled-out Oracle BI Server engine acts in an active-passive mode. That is to say, when the main Oracle BI Server engine instance fails, the secondary, or passive, Oracle BI Server engine becomes active to route requests and field queries. Summary This article serves as a useful introduction to WebLogic Server and its role at the core of Oracle BI 12c. Resources for Article: Further resources on this subject: Oracle 12c SQL and PL/SQL New Features [article] Schema Validation with Oracle JDeveloper - XDK 11g [article] Creating external tables in your Oracle 10g/11g Database [article]
Learn from Data

Packt
09 Mar 2017
6 min read
In this article by Rushdi Shams, the author of the book Java Data Science Cookbook, we will cover recipes that use machine learning techniques to learn patterns from data. These patterns are at the centre of attention for at least three key machine-learning tasks: classification, regression, and clustering. Classification is the task of predicting a value from a nominal class. In contrast to classification, regression models attempt to predict a value from a numeric class. (For more resources related to this topic, see here.) Generating linear regression models Most linear regression modelling follows a general pattern—there will be many independent variables that will collectively produce a result, which is the dependent variable. For instance, we can generate a regression model to predict the price of a house based on different attributes/features of the house (mostly numeric, real values), such as its size in square feet, the number of bedrooms, the number of washrooms, the importance of its location, and so on. In this recipe, we will use Weka’s Linear Regression classifier to generate a regression model. Getting ready In order to perform the recipes in this section, we will require the following: To download Weka, go to http://www.cs.waikato.ac.nz/ml/weka/downloading.html and you will find download options for Windows, Mac, and other operating systems such as Linux. Read through the options carefully and download the appropriate version. During the writing of this article, 3.9.0 was the latest developer version and, as the author already had a version 1.8 JVM installed on his 64-bit Windows machine, he chose to download a self-extracting executable for 64-bit Windows without a Java Virtual Machine (JVM). After the download is complete, double-click on the executable file and follow the on-screen instructions. You need to install the full version of Weka. Once the installation is done, do not run the software. Instead, go to the directory where you have installed it and find the Java Archive file for Weka (weka.jar). Add this file to your Eclipse project as an external library. If you need to download older versions of Weka for some reason, all of them can be found at https://sourceforge.net/projects/weka/files/. Please note that there is a possibility that many of the methods from old versions are deprecated and therefore no longer supported. How to do it… In this recipe, the linear regression model we will be creating is based on the cpu.arff dataset that can be found in the data directory of the Weka installation directory. Our code will have two instance variables: the first variable will contain the data instances of the cpu.arff file and the second variable will be our linear regression classifier. Instances cpu = null; LinearRegression lReg ; Next, we will be creating a method to load the ARFF file and assign the last attribute of the ARFF file as its class attribute. public void loadArff(String arffInput){ DataSource source = null; try { source = new DataSource(arffInput); cpu = source.getDataSet(); cpu.setClassIndex(cpu.numAttributes() - 1); } catch (Exception e1) { } } We will be creating a method to build the linear regression model. To do so, we simply need to call the buildClassifier() method of our linear regression variable. The model can be sent directly as a parameter to System.out.println().
public void buildRegression(){ lReg = new LinearRegression(); try { lReg.buildClassifier(cpu); } catch (Exception e) { } System.out.println(lReg); } The complete code for the recipe is as follows: import weka.classifiers.functions.LinearRegression; import weka.core.Instances; import weka.core.converters.ConverterUtils.DataSource; public class WekaLinearRegressionTest { Instances cpu = null; LinearRegression lReg ; public void loadArff(String arffInput){ DataSource source = null; try { source = new DataSource(arffInput); cpu = source.getDataSet(); cpu.setClassIndex(cpu.numAttributes() - 1); } catch (Exception e1) { } } public void buildRegression(){ lReg = new LinearRegression(); try { lReg.buildClassifier(cpu); } catch (Exception e) { } System.out.println(lReg); } public static void main(String[] args) throws Exception{ WekaLinearRegressionTest test = new WekaLinearRegressionTest(); test.loadArff("path to the cpu.arff file"); test.buildRegression(); } } The output of the code is as follows: Linear Regression Model class = 0.0491 * MYCT + 0.0152 * MMIN + 0.0056 * MMAX + 0.6298 * CACH + 1.4599 * CHMAX + -56.075 Generating logistic regression models Weka has a class named Logistic that can be used for building and using a multinomial logistic regression model with a ridge estimator. Although the original logistic regression algorithm does not deal with instance weights, the algorithm in Weka has been modified to handle them. In this recipe, we will use Weka to generate a logistic regression model on the iris dataset. How to do it… We will be generating a logistic regression model from the iris dataset that can be found in the data directory of the Weka installation folder. Our code will have two instance variables: one will contain the data instances of the iris dataset and the other will be the logistic regression classifier. Instances iris = null; Logistic logReg ; We will be using a method to load and read the dataset as well as assign its class attribute (the last attribute of the iris.arff file): public void loadArff(String arffInput){ DataSource source = null; try { source = new DataSource(arffInput); iris = source.getDataSet(); iris.setClassIndex(iris.numAttributes() - 1); } catch (Exception e1) { } } Next, we will be creating the most important method of our recipe, which builds a logistic regression classifier from the iris dataset: public void buildRegression(){ logReg = new Logistic(); try { logReg.buildClassifier(iris); } catch (Exception e) { } System.out.println(logReg); } The complete executable code for the recipe is as follows: import weka.classifiers.functions.Logistic; import weka.core.Instances; import weka.core.converters.ConverterUtils.DataSource; public class WekaLogisticRegressionTest { Instances iris = null; Logistic logReg ; public void loadArff(String arffInput){ DataSource source = null; try { source = new DataSource(arffInput); iris = source.getDataSet(); iris.setClassIndex(iris.numAttributes() - 1); } catch (Exception e1) { } } public void buildRegression(){ logReg = new Logistic(); try { logReg.buildClassifier(iris); } catch (Exception e) { } System.out.println(logReg); } public static void main(String[] args) throws Exception{ WekaLogisticRegressionTest test = new WekaLogisticRegressionTest(); test.loadArff("path to the iris.arff file"); test.buildRegression(); } } The output of the code is as follows: Logistic Regression with ridge parameter of 1.0E-8 Coefficients...
                   Class
Variable       Iris-setosa  Iris-versicolor
===============================================
sepallength        21.8065           2.4652
sepalwidth          4.5648           6.6809
petallength       -26.3083          -9.4293
petalwidth         -43.887         -18.2859
Intercept           8.1743           42.637

Odds Ratios...
                   Class
Variable       Iris-setosa  Iris-versicolor
===============================================
sepallength 2954196659.8892        11.7653
sepalwidth          96.0426        797.0304
petallength               0          0.0001
petalwidth                0               0

The interpretation of the results from the recipe is beyond the scope of this article. Interested readers are encouraged to see a Stack Overflow discussion here: http://stackoverflow.com/questions/19136213/how-to-interpret-weka-logistic-regression-output. Summary In this article, we have covered the recipes that use machine learning techniques to learn patterns from data. These patterns are at the centre of attention for at least three key machine-learning tasks: classification, regression, and clustering. Classification is the task of predicting a value from a nominal class. Resources for Article: Further resources on this subject: The Data Science Venn Diagram [article] Data Science with R [article] Data visualization [article]
Reading the Fine Manual

Packt
09 Mar 2017
34 min read
In this article by Simon Riggs, Gabriele Bartolini, Hannu Krosing, Gianni Ciolli, the authors of the book PostgreSQL Administration Cookbook - Third Edition, you will learn the following recipes: Reading The Fine Manual (RTFM) Planning a new database Changing parameters in your programs Finding the current configuration settings Which parameters are at nondefault settings? Updating the parameter file Setting parameters for particular groups of users The basic server configuration checklist Adding an external module to PostgreSQL Using an installed module Managing installed extensions (For more resources related to this topic, see here.) I get asked many questions about parameter settings in PostgreSQL. Everybody's busy and most people want a 5-minute tour of how things work. That's exactly what a Cookbook does, so we'll do our best. Some people believe that there are some magical parameter settings that will improve their performance, spending hours combing the pages of books to glean insights. Others feel comfortable because they have found some website somewhere that "explains everything", and they "know" they have their database configured OK. For the most part, the settings are easy to understand. Finding the best setting can be difficult, and the optimal setting may change over time in some cases. This article is mostly about knowing how, when, and where to change parameter settings. Reading The Fine Manual (RTFM) RTFM is often used rudely to mean "don't bother me, I'm busy", or it is used as a stronger form of abuse. The strange thing is that asking you to read a manual is most often very good advice. Don't flame the advisors back; take the advice! The most important point to remember is that you should refer to a manual whose release version matches that of the server on which you are operating. The PostgreSQL manual is very well-written and comprehensive in its coverage of specific topics. However, one of its main failings is that the "documents" aren't organized in a way that helps somebody who is trying to learn PostgreSQL. They are organized from the perspective of people checking specific technical points so that they can decide whether their difficulty is a user error or not. It sometimes answers "What?" but seldom "Why?" or "How?" I've helped write sections of the PostgreSQL documents, so I'm not embarrassed to steer you towards reading them. There are, nonetheless, many things to read here that are useful. How to do it… The main documents for each PostgreSQL release are available at http://www.postgresql.org/docs/manuals/. The most frequently accessed parts of the documents are as follows: SQL command reference, as well as client and server tools reference: http://www.postgresql.org/docs/current/interactive/reference.html Configuration: http://www.postgresql.org/docs/current/interactive/runtime-config.html Functions: http://www.postgresql.org/docs/current/interactive/functions.html You can also grab yourself a PDF version of the manual, which can allow easier searching in some cases. Don't print it! The documents are more than 2000 pages of A4-sized sheets. How it works… The PostgreSQL documents are written in SGML, which is similar to, but not the same as, XML. These files are then processed to generate HTML files, PDF, and so on. This ensures that all the formats have exactly the same content. Then, you can choose the format you prefer, and you can even compile it in other formats such as EPUB, INFO, and so on. 
Moreover, the PostgreSQL manual is actually a subset of the PostgreSQL source code, so it evolves together with the software. It is written by the same people who make PostgreSQL. Even more reasons to read it! There's more… More information is also available at http://wiki.postgresql.org. Many distributions offer packages that install static versions of the HTML documentation. For example, on Debian and Ubuntu, the docs for the most recent stable PostgreSQL version are named postgresql-9.6-docs (unsurprisingly). Planning a new database Planning a new database can be a daunting task. It's easy to get overwhelmed by it, so here, we present some planning ideas. It's also easy to charge headlong at the task as well, thinking that whatever you know is all you'll ever need to consider. Getting ready You are ready. Don't wait to be told what to do. If you haven't been told what the requirements are, then write down what you think they are, clearly labeling them as "assumptions" rather than "requirements"—we mustn't confuse the two things. Iterate until you get some agreement, and then build a prototype. How to do it… Write a document that covers the following items: Database design—plan your database design Calculate the initial database sizing Transaction analysis—how will we access the database? Look at the most frequent access paths What are the requirements for response times? Hardware configuration Initial performance thoughts—will all of the data fit into RAM? Choose the operating system and filesystem type How do we partition the disk? Localization plan Decide server encoding, locale, and time zone Access and security plan Identify client systems and specify required drivers Create roles according to a plan for access control Specify pg_hba.conf Maintenance plan—who will keep it working? How? Availability plan—consider the availability requirements checkpoint_timeout Plan your backup mechanism and test it High-availability plan Decide which form of replication you'll need, if any How it works… One of the most important reasons for planning your database ahead of time is that retrofitting some things is difficult. This is especially true of server encoding and locale, which can cause much downtime and exertion if we need to change them later. Security is also much more difficult to set up after the system is live. There's more… Planning always helps. You may know what you're doing, but others may not. Tell everybody what you're going to do before you do it to avoid wasting time. If you're not sure yet, then build a prototype to help you decide. Approach the administration framework as if it were a development task. Make a list of things you don't know yet, and work through them one by one. This is deliberately a very short recipe. Everybody has their own way of doing things, and it's very important not to be too prescriptive about how to do things. If you already have a plan, great! If you don't, think about what you need to do, make a checklist, and then do it. Changing parameters in your programs PostgreSQL allows you to set some parameter settings for each session or transaction. How to do it… You can change the value of a setting during your session, like this: SET work_mem = '16MB'; This value will then be used for every future transaction. 
You can also change it only for the duration of the "current transaction": SET LOCAL work_mem = '16MB'; The setting will last until you issue this command: RESET work_mem; Alternatively, you can issue the following command: RESET ALL; The SET and RESET commands are SQL commands that can be issued from any interface. They apply only to PostgreSQL server parameters, but this does not mean that they affect the entire server. In fact, the parameters you can change with SET and RESET apply only to the current session. Also, note that there may be other parameters, such as JDBC driver parameters, that cannot be set in this way. How it works… Suppose you change the value of a setting during your session, for example, by issuing this command: SET work_mem = '16MB'; Then, the following will show up in the pg_settings catalog view: postgres=# SELECT name, setting, reset_val, source FROM pg_settings WHERE source = 'session';
 name     | setting | reset_val | source
----------+---------+-----------+---------
 work_mem | 16384   | 1024      | session
Until you issue this command: RESET work_mem; After issuing it, the setting returns to reset_val and the source returns to default:
 name     | setting | reset_val | source
----------+---------+-----------+---------
 work_mem | 1024    | 1024      | default
There's more… You can change the value of a setting during your transaction as well, like this: SET LOCAL work_mem = '16MB'; Then, this will show up in the pg_settings catalog view: postgres=# SELECT name, setting, reset_val, source FROM pg_settings WHERE source = 'session';
 name     | setting | reset_val | source
----------+---------+-----------+---------
 work_mem | 1024    | 1024      | session
Huh? What happened to your parameter setting? The SET LOCAL command takes effect only for the transaction in which it was executed, which was just the SET LOCAL command in our case. We need to execute it inside a transaction block to be able to see the setting take hold, as follows: BEGIN; SET LOCAL work_mem = '16MB'; Here is what shows up in the pg_settings catalog view: postgres=# SELECT name, setting, reset_val, source FROM pg_settings WHERE source = 'session';
 name     | setting | reset_val | source
----------+---------+-----------+---------
 work_mem | 16384   | 1024      | session
You should also note that the value of source is session rather than transaction, as you might have been expecting. Finding the current configuration settings At some point, it will occur to you to ask, "What are the current configuration settings?" Most settings can be changed in more than one way, and some ways do not affect all users or all sessions, so it is quite possible to get confused. How to do it… Your first thought is probably to look in postgresql.conf, which is the configuration file, described in detail in the Updating the parameter file recipe. That works, but only as long as there is only one parameter file. If there are two, then maybe you're reading the wrong file! (How do you know?) So, the cautious and accurate way is not to trust a text file, but to trust the server itself. Moreover, you learned in the previous recipe, Changing parameters in your programs, that each parameter has a scope that determines when it can be set. Some parameters can be set through postgresql.conf, but others can be changed afterwards. So, the current value of configuration settings may have been subsequently changed.
We can use the SHOW command like this: postgres=# SHOW work_mem; Its output is as follows:
 work_mem
----------
 1MB
(1 row)
However, remember that it reports the current setting at the time it is run, and that can be changed in many places. Another way of finding the current settings is to access a PostgreSQL catalog view named pg_settings: postgres=# \x Expanded display is on. postgres=# SELECT * FROM pg_settings WHERE name = 'work_mem';
-[ RECORD 1 ]--------------------------------------------------------
name       | work_mem
setting    | 1024
unit       | kB
category   | Resource Usage / Memory
short_desc | Sets the maximum memory to be used for query workspaces.
extra_desc | This much memory can be used by each internal sort operation and hash table before switching to temporary disk files.
context    | user
vartype    | integer
source     | default
min_val    | 64
max_val    | 2147483647
enumvals   |
boot_val   | 1024
reset_val  | 1024
sourcefile |
sourceline |
Thus, you can use the SHOW command to retrieve the value for a setting, or you can access the full details via the catalog table. There's more… The actual location of each configuration file can be requested directly from the PostgreSQL server, as shown in this example: postgres=# SHOW config_file; This returns the following output:
 config_file
------------------------------------------
 /etc/postgresql/9.4/main/postgresql.conf
(1 row)
The other configuration files can be located by querying similar variables, hba_file and ident_file. How it works… Each parameter setting is cached within each session so that we can get fast access to the parameter settings. This allows us to access the parameter settings with ease. Remember that the values displayed are not necessarily the settings for the server as a whole. Many of those parameters will be specific to the current session. That's different from what you experience with many other database systems, and it is also very useful. Which parameters are at nondefault settings? Often, we need to check which parameters have been changed or whether our changes have correctly taken effect. In the previous two recipes, we have seen that parameters can be changed in several ways, and with different scope. You learned how to inspect the value of one parameter or get the full list of parameters. In this recipe, we will show you how to use SQL capabilities to list only those parameters whose value in the current session differs from the system-wide default value. This list is valuable for several reasons. First, it includes only a few of the 200-plus available parameters, so it is more immediate. Also, it is difficult to remember all our past actions, especially in the middle of a long or complicated session. Version 9.4 introduces the ALTER SYSTEM syntax, which we will describe in the next recipe, Updating the parameter file. From the viewpoint of this recipe, its behavior is quite different from all other setting-related commands; you run it from within your session and it changes the default value, but not the value in your session.
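As a brief, hedged illustration of that last point (work_mem and the 16MB value are arbitrary examples, not part of the original recipe), the following can be run from any session:
ALTER SYSTEM SET work_mem = '16MB';
SHOW work_mem;            -- the current session still reports its previous value
SELECT pg_reload_conf();  -- ask the server to reread its configuration files
Only after the reload does the new value become the server-wide default, and even then any value set explicitly in a session with SET takes precedence for that session.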
How to do it… We write a SQL query that lists all parameter values, excluding those whose current value is either the default or an internal override: postgres=# SELECT name, source, setting FROM pg_settings WHERE source != 'default' AND source != 'override' ORDER BY 2, 1; The output is as follows:
 name                       | source               | setting
----------------------------+----------------------+--------------------
 application_name           | client               | psql
 client_encoding            | client               | UTF8
 DateStyle                  | configuration file   | ISO, DMY
 default_text_search_config | configuration file   | pg_catalog.english
 dynamic_shared_memory_type | configuration file   | posix
 lc_messages                | configuration file   | en_GB.UTF-8
 lc_monetary                | configuration file   | en_GB.UTF-8
 lc_numeric                 | configuration file   | en_GB.UTF-8
 lc_time                    | configuration file   | en_GB.UTF-8
 log_timezone               | configuration file   | Europe/Rome
 max_connections            | configuration file   | 100
 port                       | configuration file   | 5460
 shared_buffers             | configuration file   | 16384
 TimeZone                   | configuration file   | Europe/Rome
 max_stack_depth            | environment variable | 2048
How it works… You can see from pg_settings which parameters have nondefault values and what the source of the current value is. The SHOW command doesn't tell you whether a parameter is set at a nondefault value. It just tells you the value, which isn't of much help if you're trying to understand what is set and why. If the source is a configuration file, then the sourcefile and sourceline columns are also set. These can be useful in understanding where the configuration came from. There's more… The setting column of pg_settings shows the current value, but you can also look at boot_val and reset_val. The boot_val parameter shows the value assigned when the PostgreSQL database cluster was initialized (initdb), while reset_val shows the value that the parameter will return to if you issue the RESET command. The max_stack_depth parameter is an exception because pg_settings says it is set by an environment variable, though it is actually set by ulimit -s on Linux and Unix systems. The max_stack_depth parameter just needs to be set directly on Windows. The time zone settings are also picked up from the OS environment, so you shouldn't need to set those directly. In older releases, pg_settings showed them as command-line settings. From version 9.1 onwards, they are written to postgresql.conf when the data directory is initialized, so they show up as configuration file settings. Updating the parameter file The parameter file is the main location for defining parameter values for the PostgreSQL server. All the parameters can be set in the parameter file, which is known as postgresql.conf. There are also two other parameter files: pg_hba.conf and pg_ident.conf. Both of these relate to connections and security. Getting ready First, locate postgresql.conf, as described earlier. How to do it… Some of the parameters take effect only when the server is first started. A typical example might be shared_buffers, which defines the size of the shared memory cache. Many of the parameters can be changed while the server is still running. After changing the required parameters, we issue a reload operation to the server, forcing PostgreSQL to reread the postgresql.conf file (and all other configuration files). There are a number of ways to do that, depending on your distribution and OS. The most common is to issue the following command: pg_ctl reload with the same OS user that runs the PostgreSQL server process.
This assumes the default data directory; otherwise you have to specify the correct data directory with the -D option. As noted earlier, Debian and Ubuntu have a different multiversion architecture, so you should issue the following command instead: pg_ctlcluster 9.6 main reload On modern distributions you can also use systemd, as follows: sudo systemctl reload postgresql@9.6-main Some other parameters require a restart of the server for changes to take effect, for instance, max_connections, listen_addresses, and so on. The syntax is very similar to a reload operation, as shown here: pg_ctl restart For Debian and Ubuntu, use this command: pg_ctlcluster 9.6 main restart and with systemd: sudo systemctl restart postgresql@9.6-main The postgresql.conf file is a normal text file that can be simply edited. Most of the parameters are listed in the file, so you can just search for them and then insert the desired value in the right place. How it works… If you set the same parameter twice in different parts of the file, the last setting is what applies. This can cause lots of confusion if you add settings to the bottom of the file, so you are advised against doing that. The best practice is to either leave the file as it is and edit the values, or to start with a blank file and include only the values that you wish to change. I personally prefer a file with only the nondefault values. That makes it easier to see what's happening. Whichever method you use, you are strongly advised to keep all the previous versions of your .conf files. You can do this by copying, or you can use a version control system such as Git or SVN. There's more… The postgresql.conf file also supports an include directive. This allows the postgresql.conf file to reference other files, which can then reference other files, and so on. That may help you organize your parameter settings better, if you don't make it too complicated. If you are working with PostgreSQL version 9.4 or later, you can change the values stored in the parameter files directly from your session, with syntax such as the following: ALTER SYSTEM SET shared_buffers = '1GB'; This command will not actually edit postgresql.conf. Instead, it writes the new setting to another file named postgresql.auto.conf. The effect is equivalent, albeit in a safer way. The original configuration file is never overwritten, so it cannot be damaged in the event of a crash. If you make a mess with too many ALTER SYSTEM commands, you can always delete postgresql.auto.conf manually and reload the configuration, or restart PostgreSQL, depending on what parameters you had changed. Setting parameters for particular groups of users PostgreSQL supports a variety of ways of defining parameter settings for various user groups. This is very convenient, especially for managing user groups that have different requirements. How to do it… For all users in the saas database, use the following command: ALTER DATABASE saas SET configuration_parameter = value1; For a user named simon connected to any database, use this: ALTER ROLE Simon SET configuration_parameter = value2; Alternatively, you can set a parameter for a user only when connected to a specific database, as follows: ALTER ROLE Simon IN DATABASE saas SET configuration_parameter = value3; The user won't know that these have been executed specifically for them. These are default settings, and in most cases, they can be overridden if the user requires nondefault values.
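If you later want to verify which per-database and per-role defaults are currently stored, one option (a sketch that is not part of the original recipe) is to query the pg_db_role_setting catalog, which records the settings created by the ALTER DATABASE ... SET and ALTER ROLE ... SET commands shown above:
SELECT d.datname AS database, r.rolname AS role, s.setconfig AS settings
FROM pg_db_role_setting AS s
LEFT JOIN pg_database AS d ON d.oid = s.setdatabase
LEFT JOIN pg_roles AS r ON r.oid = s.setrole;
-- A NULL database or role means the entry applies to all databases or all roles, respectively.
In psql, the \drds meta-command displays the same information in a more compact form.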
How it works… You can set parameters for each of the following: Database User (which is named role by PostgreSQL) Database/user combination Each of the parameter defaults is overridden by the one below it. In the preceding three SQL statements: If user hannu connects to the saas database, then value1 will apply If user simon connects to a database other than saas, then value2 will apply If user simon connects to the saas database, then value3 will apply PostgreSQL implements this in exactly the same way as if the user had manually issued the equivalent SET statements immediately after connecting. The basic server configuration checklist PostgreSQL arrives configured for use on a shared system, though many people want to run dedicated database systems. The PostgreSQL project wishes to ensure that PostgreSQL will play nicely with other server software, and will not assume that it has access to the full server resources. If you, as the system administrator, know that there is no other important server software running on this system, then you can crank up the values much higher. Getting ready Before we start, we need to know two sets of information: We need to know the size of the physical RAM that will be dedicated to PostgreSQL We need to know something about the types of applications for which we will use PostgreSQL How to do it… If your database is larger than 32 MB, then you'll probably benefit from increasing shared_buffers. You can increase this to much larger values, but remember that running out of memory induces many problems. For instance, PostgreSQL is able to store information to the disk when the available memory is too small, and it employs sophisticated algorithms to treat each case differently and to place each piece of data either in the disk or in the memory, depending on each use case. On the other hand, overstating the amount of available memory confuses such abilities and results in suboptimal behavior. For instance, if the memory is swapped to disk, then PostgreSQL will inefficiently treat all data as if it were the RAM. Another unfortunate circumstance is when the Linux Out-Of-Memory (OOM) killer terminates one of the various processes spawned by the PostgreSQL server. So, it's better to be conservative. It is good practice to set a low value in your postgresql.conf and increment slowly to ensure that you get the benefits from each change. If you increase shared_buffers and you're running on a non-Windows server, you will almost certainly need to increase the value of the SHMMAX OS parameter (and on some platforms, other parameters as well). On Linux, Mac OS, and FreeBSD, you will need to either edit the /etc/sysctl.conf file or use sysctl -w with the following values: For Linux, use kernel.shmmax=value For Mac OS, use kern.sysv.shmmax=value For FreeBSD, use kern.ipc.shmmax=value There's more… For more information, you can refer to http://www.postgresql.org/docs/9.6/static/kernel-resources.html#SYSVIPC. For example, on Linux, add the following line to /etc/sysctl.conf: kernel.shmmax=value Don't worry about setting effective_cache_size. It is much less important a parameter than you might think. There is no need for too much fuss selecting the value. If you're doing heavy write activity, then you may want to set wal_buffers to a much higher value than the default. 
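Before editing kernel settings, it is worth checking what you currently have. A quick, illustrative sequence on Linux might look like the following (the commands are standard, but treat any values shown or set here as examples rather than recommendations, and remember that ALTER SYSTEM requires a superuser):

    # what PostgreSQL is currently configured to use
    psql -c "SHOW shared_buffers;"

    # raise it moderately, then restart, since shared_buffers only changes at server start
    psql -c "ALTER SYSTEM SET shared_buffers = '512MB';"
    pg_ctl restart

    # check the kernel's shared memory segment limit referred to above
    sysctl kernel.shmmax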
In fact wal_buffers is set automatically from the value of shared_buffers, following a rule that fits most cases; however, it is always possible to specify an explicit value that overrides the computation for the very few cases where the rule is not good enough. If you're doing heavy write activity and/or large data loads, you may want to set checkpoint_segments higher than the default to avoid wasting I/O in excessively frequent checkpoints. If your database has many large queries, you may wish to set work_mem to a value higher than the default. However, remember that such a limit applies separately to each node in the query plan, so there is a real risk of overallocating memory, with all the problems discussed earlier. Ensure that autovacuum is turned on, unless you have a very good reason to turn it off—most people don't. Leave the settings as they are for now. Don't fuss too much about getting the settings right. You can change most of them later, so you can take an iterative approach to improving things.  And remember, don't touch the fsync parameter. It's keeping you safe. Adding an external module to PostgreSQL Another strength of PostgreSQL is its extensibility. Extensibility was one of the original design goals, going back to the late 1980s. Now, in PostgreSQL 9.6, there are many additional modules that plug into the core PostgreSQL server. There are many kinds of additional module offerings, such as the following: Additional functions Additional data types Additional operators Additional indexes Getting ready First, you'll need to select an appropriate module to install. The walk towards a complete, automated package management system for PostgreSQL is not over yet, so you need to look in more than one place for the available modules, such as the following: Contrib: The PostgreSQL "core" includes many functions. There is also an official section for add-in modules, known as "contrib" modules. They are always available for your database server, but are not automatically enabled in every database because not all users might need them. On PostgreSQL version 9.6, we will have more than 40 such modules. These are documented at http://www.postgresql.org/docs/9.6/static/contrib.html. PGXN: This is the PostgreSQL Extension Network, a central distribution system dedicated to sharing PostgreSQL extensions. The website started in 2010, as a repository dedicated to the sharing of extension files. Separate projects: These are large external projects, such as PostGIS, offering extensive and complex PostgreSQL modules. For more information, take a look at http://www.postgis.org/. How to do it… There are several ways to make additional modules available for your database server, as follows: Using a software installer Installing from PGXN Installing from a manually downloaded package Installing from source code Often, a particular module will be available in more than one way, and users are free to choose their favorite, exactly like PostgreSQL itself, which can be downloaded and installed through many different procedures. Installing modules using a software installer Certain modules are available exactly like any other software packages that you may want to install in your server. All main Linux distributions provide packages for the most popular modules, such as PostGIS, SkyTools, procedural languages other than those distributed with core, and so on. 
In some cases, modules can be added during installation if you're using a standalone installer application, for example, the OneClick installer, or tools such as rpm, apt-get, and YaST on Linux distributions. The same procedure can also be followed after the PostgreSQL installation, when the need for a certain module arrives. We will actually describe this case, which is way more common. For example, let's say that you need to manage a collection of Debian package files, and that one of your tasks is to be able to pick the latest version of one of them. You start by building a database that records all the package files. Clearly, you need to store the version number of each package. However, Debian version numbers are much more complex than what we usually call "numbers". For instance, on my Debian laptop, I currently have version 9.2.18-1.pgdg80+1 of the PostgreSQL client package. Despite being complicated, that string follows a clearly defined specification, which includes many bits of information, including how to compare two versions to establish which of them is older. Since this recipe discussed extending PostgreSQL with custom data types and operators, you might have already guessed that I will now consider a custom data type for Debian version numbers that is capable of tasks such as understanding the Debian version number format, sorting version numbers, choosing the latest version number in a given group, and so on. It turns out that somebody else already did all the work of creating the required PostgreSQL data type, endowed with all the useful accessories: comparison operators, input/output functions, support for indexes, and maximum/minimum aggregates. All of this has been packaged as a PostgreSQL extension, as well as a Debian package (not a big surprise), so it is just a matter of installing the postgresql-9.2-debversion package with a Debian tool such as apt-get, aptitude, or synaptic. On my laptop, that boils down to the command line: apt-get install postgresql-9.2-debversion This will download the required package and unpack all the files in the right locations, making them available to my PostgreSQL server. Installing modules from PGXN The PostgreSQL Extension Network, PGXN for short, is a website (http://pgxn.org) launched in late 2010 with the purpose of providing "a central distribution system for open source PostgreSQL extension libraries". Anybody can register and upload their own module, packaged as an extension archive. The website allows browsing available extensions and their versions, either via a search interface or from a directory of package and user names. The simple way is to use a command-line utility, called pgxnclient. It can be easily installed in most systems; see the PGXN website on how to do so. Its purpose is to interact with PGXN and take care of administrative tasks, such as browsing available extensions, downloading the package, compiling the source code, installing files in the proper place, and removing installed package files. Alternatively, you can download the extension files from the website and place them in the right place by following the installation instructions. PGXN is different from official repositories because it serves another purpose. Official repositories usually contain only seasoned extensions because they accept new software only after a certain amount of evaluation and testing. 
On the other hand, anybody can ask for a PGXN account and upload their own extensions, so there is no filter except requiring that the extension has an open source license and a few files that any extension must have. Installing modules from a manually downloaded package You might have to install a module that is correctly packaged for your system but is not available from the official package archives. For instance, it could be the case that the module has not been accepted in the official repository yet, or you could have repackaged a bespoke version of that module with some custom tweaks, which are so specific that they will never become official. Whatever the case, you will have to follow the installation procedure for standalone packages specific to your system. Here is an example with the Oracle compatibility module, described at http://postgres.cz/wiki/Oracle_functionality_(en): First, we get the package, say for PostgreSQL 8.4 on a 64-bit architecture, from http://pgfoundry.org/frs/download.php/2414/orafce-3.0.1-1.pg84.rhel5.x86_64.rpm. Then, we install the package in the standard way: rpm -ivh orafce-3.0.1-1.pg84.rhel5.x86_64.rpm If all the dependencies are met, we are done. I mentioned dependencies because that's one more potential problem in installing packages that are not officially part of the installed distribution—you can no longer assume that all software version numbers have been tested, all requirements are available, and there are no conflicts. If you get error messages that indicate problems in these areas, you may have to solve them yourself, by manually installing missing packages and/or uninstalling conflicting packages. Installing modules from source code In many cases, useful modules may not have full packaging. In these cases, you may need to install the module manually. This isn't very hard and it's a useful exercise that helps you understand what happens. Each module will have different installation requirements. There are generally two aspects of installing a module. They are as follows: Building the libraries (only for modules that have libraries) Installing the module files in the appropriate locations You need to follow the instructions for the specific module in order to build the libraries, if any are required. Installation will then be straightforward, and usually there will be a suitably prepared configuration file for the make utility so that you just need to type the following command: make install Each file will be copied to the right directory. Remember that you normally need to be a system superuser in order to install files on system directories. Once a library file is in the directory expected by the PostgreSQL server, it will be loaded automatically as soon as requested by a function. Modules such as auto_explain do not provide any additional user-defined function, so they won't be auto-loaded; that needs to be done manually by a superuser with a LOAD statement. How it works… PostgreSQL can dynamically load libraries in the following ways: Using the explicit LOAD command in a session Using the shared_preload_libraries parameter in postgresql.conf at server start At session start, using the local_preload_libraries parameter for a specific user, as set using ALTER ROLE PostgreSQL functions and objects can reference code in these libraries, allowing extensions to be bound tightly to the running server process. 
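As a short, hedged illustration of the first two loading methods, here is how the auto_explain module mentioned just above can be loaded; the threshold value is only an example:

    -- load the library for the current session only (requires a superuser)
    LOAD 'auto_explain';
    SET auto_explain.log_min_duration = '250ms';

    -- or load it into every backend by adding this line to postgresql.conf
    -- and restarting the server:
    --   shared_preload_libraries = 'auto_explain'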
The tight binding makes this method suitable for use even in very high-performance applications, and there's no significant difference between additionally supplied features and native features. Using an installed module In this recipe, we will explain how to enable an installed module so that it can be used in a particular database. The additional types, functions, and so on will exist only in those databases where we have carried out this step. Although most modules require this procedure, there are actually a couple of notable exceptions. For instance, the auto_explain module mentioned earlier, which is shipped together with PostgreSQL, does not create any function, type or operator. To use it, you must load its object file using the LOAD command. From that moment, all statements longer than a configurable threshold will be logged together with their execution plan. In the rest of this recipe, we will cover all the other modules. They do not require a LOAD statement because PostgreSQL can automatically load the relevant libraries when they are required. As mentioned in the previous recipe, Adding an external module to PostgreSQL, specially packaged modules are called extensions in PostgreSQL. They can be managed with dedicated SQL commands.  Getting ready Suppose you have chosen to install a certain module among those available for your system (see the previous recipe, Adding an external module to PostgreSQL); all you need to know is the extension name.  How to do it…  Each extension has a unique name, so it is just a matter of issuing the following command: CREATE EXTENSION myextname; This will automatically create all the required objects inside the current database. For security reasons, you need to do so as a database superuser. For instance, if you want to install the dblink extension, type this: CREATE EXTENSION dblink; How it works… When you issue a CREATE EXTENSION command, the database server looks for a file named EXTNAME.control in the SHAREDIR/extension directory. That file tells PostgreSQL some properties of the extension, including a description, some installation information, and the default version number of the extension (which is unrelated to the PostgreSQL version number). Then, a creation script is executed in a single transaction, so if it fails, the database is unchanged. The database server also notes in a catalog table the extension name and all the objects that belong to it. Managing installed extensions In the last two recipes, we showed you how to install external modules in PostgreSQL to augment its capabilities. In this recipe, we will show you some more capabilities offered by the extension infrastructure. How to do it… First, we list all available extensions: postgres=# x on Expanded display is on. postgres=# SELECT * postgres-# FROM pg_available_extensions postgres-# ORDER BY name; -[ RECORD 1 ]-----+-------------------------------------------------- name | adminpack default_version | 1.0 installed_version | comment | administrative functions for PostgreSQL -[ RECORD 2 ]-----+-------------------------------------------------- name | autoinc default_version | 1.0 installed_version | comment | functions for autoincrementing fields (...) 
In particular, if the dblink extension is installed, then we see a record like this:

-[ RECORD 10 ]----+--------------------------------------------------
name              | dblink
default_version   | 1.0
installed_version | 1.0
comment           | connect to other PostgreSQL databases from within a database

Now, we can list all the objects in the dblink extension, as follows:

postgres=# \x off
Expanded display is off.
postgres=# \dx+ dblink
Objects in extension "dblink"
           Object Description
---------------------------------------------------------------------
function dblink_build_sql_delete(text,int2vector,integer,text[])
function dblink_build_sql_insert(text,int2vector,integer,text[],text[])
function dblink_build_sql_update(text,int2vector,integer,text[],text[])
function dblink_cancel_query(text)
function dblink_close(text)
function dblink_close(text,boolean)
function dblink_close(text,text)
(...)

Objects created as parts of extensions are not special in any way, except that you can't drop them individually. This is done to protect you from mistakes:

postgres=# DROP FUNCTION dblink_close(text);
ERROR: cannot drop function dblink_close(text) because extension dblink requires it
HINT: You can drop extension dblink instead.

Extensions might have dependencies too. The cube and earthdistance contrib extensions provide a good example, since the latter depends on the former:

postgres=# CREATE EXTENSION earthdistance;
ERROR: required extension "cube" is not installed
postgres=# CREATE EXTENSION cube;
CREATE EXTENSION
postgres=# CREATE EXTENSION earthdistance;
CREATE EXTENSION

As you can reasonably expect, dependencies are taken into account when dropping objects, just as they are for other objects:

postgres=# DROP EXTENSION cube;
ERROR: cannot drop extension cube because other objects depend on it
DETAIL: extension earthdistance depends on extension cube
HINT: Use DROP ... CASCADE to drop the dependent objects too.
postgres=# DROP EXTENSION cube CASCADE;
NOTICE: drop cascades to extension earthdistance
DROP EXTENSION

How it works… The pg_available_extensions system view shows one row for each extension control file in the SHAREDIR/extension directory (see the Using an installed module recipe). The pg_extension catalog table records only the extensions that have actually been created. The psql command-line utility provides the \dx meta-command to examine extensions. It supports an optional plus sign (+) to control verbosity and an optional pattern for the extension name to restrict its range. Consider the following command:

\dx+ db*

This will list all extensions whose name starts with db, together with all their objects. The CREATE EXTENSION command creates all objects belonging to a given extension, and then records the dependency of each object on the extension in pg_depend. That's how PostgreSQL can ensure that you cannot drop one such object without dropping its extension. The extension control file admits an optional line, requires, that names one or more extensions on which the current one depends. The implementation of dependencies is still quite simple. For instance, there is no way to specify a dependency on a specific version number of other extensions, and there is no command that installs one extension and all its prerequisites (though, as we will see in a moment, version 9.6 adds a CASCADE option to CREATE EXTENSION for exactly this purpose). As a general PostgreSQL rule, the CASCADE keyword tells the DROP command to delete all the objects that depend on cube, the earthdistance extension in this example. There's more… Another system view, pg_available_extension_versions, shows all the versions available for each extension.
It can be valuable when there are multiple versions of the same extension available at the same time, for example, when making preparations for an extension upgrade. When a more recent version of an already installed extension becomes available to the database server, for instance because of a distribution upgrade that installs updated package files, the superuser can perform an upgrade by issuing the following command:

ALTER EXTENSION myext UPDATE TO '1.1';

This assumes that the author of the extension taught it how to perform the upgrade. Starting from version 9.6, the CASCADE option is also accepted by the CREATE EXTENSION syntax, with the meaning of "issue CREATE EXTENSION recursively to cover all dependencies". So, instead of creating extension cube before creating extension earthdistance, you could have just issued the following command:

postgres=# CREATE EXTENSION earthdistance CASCADE;
NOTICE: installing required extension "cube"
CREATE EXTENSION

Remember that CREATE EXTENSION … CASCADE will only work if all the extensions it tries to install have already been placed in the appropriate location.

Summary

In this article, we covered reading the fine manual (RTFM), planning a new database, changing parameters in your programs, changing the server configuration, updating the parameter files, and enabling and managing modules that are already installed.

Resources for Article: Further resources on this subject: PostgreSQL in Action [article] PostgreSQL Cookbook - High Availability and Replication [article] Running a PostgreSQL Database Server [article]

What is D3.js?

In this article by Ændrew H. Rininsland, the author of the book Learning D3.JS 4.x Data Visualization, we'll see what is new in D3 v4 and get started with Node and Git on the command line. (For more resources related to this topic, see here.) D3 (Data-Driven Documents), developed by Mike Bostock and the D3 community since 2011, is the successor to Bostock's earlier Protovis library. It allows pixel-perfect rendering of data by abstracting the calculation of things such as scales and axes into an easy-to-use domain-specific language (DSL), and uses idioms that should be immediately familiar to anyone with experience of using the popular jQuery JavaScript library. Much like jQuery, in D3, you operate on elements by selecting them and then manipulating via a chain of modifier functions. Especially within the context of data visualization, this declarative approach makes using it easier and more enjoyable than a lot of other tools out there. The official website, https://d3js.org/, features many great examples that show off the power of D3, but understanding them is tricky to start with. After finishing this book, you should be able to understand D3 well enough to figure out the examples, tweaking them to fit your needs. If you want to follow the development of D3 more closely, check out the source code hosted on GitHub at https://github.com/d3. The fine-grained control and its elegance make D3 one of the most powerful open source visualization libraries out there. This also means that it's not very suitable for simple jobs such as drawing a line chart or two-in that case you might want to use a library designed for charting. Many use D3 internally anyway. For a massive list, visit https://github.com/sorrycc/awesome-javascript#data-visualization. D3 is ultimately based around functional programming principles, which is currently experience a renaissance in the JavaScript community. This book really isn't about functional programming, but a lot of what we'll be doing will seem really familiar if you've ever used functional programming principles before. What happened to all the classes?! The second edition of this book contained quite a number of examples using the new class feature that is new in ES2015. The revised examples in this edition all use factory functions instead, and the class keyword never appears. Why is this, exactly? ES2015 classes are essentially just syntactic sugaring for factory functions. By this I mean that they ultimately compile down to that anyway. While classes can provide a certain level of organization to a complex piece of code, they ultimately hide what is going on underneath it all. Not only that, using OO paradigms like classes are effectively avoiding one of the most powerful and elegant aspects of JavaScript as a language, which is its focus on first-class functions and objects. Your code will be simpler and more elegant using functional paradigms than OO, and you'll find it less difficult to read examples in the D3 community, which almost never use classes. There are many, much more comprehensive arguments against using classes than I'm able to make here. For one of the best, please read Eric Elliott's excellent "The Two Pillars of JavaScript" pieces, at medium.com/javascript-scene/the-two-pillars-of-javascript-ee6f3281e7f3. What's new in D3 v4? One of the key changes to D3 since the last edition of this book is the release of version 4. Among its many changes, the most significant is a complete overhaul of the D3 namespace. 
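To give a flavour of what that overhaul looks like in practice, here is a small, illustrative comparison of the two APIs (each half assumes its own version of the library, so don't mix them in one file):

    // D3 3.x -- nested namespaces
    var x    = d3.scale.linear().domain([0, 10]).range([0, 500]);
    var axis = d3.svg.axis().scale(x).orient('bottom');

    // D3 4.x -- flat, descriptive names
    const x2    = d3.scaleLinear().domain([0, 10]).range([0, 500]);
    const axis2 = d3.axisBottom(x2);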
This means that none of the examples in this book will work with D3 3.x, and the examples from the last book will not work with D3 4.x. This is quite possibly the cruelest thing Mr. Bostock could ever do to educational authors such as myself (I am totally joking here). Kidding aside, it also means many of the "block" examples in the D3 community are out-of-date and may appear rather odd if this book is your first encounter with the library. For this reason, it is very important to note the version of D3 an example uses - if it uses 3.x, it might be worth searching for a 4.x example just to prevent this cognitive dissonance. Related to this is how D3 has been broken up from a single library into many smaller libraries. There are two approaches you can take: you can use D3 as a single library in much the same way as version 3, or you can selectively use individual components of D3 in your project. This book takes the latter route, even if it does take a bit more effort - the benefit is primarily in that you'll have a better idea of how D3 is organized as a library and it reduces the size of the final bundle people who view your graphics will have to download. What's ES2017? One of the main changes to this book since the first edition is the emphasis on modern JavaScript; in this case, ES2017. Formerly known as ES6 (Harmony), it pushes the JavaScript language's features forward significantly, allowing for new usage patterns that simplify code readability and increase expressiveness. If you've written JavaScript before and the examples in this article look pretty confusing, it means you're probably familiar with the older, more common ES5 syntax. But don't sweat! It really doesn't take too long to get the hang of the new syntax, and I will try to explain the new language features as we encounter them. Although it might seem a somewhat steep learning curve at the start, by the end, you'll have improved your ability to write code quite substantially and will be on the cutting edge of contemporary JavaScript development. For a really good rundown of all the new toys you have with ES2016, check out this nice guide by the folks at Babel.js, which we will use extensively throughout this book: https://babeljs.io/docs/learn-es2015/. Before I go any further, let me clear some confusion about what ES2017 actually is. Initially, the ECMAScript (or ES for short) standards were incremented by cardinal numbers, for instance, ES4, ES5, ES6, and ES7. However, with ES6, they changed this so that a new standard is released every year in order to keep pace with modern development trends, and thus we refer to the year (2017) now. The big release was ES2015, which more or less maps to ES6. ES2016 was ratified in June 2016, and builds on the previous year's standard, while adding a few fixes and two new features. ES2017 is currently in the draft stage, which means proposals for new features are being considered and developed until it is ratified sometime in 2017. As a result of this book being written while these features are in draft, they may not actually make it into ES2017 and thus need to wait until a later standard to be officially added to the language. You don't really need to worry about any of this, however, because we use Babel.js to transpile everything down to ES5 anyway, so it runs the same in Node.js and in the browser. 
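If the newer syntax looks unfamiliar, this short snippet shows a few of the ES2015 features that appear most often in the examples: arrow functions, const, template literals, and destructuring (the data here is invented purely for illustration):

    const data = [{ name: 'Ada', score: 42 }, { name: 'Grace', score: 58 }];

    // arrow functions and template literals
    const labels = data.map(d => `${d.name}: ${d.score}`);

    // destructuring with a default value
    const describe = ({ name, score = 0 }) => `${name} scored ${score}`;

    console.log(labels);            // [ 'Ada: 42', 'Grace: 58' ]
    console.log(describe(data[1])); // Grace scored 58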
I try to refer to the relevant spec where a feature is added when I introduce it for the sake of accuracy (for instance, modules are an ES2015 feature), but when I refer to JavaScript, I mean all modern JavaScript, regardless of which ECMAScript spec it originated in. Getting started with Node and Git on the command line I will try not to be too opinionated in this book about which editor or operating system you should use to work through it (though I am using Atom on Mac OS X), but you are going to need a few prerequisites to start. The first is Node.js. Node is widely used for web development nowadays, and it's actually just JavaScript that can be run on the command line. Later on in this book, I'll show you how to write a server application in Node, but for now, let's just concentrate on getting it and npm (the brilliant and amazing package manager that Node uses) installed. If you're on Windows or Mac OS X without Homebrew, use the installer at https://nodejs.org/en/. If you're on Mac OS X and are using Homebrew, I would recommend installing "n" instead, which allows you to easily switch between versions of Node: $ brew install n $ n latest Regardless of how you do it, once you finish, verify by running the following lines: $ node --version $ npm --version If it displays the versions of node and npm it means you're good to go. I'm using 6.5.0 and 3.10.3, respectively, though yours might be slightly different-- the key is making sure node is at least version 6.0.0. If it says something similar to Command not found, double-check whether you've installed everything correctly, and verify that Node.js is in your $PATH environment variable. In the last edition of this book, we did a bunch of annoying stuff with Webpack and Babel and it was a bit too configuration-heavy to adequately explain. This time around we're using the lovely jspm for everything, which handles all the finicky annoying stuff for us. Install it now, using npm: npm install -g jspm@beta jspm-server This installs the most up-to-date beta version of jspm and the jspm development server. We don't need Webpack this time around because Rollup (which is used to bundle D3 itself) is used to bundle our projects, and jspm handles our Babel config for us. How helpful! Next, you'll want to clone the book's repository from GitHub. Change to your project directory and type this: $ git clone https://github.com/aendrew/learning-d3-v4 $ cd $ learning-d3-v4 This will clone the development environment and all the samples in the learning-d3-v4/ directory, as well as switch you into it. Another option is to fork the repository on GitHub and then clone your fork instead of mine as was just shown. This will allow you to easily publish your work on the cloud, enabling you to more easily seek support, display finished projects on GitHub Pages, and even submit suggestions and amendments to the parent project. This will help us improve this book for future editions. To do this, fork aendrew/learning-d3-v4 by clicking the "fork" button on GitHub, and replace aendrew in the preceding code snippet with your GitHub username. To switch between them, type the following command: $ git checkout <folder name> Replace <folder name> with the appropriate name of your folder. Stay at master for now though. To get back to it, type this line: $ git stash save && git checkout master The master branch is where you'll do a lot of your coding as you work through this book. 
It includes a prebuilt config.js file (used by jspm to manage dependencies), which we'll use to aid our development over the course of this book. We still need to install our dependencies, so let's do that now: $ npm install All of the source code that you'll be working on is in the lib/ folder. You'll notice it contains a just a main.js file; almost always, we'll be working in main.js, as index.html is just a minimal container to display our work in. This is it in its entirety, and it's the last time we'll look at any HTML in this book: <!DOCTYPE html> <html> <head> <meta charset="utf-8"> <title>Learning D3</title> </head> <body> <script src="jspm_packages/system.js"></script> <script src="config.js"></script> <script> System.import('lib/main.js'); </script> </body> </html> There's also an empty stylesheet in styles/index.css, which we'll add to in a bit. To get things rolling, start the development server by typing the following line: $ npm start This starts up the jspm development server, which will transform our new-fangled ES2017 JavaScript into backwards-compatible ES5, which can easily be loaded by most browsers. Instead of loading in a compiled bundle, we use SystemJS directly and load in main.js. When we're ready for production, we'll use jspm bundle to create an optimized JS payload. Now point Chrome (or whatever, I'm not fussy - so long as it's not Internet Explorer!) to localhost:8080 and fire up the developer console ( Ctrl +  Shift + J for Linux and Windows and option + command + J for Mac). You should see a blank website and a blank JavaScript console with a Command Prompt waiting for some code: A quick Chrome Developer Tools primer Chrome Developer Tools are indispensable to web development. Most modern browsers have something similar, but to keep this book shorter, we'll stick to just Chrome here for the sake of simplicity. Feel free to use a different browser. Firefox's Developer Edition is particularly nice, and - yeah yeah, I hear you guys at the back; Opera is good too! We are mostly going to use the Elements and Console tabs, Elements to inspect the DOM and Console to play with JavaScript code and look for any problems. The other six tabs come in handy for large projects: The Network tab will let you know how long files are taking to load and help you inspect the Ajax requests. The Profiles tab will help you profile JavaScript for performance. The Resources tab is good for inspecting client-side data. Timeline and Audits are useful when you have a global variable that is leaking memory and you're trying to work out exactly why your library is suddenly causing Chrome to use 500 MB of RAM. While I've used these in D3 development, they're probably more useful when building large web applications with frameworks such as React and Angular. The main one you want to focus on, however, is Sources, which shows all the source code files that have been pulled in by the webpage. Not only is this useful in determining whether your code is actually loading, it contains a fully functional JavaScript debugger, which few mortals dare to use. While explaining how to debug code is kind of boring and not at the level of this article, learning to use breakpoints instead of perpetually using console.log to figure out what your code is doing is a skill that will take you far in the years to come. 
For a good overview, visit https://developers.google.com/web/tools/chrome-devtools/debug/breakpoints/step-code?hl=en Most of what you'll do with Developer Tools, however, is look at the CSS inspector at the right-hand side of the Elements tab. It can tell you what CSS rules are impacting the styling of an element, which is very good for hunting rogue rules that are messing things up. You can also edit the CSS and immediately see the results, as follows: Summary In this article, you learned what D3 is and took a glance at the core philosophy behind how it works. You also set up your computer for prototyping of ideas and to play with visualizations. Resources for Article: Further resources on this subject: Learning D3.js Mapping [article] Integrating a D3.js visualization into a simple AngularJS application [article] Simple graphs with d3.js [article]

Data Pipelines

In this article by Andrew Morgan, Antoine Amend, Matthew Hallett, David George, the author of the book Mastering Spark for Data Science, readers will learn how to construct a content registerand use it to track all input loaded to the system, and to deliver metrics on ingestion pipelines, so that these flows can be reliably run as an automated, lights-out process. Readers will learn how to construct a content registerand use it to track all input loaded to the system, and to deliver metrics on ingestion pipelines, so that these flows can be reliably run as an automated, lights-out process. In this article we will cover the following topics: Welcome the GDELT Dataset Data Pipelines Universal Ingestion Framework Real-time monitoring for new data Receiving Streaming Data via Kafka Registering new content and vaulting for tracking purposes Visualization of content metrics in Kibana - to monitor ingestion processes & data health   (For more resources related to this topic, see here.) Data Pipelines Even with the most basic of analytics, we always require some data. In fact, finding the right data is probably among the hardest problems to solve in data science (but that’s a whole topic for another book!). We have already seen that the way in which we obtain our data can be as simple or complicated as is needed. In practice, we can break this decision into two distinct areas: Ad-hoc and scheduled. Ad-hoc data acquisition is the most common method during prototyping and small scale analytics as it usually doesn’t require any additional software to implement - the user requires some data and simply downloads it from source as and when required. This method is often a matter of clicking on a web link and storing the data somewhere convenient, although the data may still need to be versioned and secure. Scheduled data acquisition is used in more controlled environments for large scale and production analytics, there is also an excellent case for ingesting a dataset into a data lake for possible future use. With Internet of Things (IoT) on the increase, huge volumes of data are being produced, in many cases if the data is not ingested now it is lost forever. Much of this data may not have an immediate or apparent use today, but could do in the future; so the mind-set is to gather all of the data in case it is needed and delete it later when sure it is not. It’s clear we need a flexible approach to data acquisition that supports a variety of procurement options. Universal Ingestion Framework There are many ways to approach data acquisition ranging from home grown bash scripts through to high-end commercial tools. The aim of this section is to introduce a highly flexible framework that we can use for small scale data ingest, and then grow as our requirements change - all the way through to a full corporately managed workflow if needed - that framework will be build using Apache NiFi. NiFi enables us to build large-scale integrated data pipelines that move data around the planet. In addition, it’s also incredibly flexible and easy to build simple pipelines - usually quicker even than using Bash or any other traditional scripting method. If an ad-hoc approach is taken to source the same dataset on a number of occasions, then some serious thought should be given as to whether it falls into the scheduled category, or at least whether a more robust storage and versioning setup should be introduced. 
We have chosen to use Apache NiFi as it offers a solution that provides the ability to create many, varied complexity pipelines that can be scaled to truly Big Data and IoT levels, and it also provides a great drag & drop interface (using what’s known as flow-based programming[1]). With patterns, templates and modules for workflow production, it automatically takes care of many of the complex features that traditionally plague developers such as multi-threading, connection management and scalable processing. For our purposes it will enable us to quickly build simple pipelines for prototyping, and scale these to full production where required. It’s pretty well documented and easy to get running https://nifi.apache.org/download.html, it runs in a browser and looks like this: https://en.wikipedia.org/wiki/Flow-based_programming We leave the installation of NiFi as an exercise for the reader - which we would encourage you to do - as we will be using it in the following section. Introducing the GDELT News Stream Hopefully, we have NiFi up and running now and can start to ingest some data. So let’s start with some global news media data from GDELT. Here’s our brief, taken from the GDELT website http://blog.gdeltproject.org/gdelt-2-0-our-global-world-in-realtime/: “Within 15 minutes of GDELT monitoring a news report breaking anywhere the world, it has translated it, processed it to identify all events, counts, quotes, people, organizations, locations, themes, emotions, relevant imagery, video, and embedded social media posts, placed it into global context, and made all of this available via a live open metadata firehose enabling open research on the planet itself. [As] the single largest deployment in the world of sentiment analysis, we hope that by bringing together so many emotional and thematic dimensions crossing so many languages and disciplines, and applying all of it in realtime to breaking news from across the planet, that this will spur an entirely new era in how we think about emotion and the ways in which it can help us better understand how we contextualize, interpret, respond to, and understand global events.” In order to start consuming this open data, we’ll need to hook into that metadata firehose and ingest the news streams onto our platform.  How do we do this?  Let’s start by finding out what data is available. Discover GDELT Real-time GDELT publish a list of the latest files on their website - this list is updated every 15 minutes. In NiFi, we can setup a dataflow that will poll the GDELT website, source a file from this list and save it to HDFS so we can use it later. Inside the NiFi dataflow designer, create a HTTP connector by dragging a processor onto the canvas and selecting GetHTTP. To configure this processor, you’ll need to enter the URL of the file list as: http://data.gdeltproject.org/gdeltv2/lastupdate.txt And also provide a temporary filename for the file list you will download. In the example below, we’ve used the NiFi’s expression language to generate a universally unique key so that files are not overwritten (UUID()). It’s worth noting that with this type of processor (GetHTTP), NiFi supports a number of scheduling and timing options for the polling and retrieval. For now, we’re just going to use the default options and let NiFi manage the polling intervals for us. An example of latest file list from GDELT is shown below. Next, we will parse the URL of the GKG news stream so that we can fetch it in a moment. 
Create a Regular Expression parser by dragging a processor onto the canvas and selecting ExtractText. Now position the new processor underneath the existing one and drag a line from the top processor to the bottom one. Finish by selecting the success relationship in the connection dialog that pops up. This is shown in the example below. Next, let’s configure the ExtractText processor to use a regular expression that matches only the relevant text of the file list, for example: ([^ ]*gkg.csv.*) From this regular expression, NiFi will create a new property (in this case, called url) associated with the flow design, which will take on a new value as each particular instance goes through the flow. It can even be configured to support multiple threads. Again, this is example is shown below. It’s worth noting here that while this is a fairly specific example, the technique is deliberately general purpose and can be used in many situations. Our First GDELT Feed Now that we have the URL of the GKG feed, we fetch it by configuring an InvokeHTTP processor to use the url property we previously created as it’s remote endpoint, and dragging the line as before. All that remains is to decompress the zipped content with a UnpackContent processor (using the basic zip format) and save to HDFS using a PutHDFS processor, like so: Improving with Publish and Subscribe So far, this flow looks very “point-to-point”, meaning that if we were to introduce a new consumer of data, for example, a Spark-streaming job, the flow must be changed. For example, the flow design might have to change to look like this: If we add yet another, the flow must change again. In fact, each time we add a new consumer, the flow gets a little more complicated, particularly when all the error handling is added. This is clearly not always desirable, as introducing or removing consumers (or producers) of data, might be something we want to do often, even frequently. Plus, it’s also a good idea to try to keep your flows as simple and reusable as possible. Therefore, for a more flexible pattern, instead of writing directly to HDFS, we can publish to Apache Kafka. This gives us the ability to add and remove consumers at any time without changing the data ingestion pipeline. We can also still write to HDFS from Kafka if needed, possibly even by designing a separate NiFi flow, or connect directly to Kafka using Spark-streaming. To do this, we create a Kafka writer by dragging a processor onto the canvas and selecting PutKafka. We now have a simple flow that continuously polls for an available file list, routinely retrieving the latest copy of a new stream over the web as it becomes available, decompressing the content and streaming it record-by-record into Kafka, a durable, fault-tolerant, distributed message queue, for processing by spark-streaming or storage in HDFS. And what’s more, without writing a single line of bash! Content Registry We have seen in this article that data ingestion is an area that is often overlooked, and that its importance cannot be underestimated. At this point we have a pipeline that enables us to ingest data from a source, schedule that ingest and direct the data to our repository of choice. But the story does not end there. Now we have the data, we need to fulfil our data management responsibilities. Enter the content registry. We’re going to build an index of metadata related to that data we have ingested. 
The data itself will still be directed to storage (HDFS, in our example) but, in addition, we will store metadata about the data, so that we can track what we’ve received and understand basic information about it, such as, when we received it, where it came from, how big it is, what type it is, etc. Choices and More Choices The choice of which technology we use to store this metadata is, as we have seen, one based upon knowledge and experience. For metadata indexing, we will require at least the following attributes: Easily searchable Scalable Parallel write ability Redundancy There are many ways to meet these requirements, for example we could write the metadata to Parquet, store in HDFS and search using Spark SQL. However, here we will use Elasticsearch as it meets the requirements a little better, most notably because it facilitates low latency queries of our metadata over a REST API - very useful for creating dashboards. In fact, Elasticsearch has the advantage of integrating directly with Kibana, meaning it can quickly produce rich visualizations of our content registry. For this reason, we will proceed with Elasticsearch in mind. Going with the Flow Using our current NiFi pipeline flow, let’s fork the output from “Fetch GKG files from URL” to add an additional set of steps to allow us to capture and store this metadata in Elasticsearch. These are: Replace the flow content with our metadata model Capture the metadata Store directly in Elasticsearch Here’s what this looks like in NiFi: Metadata Model So, the first step here is to define our metadata model. And there are many areas we could consider, but let’s select a set that helps tackle a few key points from earlier discussions. This will provide a good basis upon which further data can be added in the future, if required. So, let’s keep it simple and use the following three attributes: File size Date ingested File name These will provide basic registration of received files. Next, inside the NiFi flow, we’ll need to replace the actual data content with this new metadata model. An easy way to do this, is to create a JSON template file from our model. We’ll save it to local disk and use it inside a FetchFile processor to replace the flow’s content with this skeleton object. This template will look something like: { "FileSize": SIZE, "FileName": "FILENAME", "IngestedDate": "DATE" } Note the use of placeholder names (SIZE, FILENAME, DATE) in place of the attribute values. These will be substituted, one-by-one, by a sequence of ReplaceText processors, that swap the placeholder names for an appropriate flow attribute using regular expressions provided by the NiFi Expression Language, for example DATE becomes ${now()}. The last step is to output the new metadata payload to Elasticsearch. Once again, NiFi comes ready with a processor for this; the PutElasticsearch processor. An example metadata entry in Elasticsearch: { "_index": "gkg", "_type": "files", "_id": "AVZHCvGIV6x-JwdgvCzW", "_score": 1, "source": { "FileSize": 11279827, "FileName": "20150218233000.gkg.csv.zip", "IngestedDate": "2016-08-01T17:43:00+01:00" } } Now that we have added the ability to collect and interrogate metadata, we now have access to more statistics that can be used for analysis. This includes: Time based analysis e.g. file sizes over time Loss of data, for example are there data “holes” in the timeline? If there is a particular analytic that is required, the NIFI metadata component can be adjusted to provide the relevant data points. 
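As a hedged example of the first point, a time-based view of file sizes can be pulled straight out of the metadata index with a single aggregation query. This assumes the gkg/files index created above and an Elasticsearch 2.x/5.x-style API; adjust the endpoint and syntax to match your version:

    curl -s -XPOST 'http://localhost:9200/gkg/files/_search?size=0' -d '{
      "aggs": {
        "per_day": {
          "date_histogram": { "field": "IngestedDate", "interval": "day" },
          "aggs": {
            "avg_size":   { "avg": { "field": "FileSize" } },
            "total_size": { "sum": { "field": "FileSize" } }
          }
        }
      }
    }'

A day whose bucket has a conspicuously low doc_count is also a quick way to spot the data "holes" mentioned above.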
Indeed, an analytic could be built to look at historical data and update the index accordingly if the metadata does not exist in current data. Kibana Dashboard We have mentioned Kibana a number of times in this article, now that we have an index of metadata in Elasticsearch, we can use the tool to visualize some analytics. The purpose of this brief section is to demonstrate that we can immediately start to model and visualize our data. In this simple example we have completed the following steps: Added the Elasticsearch index for our GDELT metadata to the “Settings” tab Selected “file size” under the “Discover” tab Selected Visualize for “file size” Changed the Aggregation field to “Range” Entered values for the ranges The resultant graph displays the file size distribution: From here we are free to create new visualizations or even a fully featured dashboard that can be used to monitor the status of our file ingest. By increasing the variety of metadata written to Elasticsearch from NiFi, we can make more fields available in Kibana and even start our data science journey right here with some ingest based actionable insights. Now that we have a fully-functioning data pipeline delivering us real-time feeds of data, how do we ensure data quality of the payload we are receiving?  Let’s take a look at the options. Quality Assurance With an initial data ingestion capability implemented, and data streaming onto your platform, you will need to decide how much quality assurance is required at the front door. It’s perfectly viable to start with no initial quality controls and build them up over time (retrospectively scanning historical data as time and resources allow). However, it may be prudent to install a basic level of verification to begin with. For example, basic checks such as file integrity, parity checking, completeness, checksums, type checking, field counting, overdue files, security field pre-population, denormalization, etc. You should take care that your up-front checks do not take too long. Depending on the intensity of your examinations and the size of your data, it’s not uncommon to encounter a situation where there is not enough time to perform all processing before the next dataset arrives. You will always need to monitor your cluster resources and calculate the most efficient use of time. Here are some examples of the type of rough capacity planning calculation you can perform: Example 1: Basic Quality Checking, No Contending Users Data is ingested every 15 minutes and takes 1 minute to pull from the source Quality checking (integrity, field count, field pre-population) takes 4 minutes There are no other users on the compute cluster There are 10 minutes of resources available for other tasks. As there are no other users on the cluster, this is satisfactory - no action needs to be taken. Example 2: Advanced Quality Checking, No Contending Users Data is ingested every 15 minutes and takes 1 minute to pull from the source Quality checking (integrity, field count, field pre-population, denormalization, sub dataset building) takes 13 minutes There are no other users on the compute cluster There is only 1 minute of resource available for other tasks. 
We probably need to consider, either: Configuring a resource scheduling policy Reducing the amount of data ingested Reducing the amount of processing we undertake Adding additional compute resources to the cluster Example 3: Basic Quality Checking, 50% Utility Due to Contending Users Data is ingested every 15 minutes and takes 1 minute to pull from the source Quality checking (integrity, field count, field pre-population) takes 4 minutes (100% utility) There are other users on the compute cluster There are 6 minutes of resources available for other tasks (15 - 1 - (4 * (100 / 50))). Since there are other users there is a danger that, at least some of the time, we will not be able to complete our processing and a backlog of jobs will occur. When you run into timing issues, you have a number of options available to you in order to circumvent any backlog: Negotiating sole use of the resources at certain times Configuring a resource scheduling policy, including: YARN Fair Scheduler: allows you to define queues with differing priorities and target your Spark jobs by setting the spark.yarn.queue property on start-up so your job always takes precedence Dynamicandr Resource Allocation: allows concurrently running jobs to automatically scale to match their utilization Spark Scheduler Pool: allows you to define queues when sharing a SparkContext using multithreading model, and target your Spark job by setting the spark.scheduler.pool property per execution thread so your thread takes precedence Running processing jobs overnight when the cluster is quiet In any case, you will eventually get a good idea of how the various parts to your jobs perform and will then be in a position to calculate what changes could be made to improve efficiency. There’s always the option of throwing more resources at the problem, especially when using a cloud provider, but we would certainly encourage the intelligent use of existing resources - this is far more scalable, cheaper and builds data expertise. Summary In this article we walked through the full setup of an Apache NiFi GDELT ingest pipeline, complete with metadata forks and a brief introduction to visualizing the resultant data. This section is particularly important as GDELT is used extensively throughout the book and the NiFi method is a highly effective way to source data in a scalable and modular way. Resources for Article: Further resources on this subject: Integration with Continuous Delivery [article] Amazon Web Services [article] AWS Fundamentals [article]

The NumPy array object

In this article by Armando Fandango author of the book Python Data Analysis - Second Edition, discuss how the NumPy provides a multidimensional array object called ndarray. NumPy arrays are typed arrays of fixed size. Python lists are heterogeneous and thus elements of a list may contain any object type, while NumPy arrays are homogenous and can contain object of only one type. An ndarray consists of two parts, which are as follows: The actual data that is stored in a contiguous block of memory The metadata describing the actual data Since the actual data is stored in a contiguous block of memory hence loading of the large data set as ndarray is affected by availability of large enough contiguous block of memory. Most of the array methods and functions in NumPy leave the actual data unaffected and only modify the metadata. Actually, we made a one-dimensional array that held a set of numbers. The ndarray can have more than a single dimension. (For more resources related to this topic, see here.) Advantages of NumPy arrays The NumPy array is, in general, homogeneous (there is a particular record array type that is heterogeneous)—the items in the array have to be of the same type. The advantage is that if we know that the items in an array are of the same type, it is easy to ascertain the storage size needed for the array. NumPy arrays can execute vectorized operations, processing a complete array, in contrast to Python lists, where you usually have to loop through the list and execute the operation on each element. NumPy arrays are indexed from 0, just like lists in Python. NumPy utilizes an optimized C API to make the array operations particularly quick. We will make an array with the arange() subroutine again. You will see snippets from Jupyter Notebook sessions where NumPy is already imported with instruction import numpy as np. Here's how to get the data type of an array: In: a = np.arange(5) In: a.dtype Out: dtype('int64') The data type of the array a is int64 (at least on my computer), but you may get int32 as the output if you are using 32-bit Python. In both the cases, we are dealing with integers (64 bit or 32 bit). Besides the data type of an array, it is crucial to know its shape. A vector is commonly used in mathematics but most of the time we need higher-dimensional objects. Let's find out the shape of the vector we produced a few minutes ago: In: a Out: array([0, 1, 2, 3, 4]) In: a.shape Out: (5,) As you can see, the vector has five components with values ranging from 0 to 4. The shape property of the array is a tuple; in this instance, a tuple of 1 element, which holds the length in each dimension. Creating a multidimensional array Now that we know how to create a vector, we are set to create a multidimensional NumPy array. After we produce the matrix, we will again need to show its, as demonstrated in the following code snippets: Create a multidimensional array as follows: In: m = np.array([np.arange(2), np.arange(2)]) In: m Out: array([[0, 1], [0, 1]]) We can show the array shape as follows: In: m.shape Out: (2, 2) We made a 2 x 2 array with the arange() subroutine. The array() function creates an array from an object that you pass to it. The object has to be an array, for example, a Python list. In the previous example, we passed a list of arrays. The object is the only required parameter of the array() function. NumPy functions tend to have a heap of optional arguments with predefined default options. 
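As a small illustration of those optional arguments (using the same np import as the sessions above; the values are chosen purely for demonstration), dtype forces a homogeneous element type and ndmin pads the result out to a minimum number of dimensions:
In: m = np.array([[1, 2], [3, 4]], dtype=float, ndmin=3)
In: m.shape
Out: (1, 2, 2)
In: m.dtype
Out: dtype('float64')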
Selecting NumPy array elements From time to time, we will wish to select a specific constituent of an array. We will take a look at how to do this, but to kick off, let's make a 2 x 2 matrix again: In: a = np.array([[1,2],[3,4]]) In: a Out: array([[1, 2], [3, 4]]) The matrix was made this time by giving the array() function a list of lists. We will now choose each item of the matrix one at a time, as shown in the following code snippet. Recall that the index numbers begin from 0: In: a[0,0] Out: 1 In: a[0,1] Out: 2 In: a[1,0] Out: 3 In: a[1,1] Out: 4 As you can see, choosing elements of an array is fairly simple. For the array a, we just employ the notation a[m,n], where m and n are the indices of the item in the array. Have a look at the following figure for your reference: NumPy numerical types Python has an integer type, a float type, and complex type; nonetheless, this is not sufficient for scientific calculations. In practice, we still demand more data types with varying precisions and, consequently, different storage sizes of the type. For this reason, NumPy has many more data types. The bulk of the NumPy mathematical types ends with a number. This number designates the count of bits related to the type. The following table (adapted from the NumPy user guide) presents an overview of NumPy numerical types: Type Description bool Boolean (True or False) stored as a bit inti Platform integer (normally either int32 or int64) int8 Byte (-128 to 127) int16 Integer (-32768 to 32767) int32 Integer (-2 ** 31 to 2 ** 31 -1) int64 Integer (-2 ** 63 to 2 ** 63 -1) uint8 Unsigned integer (0 to 255) uint16 Unsigned integer (0 to 65535) uint32 Unsigned integer (0 to 2 ** 32 - 1) uint64 Unsigned integer (0 to 2 ** 64 - 1) float16 Half precision float: sign bit, 5 bits exponent, and 10 bits mantissa float32 Single precision float: sign bit, 8 bits exponent, and 23 bits mantissa float64 or float Double precision float: sign bit, 11 bits exponent, and 52 bits mantissa complex64 Complex number, represented by two 32-bit floats (real and imaginary components) complex128 or complex Complex number, represented by two 64-bit floats (real and imaginary components) For each data type, there exists a matching conversion function: In: np.float64(42) Out: 42.0 In: np.int8(42.0) Out: 42 In: np.bool(42) Out: True In: np.bool(0) Out: False In: np.bool(42.0) Out: True In: np.float(True) Out: 1.0 In: np.float(False) Out: 0.0 Many functions have a data type argument, which is frequently optional: In: np.arange(7, dtype= np.uint16) Out: array([0, 1, 2, 3, 4, 5, 6], dtype=uint16) It is important to be aware that you are not allowed to change a complex number into an integer. Attempting to do that sparks off a TypeError: In: np.int(42.0 + 1.j) Traceback (most recent call last): <ipython-input-24-5c1cd108488d> in <module>() ----> 1 np.int(42.0 + 1.j) TypeError: can't convert complex to int The same goes for conversion of a complex number into a floating-point number. By the way, the j component is the imaginary coefficient of a complex number. Even so, you can convert a floating-point number to a complex number, for example, complex(1.0). The real and imaginary pieces of a complex number can be pulled out with the real() and imag() functions, respectively. Data type objects Data type objects are instances of the numpy.dtype class. Once again, arrays have a data type. To be exact, each element in a NumPy array has the same data type. The data type object can tell you the size of the data in bytes. 
The size in bytes is given by the itemsize property of the dtype class:
In: a.dtype.itemsize
Out: 8
Character codes
Character codes are included for backward compatibility with Numeric. Numeric is the predecessor of NumPy. Their use is not recommended, but the codes are supplied here because they pop up in various locations. You should use the dtype object instead. The following table lists several different data types and the character codes related to them:
Type                     Character code
integer                  i
unsigned integer         u
single precision float   f
double precision float   d
bool                     b
complex                  D
string                   S
unicode                  U
void                     V
Take a look at the following code to produce an array of single precision floats:
In: np.arange(7, dtype='f')
Out: array([ 0., 1., 2., 3., 4., 5., 6.], dtype=float32)
Likewise, the following code creates an array of complex numbers:
In: np.arange(7, dtype='D')
Out: array([ 0.+0.j, 1.+0.j, 2.+0.j, 3.+0.j, 4.+0.j, 5.+0.j, 6.+0.j])
The dtype constructors
We have a variety of means to create data types. Take the case of floating-point data (have a look at dtypeconstructors.py in this book's code bundle):
We can use the general Python float, as shown in the following lines of code:
In: np.dtype(float)
Out: dtype('float64')
We can specify a single precision float with a character code:
In: np.dtype('f')
Out: dtype('float32')
We can use a double precision float with a character code:
In: np.dtype('d')
Out: dtype('float64')
We can pass the dtype constructor a two-character code. The first character stands for the type; the second character is a number specifying the number of bytes in the type (the numbers 2, 4, and 8 correspond to floats of 16, 32, and 64 bits, respectively):
In: np.dtype('f8')
Out: dtype('float64')
A (truncated) list of all the full data type codes can be found by applying sctypeDict.keys():
In: np.sctypeDict.keys()
Out: dict_keys(['?', 0, 'byte', 'b', 1, 'ubyte', 'B', 2, 'short', 'h', 3, 'ushort', 'H', 4, 'i', 5, 'uint', 'I', 6, 'intp', 'p', 7, 'uintp', 'P', 8, 'long', 'l', 'L', 'longlong', 'q', 9, 'ulonglong', 'Q', 10, 'half', 'e', 23, 'f', 11, 'double', 'd', 12, 'longdouble', 'g', 13, 'cfloat', 'F', 14, 'cdouble', 'D', 15, 'clongdouble', 'G', 16, 'O', 17, 'S', 18, 'unicode', 'U', 19, 'void', 'V', 20, 'M', 21, 'm', 22, 'bool8', 'Bool', 'b1', 'float16', 'Float16', 'f2', 'float32', 'Float32', 'f4', 'float64', 'Float64', 'f8', 'float128', 'Float128', 'f16', 'complex64', 'Complex32', 'c8', 'complex128', 'Complex64', 'c16', 'complex256', 'Complex128', 'c32', 'object0', 'Object0', 'bytes0', 'Bytes0', 'str0', 'Str0', 'void0', 'Void0', 'datetime64', 'Datetime64', 'M8', 'timedelta64', 'Timedelta64', 'm8', 'int64', 'uint64', 'Int64', 'UInt64', 'i8', 'u8', 'int32', 'uint32', 'Int32', 'UInt32', 'i4', 'u4', 'int16', 'uint16', 'Int16', 'UInt16', 'i2', 'u2', 'int8', 'uint8', 'Int8', 'UInt8', 'i1', 'u1', 'complex_', 'int0', 'uint0', 'single', 'csingle', 'singlecomplex', 'float_', 'intc', 'uintc', 'int_', 'longfloat', 'clongfloat', 'longcomplex', 'bool_', 'unicode_', 'object_', 'bytes_', 'str_', 'string_', 'int', 'float', 'complex', 'bool', 'object', 'str', 'bytes', 'a'])
The dtype attributes
The dtype class has a number of useful properties.
For instance, we can get information about the character code of a data type through the properties of dtype: In: t = np.dtype('Float64') In: t.char Out: 'd' The type attribute corresponds to the type of object of the array elements: In: t.type Out: numpy.float64 The str attribute of dtype gives a string representation of a data type. It begins with a character representing endianness, if appropriate, then a character code, succeeded by a number corresponding to the number of bytes that each array item needs. Endianness, here, entails the way bytes are ordered inside a 32- or 64-bit word. In the big-endian order, the most significant byte is stored first, indicated by >. In the little-endian order, the least significant byte is stored first, indicated by <, as exemplified in the following lines of code: In: t.str Out: '<f8' One-dimensional slicing and indexing Slicing of one-dimensional NumPy arrays works just like the slicing of standard Python lists. Let's define an array containing the numbers 0, 1, 2, and so on up to and including 8. We can select a part of the array from indexes 3 to 7, which extracts the elements of the arrays 3 through 6: In: a = np.arange(9) In: a[3:7] Out: array([3, 4, 5, 6]) We can choose elements from indexes the 0 to 7 with an increment of 2: In: a[:7:2] Out: array([0, 2, 4, 6]) Just as in Python, we can use negative indices and reverse the array: In: a[::-1] Out: array([8, 7, 6, 5, 4, 3, 2, 1, 0]) Manipulating array shapes We have already learned about the reshape() function. Another repeating chore is the flattening of arrays. Flattening in this setting entails transforming a multidimensional array into a one-dimensional array. Let us create an array b that we shall use for practicing the further examples: In: b = np.arange(24).reshape(2,3,4) In: print(b) Out: [[[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11]], [[12, 13, 14, 15], [16, 17, 18, 19], [20, 21, 22, 23]]]) We can manipulate array shapes using the following functions: Ravel: We can accomplish this with the ravel() function as follows: In: b Out: array([[[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11]], [[12, 13, 14, 15], [16, 17, 18, 19], [20, 21, 22, 23]]]) In: b.ravel() Out: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]) Flatten: The appropriately named function, flatten(), does the same as ravel(). However, flatten() always allocates new memory, whereas ravel gives back a view of the array. This means that we can directly manipulate the array as follows: In: b.flatten() Out: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]) Setting the shape with a tuple: Besides the reshape() function, we can also define the shape straightaway with a tuple, which is exhibited as follows: In: b.shape = (6,4) In: b Out: array([[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11], [12, 13, 14, 15], [16, 17, 18, 19], [20, 21, 22, 23]]) As you can understand, the preceding code alters the array immediately. Now, we have a 6 x 4 array. Transpose: In linear algebra, it is common to transpose matrices. Transposing is a way to transform data. For a two-dimensional table, transposing means that rows become columns and columns become rows. 
We can do this too by using the following code: In: b.transpose() Out: array([[ 0, 4, 8, 12, 16, 20], [ 1, 5, 9, 13, 17, 21], [ 2, 6, 10, 14, 18, 22], [ 3, 7, 11, 15, 19, 23]]) Resize: The resize() method works just like the reshape() method, In: b.resize((2,12)) In: b Out: array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]]) Stacking arrays Arrays can be stacked horizontally, depth wise, or vertically. We can use, for this goal, the vstack(), dstack(), hstack(), column_stack(), row_stack(), and concatenate() functions. To start with, let's set up some arrays: In: a = np.arange(9).reshape(3,3) In: a Out: array([[0, 1, 2], [3, 4, 5], [6, 7, 8]]) In: b = 2 * a In: b Out: array([[ 0, 2, 4], [ 6, 8, 10], [12, 14, 16]]) As mentioned previously, we can stack arrays using the following techniques: Horizontal stacking: Beginning with horizontal stacking, we will shape a tuple of ndarrays and hand it to the hstack() function to stack the arrays. This is shown as follows: In: np.hstack((a, b)) Out: array([[ 0, 1, 2, 0, 2, 4], [ 3, 4, 5, 6, 8, 10], [ 6, 7, 8, 12, 14, 16]]) We can attain the same thing with the concatenate() function, which is shown as follows: In: np.concatenate((a, b), axis=1) Out: array([[ 0, 1, 2, 0, 2, 4], [ 3, 4, 5, 6, 8, 10], [ 6, 7, 8, 12, 14, 16]]) The following diagram depicts horizontal stacking: Vertical stacking: With vertical stacking, a tuple is formed again. This time it is given to the vstack() function to stack the arrays. This can be seen as follows: In: np.vstack((a, b)) Out: array([[ 0, 1, 2], [ 3, 4, 5], [ 6, 7, 8], [ 0, 2, 4], [ 6, 8, 10], [12, 14, 16]]) The concatenate() function gives the same outcome with the axis parameter fixed to 0. This is the default value for the axis parameter, as portrayed in the following code: In: np.concatenate((a, b), axis=0) Out: array([[ 0, 1, 2], [ 3, 4, 5], [ 6, 7, 8], [ 0, 2, 4], [ 6, 8, 10], [12, 14, 16]]) Refer to the following figure for vertical stacking: Depth stacking: To boot, there is the depth-wise stacking employing dstack() and a tuple, of course. This entails stacking a list of arrays along the third axis (depth). For example, we could stack 2D arrays of image data on top of each other as follows: In: np.dstack((a, b)) Out: array([[[ 0, 0], [ 1, 2], [ 2, 4]], [[ 3, 6], [ 4, 8], [ 5, 10]], [[ 6, 12], [ 7, 14], [ 8, 16]]]) Column stacking: The column_stack() function stacks 1D arrays column-wise. This is shown as follows: In: oned = np.arange(2) In: oned Out: array([0, 1]) In: twice_oned = 2 * oned In: twice_oned Out: array([0, 2]) In: np.column_stack((oned, twice_oned)) Out: array([[0, 0], [1, 2]]) 2D arrays are stacked the way the hstack() function stacks them, as demonstrated in the following lines of code: In: np.column_stack((a, b)) Out: array([[ 0, 1, 2, 0, 2, 4], [ 3, 4, 5, 6, 8, 10], [ 6, 7, 8, 12, 14, 16]]) In: np.column_stack((a, b)) == np.hstack((a, b)) Out: array([[ True, True, True, True, True, True], [ True, True, True, True, True, True], [ True, True, True, True, True, True]], dtype=bool) Yes, you guessed it right! We compared two arrays with the == operator. Row stacking: NumPy, naturally, also has a function that does row-wise stacking. 
It is named row_stack() and for 1D arrays, it just stacks the arrays in rows into a 2D array: In: np.row_stack((oned, twice_oned)) Out: array([[0, 1], [0, 2]]) The row_stack() function results for 2D arrays are equal to the vstack() function results: In: np.row_stack((a, b)) Out: array([[ 0, 1, 2], [ 3, 4, 5], [ 6, 7, 8], [ 0, 2, 4], [ 6, 8, 10], [12, 14, 16]]) In: np.row_stack((a,b)) == np.vstack((a, b)) Out: array([[ True, True, True], [ True, True, True], [ True, True, True], [ True, True, True], [ True, True, True], [ True, True, True]], dtype=bool) Splitting NumPy arrays Arrays can be split vertically, horizontally, or depth wise. The functions involved are hsplit(), vsplit(), dsplit(), and split(). We can split arrays either into arrays of the same shape or indicate the location after which the split should happen. Let's look at each of the functions in detail: Horizontal splitting: The following code splits a 3 x 3 array on its horizontal axis into three parts of the same size and shape (see splitting.py in this book's code bundle): In: a Out: array([[0, 1, 2], [3, 4, 5], [6, 7, 8]]) In: np.hsplit(a, 3) Out: [array([[0], [3], [6]]), array([[1], [4], [7]]), array([[2], [5], [8]])] Liken it with a call of the split() function, with an additional argument, axis=1: In: np.split(a, 3, axis=1) Out: [array([[0], [3], [6]]), array([[1], [4], [7]]), array([[2], [5], [8]])] Vertical splitting: vsplit() splits along the vertical axis: In: np.vsplit(a, 3) Out: [array([[0, 1, 2]]), array([[3, 4, 5]]), array([[6, 7, 8]])] The split() function, with axis=0, also splits along the vertical axis: In: np.split(a, 3, axis=0) Out: [array([[0, 1, 2]]), array([[3, 4, 5]]), array([[6, 7, 8]])] Depth-wise splitting: The dsplit() function, unsurprisingly, splits depth-wise. We will require an array of rank 3 to begin with: In: c = np.arange(27).reshape(3, 3, 3) In: c Out: array([[[ 0, 1, 2], [ 3, 4, 5], [ 6, 7, 8]], [[ 9, 10, 11], [12, 13, 14], [15, 16, 17]], [[18, 19, 20], [21, 22, 23], [24, 25, 26]]]) In: np.dsplit(c, 3) Out: [array([[[ 0], [ 3], [ 6]], [[ 9], [12], [15]], [[18], [21], [24]]]), array([[[ 1], [ 4], [ 7]], [[10], [13], [16]], [[19], [22], [25]]]), array([[[ 2], [ 5], [ 8]], [[11], [14], [17]], [[20], [23], [26]]])] NumPy array attributes Let's learn more about the NumPy array attributes with the help of an example. Let us create an array b that we shall use for practicing the further examples: In: b = np.arange(24).reshape(2, 12) In: b Out: array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]]) Besides the shape and dtype attributes, ndarray has a number of other properties, as shown in the following list: ndim gives the number of dimensions, as shown in the following code snippet: In: b.ndim Out: 2 size holds the count of elements. This is shown as follows: In: b.size Out: 24 itemsize returns the count of bytes for each element in the array, as shown in the following code snippet: In: b.itemsize Out: 8 If you require the full count of bytes the array needs, you can have a look at nbytes. 
This is just a product of the itemsize and size properties: In: b.nbytes Out: 192 In: b.size * b.itemsize Out: 192 The T property has the same result as the transpose() function, which is shown as follows: In: b.resize(6,4) In: b Out: array([[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11], [12, 13, 14, 15], [16, 17, 18, 19], [20, 21, 22, 23]]) In: b.T Out: array([[ 0, 4, 8, 12, 16, 20], [ 1, 5, 9, 13, 17, 21], [ 2, 6, 10, 14, 18, 22], [ 3, 7, 11, 15, 19, 23]]) If the array has a rank of less than 2, we will just get a view of the array: In: b.ndim Out: 1 In: b.T Out: array([0, 1, 2, 3, 4]) Complex numbers in NumPy are represented by j. For instance, we can produce an array with complex numbers as follows: In: b = np.array([1.j + 1, 2.j + 3]) In: b Out: array([ 1.+1.j, 3.+2.j]) The real property returns to us the real part of the array, or the array itself if it only holds real numbers: In: b.real Out: array([ 1., 3.]) The imag property holds the imaginary part of the array: In: b.imag Out: array([ 1., 2.]) If the array holds complex numbers, then the data type will automatically be complex as well: In: b.dtype Out: dtype('complex128') In: b.dtype.str Out: '<c16' The flat property gives back a numpy.flatiter object. This is the only means to get a flatiter object; we do not have access to a flatiter constructor. The flat iterator enables us to loop through an array as if it were a flat array, as shown in the following code snippet: In: b = np.arange(4).reshape(2,2) In: b Out: array([[0, 1], [2, 3]]) In: f = b.flat In: f Out: <numpy.flatiter object at 0x103013e00> In: for item in f: print(item) Out: 0 1 2 3 It is possible to straightaway obtain an element with the flatiter object: In: b.flat[2] Out: 2 Also, you can obtain multiple elements as follows: In: b.flat[[1,3]] Out: array([1, 3]) The flat property can be set. Setting the value of the flat property leads to overwriting the values of the entire array: In: b.flat = 7 In: b Out: array([[7, 7], [7, 7]]) We can also obtain selected elements as follows: In: b.flat[[1,3]] = 1 In: b Out: array([[7, 1], [7, 1]]) The next diagram illustrates various properties of ndarray: Converting arrays We can convert a NumPy array to a Python list with the tolist() function . The following is a brief explanation: Convert to a list: In: b Out: array([ 1.+1.j, 3.+2.j]) In: b.tolist() Out: [(1+1j), (3+2j)] The astype() function transforms the array to an array of the specified data type: In: b Out: array([ 1.+1.j, 3.+2.j]) In: b.astype(int) /usr/local/lib/python3.5/site-packages/ipykernel/__main__.py:1: ComplexWarning: Casting complex values to real discards the imaginary part … Out: array([1, 3]) In: b.astype('complex') Out: array([ 1.+1.j, 3.+2.j]) We are dropping off the imaginary part when casting from the complex type to int. The astype() function takes the name of a data type as a string too. The preceding code won't display a warning this time because we used the right data type. Summary In this article, we found out a heap about the NumPy basics: data types and arrays. Arrays have various properties that describe them. You learned that one of these properties is the data type, which, in NumPy, is represented by a full-fledged object. NumPy arrays can be sliced and indexed in an effective way, compared to standard Python lists. NumPy arrays have the extra ability to work with multiple dimensions. The shape of an array can be modified in multiple ways, such as stacking, resizing, reshaping, and splitting. 

Understanding Spark RDD

Packt
01 Mar 2017
17 min read
In this article by Asif Abbasi author of the book Learning Apache Spark 2.0, we will understand Spark RDD along with that we will learn, how to construct RDDs, Operations on RDDs, Passing functions to Spark in Scala, Java, and Python and Transformations such as map, filter, flatMap, and sample. (For more resources related to this topic, see here.) What is an RDD? What’s in a name might be true for a rose, but perhaps not for an Resilient Distributed Datasets (RDD), and in essence describes what an RDD is. They are basically datasets, which are distributed across a cluster (remember Spark framework is inherently based on an MPP architecture), and provide resilience (automatic failover) by nature. Before we go into any further detail, let’s try to understand this a little bit, and again we are trying to be as abstract as possible. Let us assume that you have a sensor data from aircraft sensors and you want to analyze the data irrespective of its size and locality. For example, an Airbus A350 has roughly 6000 sensors across the entire plane and generates 2.5 TB data per day, while the newer model expected to launch in 2020 will generate roughly 7.5 TB per day. From a data engineering point of view, it might be important to understand the data pipeline, but from an analyst and a data scientist point of view, your major concern is to analyze the data irrespective of the size and number of nodes across which it has been stored. This is where the neatness of the RDD concept comes into play, where the sensor data can be encapsulated as an RDD concept, and any transformation/action that you perform on the RDD applies across the entire dataset. Six month's worth of dataset for an A350 would be approximately 450 TBs of data, and would need to sit across multiple machines. For the sake of discussion, we assume that you are working on a cluster of four worker machines. Your data would be partitioned across the workers as follows: Figure 2-1: RDD split across a cluster The figure basically explains that an RDD is a distributed collection of the data, and the framework distributes the data across the cluster. Data distribution across a set of machines brings its own set of nuisances including recovering from node failures. RDD’s are resilient as they can be recomputed from the RDD lineage graph, which is basically a graph of the entire parent RDDs of the RDD. In addition to the resilience, distribution, and representing a data set, an RDD has various other distinguishing qualities: In Memory: An RDD is a memory resident collection of objects. We’ll look at options where an RDD can be stored in memory, on disk, or both. However, the execution speed of Spark stems from the fact that the data is in memory, and is not fetched from disk for each operation. Partitioned: A partition is a division of a logical dataset or constituent elements into independent parts. Partitioning is a defacto performance optimization technique in distributed systems to achieve minimal network traffic, a killer for high performance workloads. The objective of partitioning in key-value oriented data is to collocate similar range of keys and in effect minimize shuffling. Data inside RDD is split into partitions and across various nodes of the cluster. Typed: Data in an RDD is strongly typed. When you create an RDD, all the elements are typed depending on the data type. Lazy evaluation: The transformations in Spark are lazy, which means data inside RDD is not available until you perform an action. 
You can, however, make the data available at any time using a count() action on the RDD. We’ll discuss this later and the benefits associated with it. Immutable: An RDD once created cannot be changed. It can, however, be transformed into a new RDD by performing a set of transformations on it. Parallel: An RDD is operated on in parallel. Since the data is spread across a cluster in various partitions, each partition is operated on in parallel. Cacheable: Since RDD’s are lazily evaluated, any action on an RDD will cause the RDD to revaluate all transformations that led to the creation of RDD. This is generally not a desirable behavior on large datasets, and hence Spark allows the option to persist the data on memory or disk. A typical Spark program flow with an RDD includes: Creation of an RDD from a data source. A set of transformations, for example, filter, map, join, and so on. Persisting the RDD to avoid re-execution. Calling actions on the RDD to start performing parallel operations across the cluster. This is depicted in the following figure: Figure 2-2: Typical Spark RDD flow Operations on RDD Two major operation types can be performed on an RDD. They are called: Transformations Actions Transformations Transformations are operations that create a new dataset, as RDDs are immutable. They are used to transform data from one to another, which could result in amplification of the data, reduction of the data, or a totally different shape altogether. These operations do not return any value back to the driver program, and hence lazily evaluated, which is one of the main benefits of Spark. An example of a transformation would be a map function that will pass through each element of the RDD and return a totally new RDD representing the results of application of the function on the original dataset. Actions Actions are operations that return a value to the driver program. As previously discussed, all transformations in Spark are lazy, which essentially means that Spark remembers all the transformations carried out on an RDD, and applies them in the most optimal fashion when an action is called. For example, you might have a 1 TB dataset, which you pass through a set of map functions by applying various transformations. Finally, you apply the reduce action on the dataset. Apache Spark will return only a final dataset, which might be few MBs rather than the entire 1 TB dataset of mapped intermediate result. You should, however, remember to persist intermediate results otherwise Spark will recompute the entire RDD graph each time an Action is called. The persist() method on an RDD should help you avoid recomputation and saving intermediate results. We’ll look at this in more detail later. Let’s illustrate the work of transformations and actions by a simple example. In this specific example, we’ll be using flatmap() transformations and a count action. We’ll use the README.md file from the local filesystem as an example. We’ll give a line-by-line explanation of the Scala example, and then provide code for Python and Java. As always, you must try this example with your own piece of text and investigate the results: //Loading the README.md file val dataFile = sc.textFile(“README.md”) Now that the data has been loaded, we’ll need to run a transformation. 
Since we know that each line of the text is loaded as a separate element, we'll need to run a flatMap transformation and separate out individual words as separate elements, for which we'll use the split function with space as a delimiter:
//Separate out a list of words from individual RDD elements
val words = dataFile.flatMap(line => line.split(" "))
Remember that until this point, while you seem to have applied a transformation function, nothing has been executed and all the transformations have been added to the logical plan. Also note that the transformation function returns a new RDD. We can then call the count() action on the words RDD to perform the computation, which results in the data being fetched from the file to create the RDD before the specified transformation function is applied. You might note that we have actually passed a function to Spark:
//Count the words in the RDD
words.count()
Upon calling the count() action the RDD is evaluated, and the results are sent back to the driver program. This is very neat and especially useful for big data applications. If you are Python savvy, you may want to run the following code in PySpark. You should note that lambda functions are passed to the Spark framework:
# Loading the data file, applying a transformation and an action
dataFile = sc.textFile("README.md")
words = dataFile.flatMap(lambda line: line.split(" "))
words.count()
Programming the same functionality in Java is also quite straightforward and will look pretty similar to the program in Scala:
JavaRDD<String> lines = sc.textFile("README.md");
JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
long wordCount = words.count();
This might look like a simple program, but behind the scenes it is taking the line.split(" ") function and applying it to all the partitions in the cluster in parallel. The framework provides this simplicity and does all the background work of coordination to schedule it across the cluster and get the results back.
Passing functions to Spark (Scala)
As you have seen in the previous example, passing functions is a critical functionality provided by Spark. From a user's point of view, you pass the function in your driver program, and Spark figures out the location of the data partitions across the cluster memory and runs it in parallel. The exact syntax of passing functions differs by programming language. Since Spark has been written in Scala, we'll discuss Scala first. In Scala, the recommended ways to pass functions to the Spark framework are as follows:
Anonymous functions
Static singleton methods
Anonymous functions
Anonymous functions are used for short pieces of code. They are also referred to as lambda expressions, and are a cool and elegant feature of the programming language. They are called anonymous functions because they are not bound to a name; you can give any name to the input argument (or use the placeholder syntax) and the result would be the same. For example, the following code examples would produce the same output:
val words = dataFile.map(line => line.split(" "))
val words = dataFile.map(anyline => anyline.split(" "))
val words = dataFile.map(_.split(" "))
Figure 2-11: Passing anonymous functions to Spark in Scala
Static singleton functions
While anonymous functions are really helpful for short snippets of code, they are not very helpful when you want to request the framework for a complex data manipulation. Static singleton functions come to the rescue with their own nuances, which we will discuss in this section.
In software engineering, the Singleton pattern is a design pattern that restricts instantiation of a class to one object. This is useful when exactly one object is needed to coordinate actions across the system. Static methods belong to the class and not an instance of it. They usually take input from the parameters, perform actions on it, and return the result. Figure 2-12: Passing static singleton functions to Spark in Scala Static singleton is the preferred way to pass functions, as technically you can create a class and call a method in the class instance. For example: class UtilFunctions{ def split(inputParam: String): Array[String] = {inputParam.split(“ “)} def operate(rdd: RDD[String]): RDD[String] ={ rdd.map(split)} } You can send a method in a class, but that has performance implications as the entire object would be sent along the method. Passing functions to Spark (Java) In Java, to create a function you will have to implement the interfaces available in the org.apache.spark.api.java function package. There are two popular ways to create such functions: Implement the interface in your own class, and pass the instance to Spark. Starting Java 8, you can use lambda expressions to pass off the functions to the Spark framework. Let’s reimplement the preceding word count examples in Java: Figure 2-13: Code example of Java implementation of word count (inline functions) If you belong to a group of programmers who feel that writing inline functions makes the code complex and unreadable (a lot of people do agree to that assertion), you may want to create separate functions and call them as follows: Figure 2-14: Code example of Java implementation of word count Passing functions to Spark (Python) Python provides a simple way to pass functions to Spark. The Spark programming guide available at spark.apache.org suggests there are three recommended ways to do this: Lambda expressions: The ideal way for short functions that can be written inside a single expression Local defs inside the function calling into Spark for longer code Top-level functions in a module While we have already looked at the lambda functions in some of the previous examples, let’s look at local definitions of the functions. Our example stays the same, which is we are trying to count the total number of words in a text file in Spark: def splitter(lineOfText): words = lineOfText.split(" ") return len(words) def aggregate(numWordsLine1, numWordsLineNext): totalWords = numWordsLine1 + numWordsLineNext return totalWords Let’s see the working code example: Figure 2-15: Code example of Python word count (local definition of functions) Here’s another way to implement this by defining the functions as a part of a UtilFunctions class, and referencing them within your map and reduce functions: Figure 2-16: Code example of Python word count (Utility class) You may want to be a bit cheeky here and try to add a countWords() to the UtilFunctions, so that it takes an RDD as input, and returns the total number of words. This method has potential performance implications as the whole object will need to be sent to the cluster. Let’s see how this can be implemented and the results in the following screenshot: Figure 2-17: Code example of Python word count (Utility class - 2) This can be avoided by making a copy of the referenced data field in a local object, rather than accessing it externally. 
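Since the working examples above appear as figures, here is a minimal PySpark sketch of the local-definition approach, wiring the splitter() and aggregate() functions into map() and reduce(). It assumes an existing SparkContext named sc and the README.md file used in the earlier examples, so the details may differ from the book's own listings.
# A sketch only - assumes a SparkContext named sc and a local README.md file
def splitter(lineOfText):
    words = lineOfText.split(" ")
    return len(words)

def aggregate(numWordsLine1, numWordsLineNext):
    totalWords = numWordsLine1 + numWordsLineNext
    return totalWords

dataFile = sc.textFile("README.md")
# map() applies splitter() to every line, producing an RDD of per-line word counts
countsPerLine = dataFile.map(splitter)
# reduce() is an action: it combines the per-line counts into a single total
totalWords = countsPerLine.reduce(aggregate)
print(totalWords)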
Now that we have had a look at how to pass functions to Spark, and have already looked at some of the transformations and actions in the previous examples, including map, flatMap, and reduce, let’s look at the most common transformations and actions used in Spark. The list is not exhaustive, and you can find more examples in the Apache Spark documentation in the programming guide. If you would like to get a comprehensive list of all the available functions, you might want to check the following API docs:   RDD PairRDD Scala http://bit.ly/2bfyoTo http://bit.ly/2bfzgah Python http://bit.ly/2bfyURl N/A Java http://bit.ly/2bfyRov http://bit.ly/2bfyOsH R http://bit.ly/2bfyrOZ N/A Table 2.1 – RDD and PairRDD API references Transformations The following table shows the most common transformations: map(func) coalesce(numPartitions) filter(func) repartition(numPartitions) flatMap(func) repartitionAndSortWithinPartitions(partitioner) mapPartitions(func) join(otherDataset, [numTasks]) mapPartitionsWithIndex(func) cogroup(otherDataset, [numTasks]) sample(withReplacement, fraction, seed) cartesian(otherDataset) Map(func) The map transformation is the most commonly used and the simplest of transformations on an RDD. The map transformation applies the function passed in the arguments to each of the elements of the source RDD. In the previous examples, we have seen the usage of map() transformation where we have passed the split() function to the input RDD. Figure 2-18: Operation of a map() function We’ll not give examples of map() functions as we have already seen plenty of examples of map functions previously. Filter (func) Filter, as the name implies, filters the input RDD, and creates a new dataset that satisfies the predicate passed as arguments. Example 2-1: Scala filtering example: val dataFile = sc.textFile(“README.md”) val linesWithApache = dataFile.filter(line => line.contains(“Apache”)) Example 2-2: Python filtering example: dataFile = sc.textFile(“README.md”) linesWithApache = dataFile.filter(lambda line: “Apache” in line) Example 2-3: Java filtering example: JavaRDD<String> dataFile = sc.textFile(“README.md”) JavaRDD<String> linesWithApache = dataFile.filter(line -> line.contains(“Apache”)); flatMap(func) The flatMap transformation is similar to map, but it offers a bit more flexibility. From the perspective of similarity to a map function, it operates on all the elements of the RDD, but the flexibility stems from its ability to handle functions that return a sequence rather than a single item. As you saw in the preceding examples, we had used flatMap to flatten the result of the split(“”) function, which returns a flattened structure rather than an RDD of string arrays. Figure 2-19: Operational details of the flatMap() transformation Let’s look at the flatMap example in Scala. Example 2-4: The flatmap() example in Scala: val favMovies = sc.parallelize(List("Pulp Fiction","Requiem for a dream","A clockwork Orange")); movies.flatMap(movieTitle=>movieTitle.split(" ")).collect() A flatMap in Python API would produce similar results. Example 2-5: The flatmap() example in Python: movies = sc.parallelize(["Pulp Fiction","Requiem for a dream","A clockwork Orange"]) movies.flatMap(lambda movieTitle: movieTitle.split(" ")).collect() The flatMap example in Java is a bit long-winded, but it essentially produces the same results. 
Example 2-6: The flatmap() example in Java: JavaRDD<String> movies = sc.parallelize (Arrays.asList("Pulp Fiction","Requiem for a dream" ,"A clockwork Orange") ); JavaRDD<String> movieName = movies.flatMap( new FlatMapFunction<String,String>(){ public Iterator<String> call(String movie){ return Arrays.asList(movie.split(" ")) .iterator(); } } ); Sample(withReplacement, fraction, seed) Sampling is an important component of any data analysis and it can have a significant impact on the quality of your results/findings. Spark provides an easy way to sample RDD’s for your calculations, if you would prefer to quickly test your hypothesis on a subset of data before running it on a full dataset. But here is a quick overview of the parameters that are passed onto the method: withReplacement: Is a Boolean (True/False), and it indicates if elements can be sampled multiple times (replaced when sampled out). Sampling with replacement means that the two sample values are independent. In practical terms this means that if we draw two samples with replacement, what we get on the first one doesn’t affect what we get on the second draw, and hence the covariance between the two samples is zero. If we are sampling without replacement, the two samples aren’t independent. Practically this means what we got on the first draw affects what we get on the second one and hence the covariance between the two isn’t zero. fraction: Fraction indicates the expected size of the sample as a fraction of the RDD’s size. The fraction must be between 0 and 1. For example, if you want to draw a 5% sample, you can choose 0.05 as a fraction. seed: The seed used for the random number generator. Let’s look at the sampling example in Scala. Example 2-7: The sample() example in Scala: val data = sc.parallelize( List(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20)); data.sample(true,0.1,12345).collect() The sampling example in Python looks similar to the one in Scala. Example 2-8: The sample() example in Python: data = sc.parallelize( [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]) data.sample(1,0.1,12345).collect() In Java, our sampling example returns an RDD of integers. Example 2-9: The sample() example in Java: JavaRDD<Integer> nums = sc.parallelize(Arrays.asList( 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20)); nums.sample(true,0.1,12345).collect(); References https://spark.apache.org/docs/latest/programming-guide.html http://www.purplemath.com/modules/numbprop.htm Summary We have gone through the concept of creating an RDD, to manipulating data within the RDD. We’ve looked at the transformations and actions available to an RDD, and walked you through various code examples to explain the differences between transformations and actions Resources for Article: Further resources on this subject: Getting Started with Apache Spark [article] Getting Started with Apache Spark DataFrames [article] Sabermetrics with Apache Spark [article]


Review of SQL Server Features for Developers

Packt
13 Feb 2017
43 min read
In this article by Dejan Sarka, Miloš Radivojević, and William Durkin, the authors of the book, SQL Server 2016 Developer's Guide explains that before dwelling into the new features in SQL Server 2016, let's make a quick recapitulation of the SQL Server features for developers available already in the previous versions of SQL Server. Recapitulating the most important features with help you remember what you already have in your development toolbox and also understanding the need and the benefits of the new or improved features in SQL Server 2016. The recapitulation starts with the mighty T-SQL SELECT statement. Besides the basic clauses, advanced techniques like window functions, common table expressions, and APPLY operator are explained. Then you will pass quickly through creating and altering database objects, including tables and programmable objects, like triggers, views, user-defined functions, and stored procedures. You will also review the data modification language statements. Of course, errors might appear, so you have to know how to handle them. In addition, data integrity rules might require that two or more statements are executed as an atomic, indivisible block. You can achieve this with help of transactions. The last section of this article deals with parts of SQL Server Database Engine marketed with a common name "Beyond Relational". This is nothing beyond the Relational Model, the "beyond relational" is really just a marketing term. Nevertheless, you will review the following: How SQL Server supports spatial data How you can enhance the T-SQL language with Common Language Runtime (CLR) elements written is some .NET language like Visual C# How SQL Server supports XML data The code in this article uses the WideWorldImportersDW demo database. In order to test the code, this database must be present in your SQL Server instance you are using for testing, and you must also have SQL Server Management Studio (SSMS) as the client tool. This article will cover the following points: Core Transact-SQL SELECT statement elements Advanced SELECT techniques Error handling Using transactions Spatial data XML support in SQL Server (For more resources related to this topic, see here.) The Mighty Transact-SQL SELECT You probably already know that the most important SQL statement is the mighty SELECT statement you use to retrieve data from your databases. Every database developer knows the basic clauses and their usage: SELECT to define the columns returned, or a projection of all table columns FROM to list the tables used in the query and how they are associated, or joined WHERE to filter the data to return only the rows that satisfy the condition in the predicate GROUP BY to define the groups over which the data is aggregated HAVING to filter the data after the grouping with conditions that refer to aggregations ORDER BY to sort the rows returned to the client application Besides these basic clauses, SELECT offers a variety of advanced possibilities as well. These advanced techniques are unfortunately less exploited by developers, although they are really powerful and efficient. Therefore, I urge you to review them and potentially use them in your applications. 
The advanced query techniques presented here include: Queries inside queries, or shortly subqueries Window functions TOP and OFFSET...FETCH expressions APPLY operator Common tables expressions, or CTEs Core Transact-SQL SELECT Statement Elements Let us start with the most simple concept of SQL which every Tom, Dick, and Harry is aware of! The simplest query to retrieve the data you can write includes the SELECT and the FROM clauses. In the select clause, you can use the star character, literally SELECT *, to denote that you need all columns from a table in the result set. The following code switches to the WideWorldImportersDW database context and selects all data from the Dimension.Customer table. USE WideWorldImportersDW; SELECT * FROM Dimension.Customer; The code returns 403 rows, all customers with all columns. Using SELECT * is not recommended in production. Such queries can return an unexpected result when the table structure changes, and is also not suitable for good optimization. Better than using SELECT * is to explicitly list only the columns you need. This means you are returning only a projection on the table. The following example selects only four columns from the table. SELECT [Customer Key], [WWI Customer ID], [Customer], [Buying Group] FROM Dimension.Customer; Below is the shortened result, limited to the first three rows only. Customer Key WWI Customer ID Customer Buying Group ------------ --------------- ----------------------------- ------------- 0 0 Unknown N/A 1 1 Tailspin Toys (Head Office) Tailspin Toys 2 2 Tailspin Toys (Sylvanite, MT) Tailspin Toys You can see that the column names in the WideWorldImportersDW database include spaces. Names that include spaces are called delimited identifiers. In order to make SQL Server properly understand them as column names, you must enclose delimited identifiers in square parentheses. However, if you prefer to have names without spaces, or is you use computed expressions in the column list, you can add column aliases. The following query returns completely the same data as the previous one, just with columns renamed by aliases to avoid delimited names. SELECT [Customer Key] AS CustomerKey, [WWI Customer ID] AS CustomerId, [Customer], [Buying Group] AS BuyingGroup FROM Dimension.Customer; You might have noticed in the result set returned from the last two queries that there is also a row in the table for an unknown customer. You can filter this row with the WHERE clause. SELECT [Customer Key] AS CustomerKey, [WWI Customer ID] AS CustomerId, [Customer], [Buying Group] AS BuyingGroup FROM Dimension.Customer WHERE [Customer Key] <> 0; In a relational database, you typically have data spread in multiple tables. Each table represents a set of entities of the same kind, like customers in the examples you have seen so far. In order to get result sets meaningful for the business your database supports, you most of the time need to retrieve data from multiple tables in the same query. You need to join two or more tables based on some conditions. The most frequent kind of a join is the inner join. Rows returned are those for which the condition in the join predicate for the two tables joined evaluates to true. Note that in a relational database, you have three-valued logic, because there is always a possibility that a piece of data is unknown. You mark the unknown with the NULL keyword. A predicate can thus evaluate to true, false or NULL. For an inner join, the order of the tables involved in the join is not important. 
In the following example, you can see the Fact.Sale table joined with an inner join to the Dimension.Customer table. SELECT c.[Customer Key] AS CustomerKey, c.[WWI Customer ID] AS CustomerId, c.[Customer], c.[Buying Group] AS BuyingGroup, f.Quantity, f.[Total Excluding Tax] AS Amount, f.Profit FROM Fact.Sale AS f INNER JOIN Dimension.Customer AS c ON f.[Customer Key] = c.[Customer Key]; In the query, you can see that table aliases are used. If a column's name is unique across all tables in the query, then you can use it without table name. If not, you need to use table name in front of the column, to avoid ambiguous column names, in the format table.column. In the previous query, the [Customer Key] column appears in both tables. Therefore, you need to precede this column name with the table name of its origin to avoid ambiguity. You can shorten the two-part column names by using table aliases. You specify table aliases in the FROM clause. Once you specify table aliases, you must always use the aliases; you can't refer to the original table names in that query anymore. Please note that a column name might be unique in the query at the moment when you write the query. However, later somebody could add a column with the same name in another table involved in the query. If the column name is not preceded by an alias or by the table name, you would get an error when executing the query because of the ambiguous column name. In order to make the code more stable and more readable, you should always use table aliases for each column in the query. The previous query returns 228,265 rows. It is always recommendable to know at least approximately the number of rows your query should return. This number is the first control of the correctness of the result set, or said differently, whether the query is written logically correct. The query returns the unknown customer and the orders associated for this customer, of more precisely said associated to this placeholder for an unknown customer. Of course, you can use the WHERE clause to filter the rows in a query that joins multiple tables, like you use it for a single table query. The following query filters the unknown customer rows. SELECT c.[Customer Key] AS CustomerKey, c.[WWI Customer ID] AS CustomerId, c.[Customer], c.[Buying Group] AS BuyingGroup, f.Quantity, f.[Total Excluding Tax] AS Amount, f.Profit FROM Fact.Sale AS f INNER JOIN Dimension.Customer AS c ON f.[Customer Key] = c.[Customer Key] WHERE c.[Customer Key] <> 0; The query returns 143,968 rows. You can see that a lot of sales is associated with the unknown customer. Of course, the Fact.Sale table cannot be joined to the Dimension.Customer table. The following query joins it to the Dimension.Date table. Again, the join performed is an inner join. SELECT d.Date, f.[Total Excluding Tax], f.[Delivery Date Key] FROM Fact.Sale AS f INNER JOIN Dimension.Date AS d ON f.[Delivery Date Key] = d.Date; The query returns 227,981 rows. The query that joined the Fact.Sale table to the Dimension.Customer table returned 228,265 rows. It looks like not all Fact.Sale table rows have a known delivery date, not all rows can match the Dimension.Date table rows. You can use an outer join to check this. With an outer join, you preserve the rows from one or both tables, even if they don't have a match in the other table. The result set returned includes all of the matched rows like you get from an inner join plus the preserved rows. 
Within an outer join, the order of the tables involved in the join might be important. If you use LEFT OUTER JOIN, then the rows from the left table are preserved. If you use RIGHT OUTER JOIN, then the rows from the right table are preserved. Of course, in both cases, the order of the tables involved in the join is important. With a FULL OUTER JOIN, you preserve the rows from both tables, and the order of the tables is not important. The following query preserves the rows from the Fact.Sale table, which is on the left side of the join to the Dimension.Date table. In addition, the query sorts the result set by the invoice date descending using the ORDER BY clause. SELECT d.Date, f.[Total Excluding Tax], f.[Delivery Date Key], f.[Invoice Date Key] FROM Fact.Sale AS f LEFT OUTER JOIN Dimension.Date AS d ON f.[Delivery Date Key] = d.Date ORDER BY f.[Invoice Date Key] DESC; The query returns 228,265 rows. Here is the partial result of the query. Date Total Excluding Tax Delivery Date Key Invoice Date Key ---------- -------------------- ----------------- ---------------- NULL 180.00 NULL 2016-05-31 NULL 120.00 NULL 2016-05-31 NULL 160.00 NULL 2016-05-31 … … … … 2016-05-31 2565.00 2016-05-31 2016-05-30 2016-05-31 88.80 2016-05-31 2016-05-30 2016-05-31 50.00 2016-05-31 2016-05-30 For the last invoice date (2016-05-31), the delivery date is NULL. The NULL in the Date column form the Dimension.Date table is there because the data from this table is unknown for the rows with an unknown delivery date in the Fact.Sale table. Joining more than two tables is not tricky if all of the joins are inner joins. The order of joins is not important. However, you might want to execute an outer join after all of the inner joins. If you don't control the join order with the outer joins, it might happen that a subsequent inner join filters out the preserved rows if an outer join. You can control the join order with parenthesis. The following query joins the Fact.Sale table with an inner join to the Dimension.Customer, Dimension.City, Dimension.[Stock Item], and Dimension.Employee tables, and with an left outer join to the Dimension.Date table. SELECT cu.[Customer Key] AS CustomerKey, cu.Customer, ci.[City Key] AS CityKey, ci.City, ci.[State Province] AS StateProvince, ci.[Sales Territory] AS SalesTeritory, d.Date, d.[Calendar Month Label] AS CalendarMonth, d.[Calendar Year] AS CalendarYear, s.[Stock Item Key] AS StockItemKey, s.[Stock Item] AS Product, s.Color, e.[Employee Key] AS EmployeeKey, e.Employee, f.Quantity, f.[Total Excluding Tax] AS TotalAmount, f.Profit FROM (Fact.Sale AS f INNER JOIN Dimension.Customer AS cu ON f.[Customer Key] = cu.[Customer Key] INNER JOIN Dimension.City AS ci ON f.[City Key] = ci.[City Key] INNER JOIN Dimension.[Stock Item] AS s ON f.[Stock Item Key] = s.[Stock Item Key] INNER JOIN Dimension.Employee AS e ON f.[Salesperson Key] = e.[Employee Key]) LEFT OUTER JOIN Dimension.Date AS d ON f.[Delivery Date Key] = d.Date; The query returns 228,265 rows. Note that with the usage of the parenthesis the order of joins is defined in the following way: Perform all inner joins, with an arbitrary order among them Execute the left outer join after all of the inner joins So far, I have tacitly assumed that the Fact.Sale table has 228,265 rows, and that the previous query needed only one outer join of the Fact.Sale table with the Dimension.Date to return all of the rows. It would be good to check this number in advance. 
You can check the number of rows by aggregating them using the COUNT(*) aggregate function. The following query introduces that function.
SELECT COUNT(*) AS SalesCount FROM Fact.Sale;
Now you can be sure that the Fact.Sale table has exactly 228,265 rows. Many times you need to aggregate data in groups. This is the point where the GROUP BY clause becomes handy. The following query aggregates the sales data for each customer.
SELECT c.Customer, SUM(f.Quantity) AS TotalQuantity, SUM(f.[Total Excluding Tax]) AS TotalAmount, COUNT(*) AS InvoiceLinesCount FROM Fact.Sale AS f INNER JOIN Dimension.Customer AS c ON f.[Customer Key] = c.[Customer Key] WHERE c.[Customer Key] <> 0 GROUP BY c.Customer;
The query returns 402 rows, one for each known customer. In the SELECT clause, you can have only the columns used for grouping, or aggregated columns. You need to get a scalar, a single aggregated value, for each column not included in the GROUP BY list. Sometimes you need to filter aggregated data. For example, you might need to find only frequent customers, defined as customers with more than 400 rows in the Fact.Sale table. You can filter the result set on the aggregated data by using the HAVING clause, as the following query shows.
SELECT c.Customer, SUM(f.Quantity) AS TotalQuantity, SUM(f.[Total Excluding Tax]) AS TotalAmount, COUNT(*) AS InvoiceLinesCount FROM Fact.Sale AS f INNER JOIN Dimension.Customer AS c ON f.[Customer Key] = c.[Customer Key] WHERE c.[Customer Key] <> 0 GROUP BY c.Customer HAVING COUNT(*) > 400;
The query returns 45 rows for the 45 most frequent known customers. Note that you can't use column aliases from the SELECT clause in any of the other clauses introduced in the previous query. The SELECT clause logically executes after all other clauses of the query, so the aliases are not known yet. However, the ORDER BY clause executes after the SELECT clause, and therefore the column aliases are already known and you can refer to them. The following query shows all of the basic SELECT statement clauses used together to aggregate the sales data over the known customers, filter the data to include the frequent customers only, and sort the result set descending by the number of rows of each customer in the Fact.Sale table.
SELECT c.Customer, SUM(f.Quantity) AS TotalQuantity, SUM(f.[Total Excluding Tax]) AS TotalAmount, COUNT(*) AS InvoiceLinesCount FROM Fact.Sale AS f INNER JOIN Dimension.Customer AS c ON f.[Customer Key] = c.[Customer Key] WHERE c.[Customer Key] <> 0 GROUP BY c.Customer HAVING COUNT(*) > 400 ORDER BY InvoiceLinesCount DESC;
The query returns 45 rows. Below is the shortened result set.
Customer                              TotalQuantity TotalAmount  InvoiceLinesCount
------------------------------------- ------------- ------------ -----------------
Tailspin Toys (Vidrine, LA)           18899         340163.80    455
Tailspin Toys (North Crows Nest, IN)  17684         313999.50    443
Tailspin Toys (Tolna, ND)             16240         294759.10    443
Advanced SELECT Techniques
Aggregating data over the complete input rowset or aggregating in groups produces aggregated rows only – either one row for the whole input rowset or one row per group. Sometimes you need to return aggregates together with the detail data. One way to achieve this is by using subqueries, queries inside queries. The following query shows an example of using two subqueries in a single query. In the SELECT clause, there is a subquery that calculates the sum of quantity for each customer. It returns a scalar value. The subquery refers to the customer key from the outer query.
The subquery can't execute without the outer query; this is a correlated subquery. There is another subquery in the FROM clause that calculates the overall quantity for all customers. This query returns a table, although it is a table with a single row and a single column. It is a self-contained subquery, independent of the outer query. A subquery in the FROM clause is also called a derived table.

Another type of join is used to add the overall total to each detail row. A cross join is a Cartesian product of two input rowsets—each row from one side is associated with every single row from the other side. No join condition is needed. A cross join can produce an unwanted, huge result set. For example, if you cross join just 1,000 rows from the left side of the join with 1,000 rows from the right side, you get 1,000,000 rows in the output. Therefore, you typically want to avoid cross joins in production. However, in the example in the following query, 143,968 rows from the left side are cross joined to a single row from the subquery, producing only 143,968 rows. Effectively, this means that the overall total column is added to each detail row.

SELECT c.Customer, f.Quantity,
  (SELECT SUM(f1.Quantity)
   FROM Fact.Sale AS f1
   WHERE f1.[Customer Key] = c.[Customer Key]) AS TotalCustomerQuantity,
  f2.TotalQuantity
FROM (Fact.Sale AS f
  INNER JOIN Dimension.Customer AS c
    ON f.[Customer Key] = c.[Customer Key])
  CROSS JOIN
    (SELECT SUM(f2.Quantity)
     FROM Fact.Sale AS f2
     WHERE f2.[Customer Key] <> 0) AS f2(TotalQuantity)
WHERE c.[Customer Key] <> 0
ORDER BY c.Customer, f.Quantity DESC;

Here is an abbreviated output of the query.

Customer                     Quantity    TotalCustomerQuantity TotalQuantity
---------------------------- ----------- --------------------- -------------
Tailspin Toys (Absecon, NJ)  360         12415                 5667611
Tailspin Toys (Absecon, NJ)  324         12415                 5667611
Tailspin Toys (Absecon, NJ)  288         12415                 5667611

In the previous example, the correlated subquery in the SELECT clause has to logically execute once per row of the outer query. The query was partially optimized by moving the self-contained subquery for the overall total to the FROM clause, where it logically executes only once. Although SQL Server can often optimize correlated subqueries and convert them to joins, there is also a much better and more efficient way to achieve the same result as the previous query returned. You can do this by using window functions.

The following query uses the window aggregate function SUM to calculate the total over each customer and the overall total. The OVER clause defines the partitions, or the windows, of the calculation. The first calculation is partitioned over each customer, meaning that the total quantity per customer is reset to zero for each new customer. The second calculation uses an OVER clause without specifying partitions, meaning the calculation is done over the whole input rowset. This query produces exactly the same result as the previous one.

SELECT c.Customer, f.Quantity,
  SUM(f.Quantity) OVER(PARTITION BY c.Customer) AS TotalCustomerQuantity,
  SUM(f.Quantity) OVER() AS TotalQuantity
FROM Fact.Sale AS f
  INNER JOIN Dimension.Customer AS c
    ON f.[Customer Key] = c.[Customer Key]
WHERE c.[Customer Key] <> 0
ORDER BY c.Customer, f.Quantity DESC;

You can use many other functions for window calculations. For example, you can use the ranking functions, like ROW_NUMBER(), to calculate some rank in the window or in the overall rowset. However, a rank can be defined only over some order of the calculation.
You can specify the order of the calculation in the ORDER BY sub-clause inside the OVER clause. Please note that this ORDER BY clause defines only the logical order of the calculation, and not the order of the rows returned. A stand-alone, outer ORDER BY at the end of the query defines the order of the result.

The following query calculates a sequential number, the row number of each row in the output, for each detail row of the input rowset. The row number is calculated once in partitions for each customer and once over the whole input rowset. The logical order of the calculation is over quantity descending, meaning that row number 1 gets the largest quantity, either the largest for each customer or the largest in the whole input rowset.

SELECT c.Customer, f.Quantity,
  ROW_NUMBER() OVER(PARTITION BY c.Customer
                    ORDER BY f.Quantity DESC) AS CustomerOrderPosition,
  ROW_NUMBER() OVER(ORDER BY f.Quantity DESC) AS TotalOrderPosition
FROM Fact.Sale AS f
  INNER JOIN Dimension.Customer AS c
    ON f.[Customer Key] = c.[Customer Key]
WHERE c.[Customer Key] <> 0
ORDER BY c.Customer, f.Quantity DESC;

The query produces the following result, again abbreviated to a couple of rows only.

Customer                      Quantity    CustomerOrderPosition TotalOrderPosition
----------------------------- ----------- --------------------- --------------------
Tailspin Toys (Absecon, NJ)   360         1                     129
Tailspin Toys (Absecon, NJ)   324         2                     162
Tailspin Toys (Absecon, NJ)   288         3                     374
…                             …           …                     …
Tailspin Toys (Aceitunas, PR) 288         1                     392
Tailspin Toys (Aceitunas, PR) 250         4                     1331
Tailspin Toys (Aceitunas, PR) 250         3                     1315
Tailspin Toys (Aceitunas, PR) 250         2                     1313
Tailspin Toys (Aceitunas, PR) 240         5                     1478

Note the position, or the row number, for the second customer. The order does not seem to be completely correct – it is 1, 4, 3, 2, 5, and not 1, 2, 3, 4, 5, as you might expect. This is due to the repeating value for the second largest quantity, the quantity 250. The quantity is not unique, and thus the order is not deterministic. The order of the result is defined over the quantity, and not over the row number. You can't know in advance which row will get which row number when the order of the calculation is not defined on unique values. Please also note that you might get a different order when you execute the same query on your SQL Server instance.

Window functions are useful for some advanced calculations, like running totals and moving averages, as well. However, the calculation of these values can't be performed over the complete partition. You can additionally frame the calculation to a subset of rows of each partition only.

The following query calculates the running total of the quantity per customer (the column alias Q_RT in the query), ordered by the sale key and framed differently for each row. The frame is defined from the first row in the partition to the current row. Therefore, the running total is calculated over one row for the first row, over two rows for the second row, and so on. Additionally, the query calculates the moving average of the quantity (the column alias Q_MA in the query) over the last three rows.
SELECT c.Customer,
  f.[Sale Key] AS SaleKey,
  f.Quantity,
  SUM(f.Quantity)
    OVER(PARTITION BY c.Customer
         ORDER BY [Sale Key]
         ROWS BETWEEN UNBOUNDED PRECEDING
                  AND CURRENT ROW) AS Q_RT,
  AVG(f.Quantity)
    OVER(PARTITION BY c.Customer
         ORDER BY [Sale Key]
         ROWS BETWEEN 2 PRECEDING
                  AND CURRENT ROW) AS Q_MA
FROM Fact.Sale AS f
  INNER JOIN Dimension.Customer AS c
    ON f.[Customer Key] = c.[Customer Key]
WHERE c.[Customer Key] <> 0
ORDER BY c.Customer, f.[Sale Key];

The query returns the following (abbreviated) result.

Customer                     SaleKey  Quantity    Q_RT        Q_MA
---------------------------- -------- ----------- ----------- -----------
Tailspin Toys (Absecon, NJ)  2869     216         216         216
Tailspin Toys (Absecon, NJ)  2870     2           218         109
Tailspin Toys (Absecon, NJ)  2871     2           220         73

Let's find the top three orders by quantity for the Tailspin Toys (Aceitunas, PR) customer! You can do this by using the OFFSET…FETCH clause after the ORDER BY clause, as the following query shows.

SELECT c.Customer,
  f.[Sale Key] AS SaleKey,
  f.Quantity
FROM Fact.Sale AS f
  INNER JOIN Dimension.Customer AS c
    ON f.[Customer Key] = c.[Customer Key]
WHERE c.Customer = N'Tailspin Toys (Aceitunas, PR)'
ORDER BY f.Quantity DESC
OFFSET 0 ROWS FETCH NEXT 3 ROWS ONLY;

This is the complete result of the query.

Customer                       SaleKey  Quantity
------------------------------ -------- -----------
Tailspin Toys (Aceitunas, PR)  36964    288
Tailspin Toys (Aceitunas, PR)  126253   250
Tailspin Toys (Aceitunas, PR)  79272    250

But wait… Didn't the second largest quantity, the value 250, repeat three times? Which two rows were selected in the output? Again, because the calculation is done over a non-unique column, the result is somewhat nondeterministic. SQL Server offers another possibility, the TOP clause. You can specify TOP n WITH TIES, meaning you can get all of the rows with ties on the last value in the output. However, this way you don't know the number of rows in the output in advance. The following query shows this approach.

SELECT TOP 3 WITH TIES
  c.Customer,
  f.[Sale Key] AS SaleKey,
  f.Quantity
FROM Fact.Sale AS f
  INNER JOIN Dimension.Customer AS c
    ON f.[Customer Key] = c.[Customer Key]
WHERE c.Customer = N'Tailspin Toys (Aceitunas, PR)'
ORDER BY f.Quantity DESC;

This is the complete result of the previous query – this time it is four rows.

Customer                       SaleKey  Quantity
------------------------------ -------- -----------
Tailspin Toys (Aceitunas, PR)  36964    288
Tailspin Toys (Aceitunas, PR)  223106   250
Tailspin Toys (Aceitunas, PR)  126253   250
Tailspin Toys (Aceitunas, PR)  79272    250

The next task is to get the top three orders by quantity for each customer. You need to perform the calculation for each customer. The APPLY Transact-SQL operator comes in handy here. You use it in the FROM clause. You apply, or execute, a table expression defined on the right side of the operator once for each row of the input rowset from the left side of the operator. There are two flavors of this operator. The CROSS APPLY version filters out the rows from the left rowset if the tabular expression on the right side does not return any row. The OUTER APPLY version preserves the row from the left side, even if the tabular expression on the right side does not return any row, similarly to the LEFT OUTER JOIN. Of course, the columns for the preserved rows do not have known values from the right-side tabular expression. The following query uses the CROSS APPLY operator to calculate the top three orders by quantity for each customer that actually does have some orders.
SELECT c.Customer,
  t3.SaleKey, t3.Quantity
FROM Dimension.Customer AS c
  CROSS APPLY
    (SELECT TOP(3)
       f.[Sale Key] AS SaleKey,
       f.Quantity
     FROM Fact.Sale AS f
     WHERE f.[Customer Key] = c.[Customer Key]
     ORDER BY f.Quantity DESC) AS t3
WHERE c.[Customer Key] <> 0
ORDER BY c.Customer, t3.Quantity DESC;

Below is the result of this query, shortened to the first nine rows.

Customer                           SaleKey  Quantity
---------------------------------- -------- -----------
Tailspin Toys (Absecon, NJ)        5620     360
Tailspin Toys (Absecon, NJ)        114397   324
Tailspin Toys (Absecon, NJ)        82868    288
Tailspin Toys (Aceitunas, PR)      36964    288
Tailspin Toys (Aceitunas, PR)      126253   250
Tailspin Toys (Aceitunas, PR)      79272    250
Tailspin Toys (Airport Drive, MO)  43184    250
Tailspin Toys (Airport Drive, MO)  70842    240
Tailspin Toys (Airport Drive, MO)  630      225

For the final task in this section, assume that you need to calculate some statistics over the totals of customers' orders. You need to calculate the average total amount for all customers, the standard deviation of this total amount, and the average count of orders per customer. This means you need to calculate the totals over customers in advance, and then use the aggregate functions AVG() and STDEV() on these aggregates. You could do the aggregations over customers in advance in a derived table. However, there is another way to achieve this. You can define the derived table in advance, in the WITH clause of the SELECT statement. Such a subquery is called a common table expression, or CTE.

CTEs are more readable than derived tables, and might also be more efficient. You can use the result of the same CTE multiple times in the outer query. If you use derived tables, then you need to define them multiple times if you want to use them multiple times in the outer query. The following query shows the usage of a CTE to calculate the average total amount for all customers, the standard deviation of this total amount, and the average count of orders per customer.

WITH CustomerSalesCTE AS
(
  SELECT c.Customer,
    SUM(f.[Total Excluding Tax]) AS TotalAmount,
    COUNT(*) AS InvoiceLinesCount
  FROM Fact.Sale AS f
    INNER JOIN Dimension.Customer AS c
      ON f.[Customer Key] = c.[Customer Key]
  WHERE c.[Customer Key] <> 0
  GROUP BY c.Customer
)
SELECT ROUND(AVG(TotalAmount), 6) AS AvgAmountPerCustomer,
  ROUND(STDEV(TotalAmount), 6) AS StDevAmountPerCustomer,
  AVG(InvoiceLinesCount) AS AvgCountPerCustomer
FROM CustomerSalesCTE;

It returns the following result.

AvgAmountPerCustomer  StDevAmountPerCustomer AvgCountPerCustomer
--------------------- ---------------------- -------------------
270479.217661         38586.082621           358

Transactions and Error Handling

In a real-world application, errors always appear. There can be syntax or even logical errors in the code, the database design might be incorrect, and there might even be a bug in the database management system you are using. Even if everything works correctly, you might get an error because the users insert invalid data. With Transact-SQL error handling you can catch such user errors and decide how to respond to them. Typically, you want to log the errors, inform the users about the errors, and sometimes even correct them in the error handling code.

Error handling for user errors works on the statement level. If you send SQL Server a batch of two or more statements and the error is in the last statement, the previous statements execute successfully. This might not be what you desire.
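To see this statement-level behavior in isolation, here is a minimal, self-contained sketch. It does not use the dbo.InsertSimpleOrder and dbo.InsertSimpleOrderDetail procedures that appear later in this section; the demo table and its CHECK constraint are assumptions made purely for illustration.

-- Hypothetical table, used only to illustrate statement-level behavior
CREATE TABLE dbo.DemoBatch
(
  Id       INT NOT NULL PRIMARY KEY,
  Quantity INT NOT NULL CHECK (Quantity <> 0)  -- zero quantity is not allowed
);
GO
-- A batch of two statements: the second one violates the CHECK constraint
INSERT INTO dbo.DemoBatch (Id, Quantity) VALUES (1, 10);  -- succeeds
INSERT INTO dbo.DemoBatch (Id, Quantity) VALUES (2, 0);   -- fails with error 547
GO
-- Only the row inserted by the first statement is present
SELECT Id, Quantity
FROM dbo.DemoBatch;
GO
-- Clean up the demo table
DROP TABLE dbo.DemoBatch;
GO

The SELECT returns only the row with Id = 1: the failed INSERT terminated its own statement only, the rest of the batch still ran, and nothing was rolled back automatically.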
Many times you need to execute a batch of statements as a unit, and fail all of the statements if one of the statements fails. You can achieve this by using transactions. In this section, you will learn about:
- Error handling
- Transaction management

Error Handling

You can see that there is a need for error handling by producing an error. The following code tries to insert an order and a detail row for this order.

EXEC dbo.InsertSimpleOrder
 @OrderId = 6, @OrderDate = '20160706', @Customer = N'CustE';
EXEC dbo.InsertSimpleOrderDetail
 @OrderId = 6, @ProductId = 2, @Quantity = 0;

In SQL Server Management Studio, you can see that an error happened. You should get a message that error 547 occurred, saying that The INSERT statement conflicted with the CHECK constraint. If you remember, in order details only rows where the value for the quantity is not equal to zero are allowed. The error occurred in the second statement, in the call of the procedure that inserts an order detail. The procedure that inserted an order executed without an error. Therefore, an order with an id equal to six must be in the dbo.SimpleOrders table. The following code tries to insert order six again.

EXEC dbo.InsertSimpleOrder
 @OrderId = 6, @OrderDate = '20160706', @Customer = N'CustE';

Of course, another error occurred. This time it should be error 2627, a violation of the PRIMARY KEY constraint. The values of the OrderId column must be unique. Let's check the state of the data after these successful and unsuccessful inserts.

SELECT o.OrderId, o.OrderDate, o.Customer,
  od.ProductId, od.Quantity
FROM dbo.SimpleOrderDetails AS od
  RIGHT OUTER JOIN dbo.SimpleOrders AS o
    ON od.OrderId = o.OrderId
WHERE o.OrderId > 5
ORDER BY o.OrderId, od.ProductId;

The previous query checks only orders and their associated details where the order id value is greater than five. The query returns the following result set.

OrderId     OrderDate  Customer ProductId   Quantity
----------- ---------- -------- ----------- -----------
6           2016-07-06 CustE    NULL        NULL

You can see that only the first insert of the order with the id 6 succeeded. The second insert of an order with the same id and the insert of the detail row for order six did not succeed.

You start handling errors by enclosing the statements in the batch you are executing in the BEGIN TRY … END TRY block. You can catch the errors in the BEGIN CATCH … END CATCH block. The BEGIN CATCH statement must be immediately after the END TRY statement. The control of the execution is passed from the try part to the catch part immediately after the first error occurs. In the catch part, you can decide how to handle the errors. If you want to log the data about the error or inform an end user about the details of the error, the following functions might be very handy:
- ERROR_NUMBER() – this function returns the number of the error.
- ERROR_SEVERITY() – it returns the severity level. The severity of the error indicates the type of problem encountered. Severity levels 11 to 16 can be corrected by the user.
- ERROR_STATE() – this function returns the error state number. The error state gives more details about a specific error. You might want to use this number together with the error number to search the Microsoft Knowledge Base for the specific details of the error you encountered.
- ERROR_PROCEDURE() – it returns the name of the stored procedure or trigger where the error occurred, or NULL if the error did not occur within a stored procedure or trigger.
- ERROR_LINE() – it returns the line number at which the error occurred.
  This might be the line number in a routine if the error occurred within a stored procedure or trigger, or the line number in the batch.
- ERROR_MESSAGE() – this function returns the text of the error message.

The following code uses the TRY…CATCH block to handle possible errors in the batch of statements, and returns the information about the error using the above-mentioned functions. Note that the error happens in the first statement of the batch.

BEGIN TRY
 EXEC dbo.InsertSimpleOrder
  @OrderId = 6, @OrderDate = '20160706', @Customer = N'CustF';
 EXEC dbo.InsertSimpleOrderDetail
  @OrderId = 6, @ProductId = 2, @Quantity = 5;
END TRY
BEGIN CATCH
 SELECT ERROR_NUMBER() AS ErrorNumber,
   ERROR_MESSAGE() AS ErrorMessage,
   ERROR_LINE() AS ErrorLine;
END CATCH

There was a violation of the PRIMARY KEY constraint again, because the code tried to insert an order with id six again. The second statement would succeed if you executed it in its own batch, without error handling. However, because of the error handling, the control was passed to the catch block immediately after the error in the first statement, and the second statement never executed. You can check the data with the following query.

SELECT o.OrderId, o.OrderDate, o.Customer,
  od.ProductId, od.Quantity
FROM dbo.SimpleOrderDetails AS od
  RIGHT OUTER JOIN dbo.SimpleOrders AS o
    ON od.OrderId = o.OrderId
WHERE o.OrderId > 5
ORDER BY o.OrderId, od.ProductId;

The result set should be the same as the result set of the last check of the orders with an id greater than five – a single order without details. The following code produces an error in the second statement.

BEGIN TRY
 EXEC dbo.InsertSimpleOrder
  @OrderId = 7, @OrderDate = '20160706', @Customer = N'CustF';
 EXEC dbo.InsertSimpleOrderDetail
  @OrderId = 7, @ProductId = 2, @Quantity = 0;
END TRY
BEGIN CATCH
 SELECT ERROR_NUMBER() AS ErrorNumber,
   ERROR_MESSAGE() AS ErrorMessage,
   ERROR_LINE() AS ErrorLine;
END CATCH

You can see that the insert of the order detail violates the CHECK constraint for the quantity. If you check the data with the same query as the last two times again, you will see that there are orders with ids six and seven in the data, both without order details.

Using Transactions

Your business logic might require that the insert of the first statement fails when the second statement fails. You might need to roll back the changes of the first statement on the failure of the second statement. You can define that a batch of statements executes as a unit by using transactions. The following code shows how to use transactions. Again, the second statement in the batch in the try block is the one that produces an error.

BEGIN TRY
 BEGIN TRANSACTION
 EXEC dbo.InsertSimpleOrder
  @OrderId = 8, @OrderDate = '20160706', @Customer = N'CustG';
 EXEC dbo.InsertSimpleOrderDetail
  @OrderId = 8, @ProductId = 2, @Quantity = 0;
 COMMIT TRANSACTION
END TRY
BEGIN CATCH
 SELECT ERROR_NUMBER() AS ErrorNumber,
   ERROR_MESSAGE() AS ErrorMessage,
   ERROR_LINE() AS ErrorLine;
 IF XACT_STATE() <> 0
   ROLLBACK TRANSACTION;
END CATCH

You can check the data again.

SELECT o.OrderId, o.OrderDate, o.Customer,
  od.ProductId, od.Quantity
FROM dbo.SimpleOrderDetails AS od
  RIGHT OUTER JOIN dbo.SimpleOrders AS o
    ON od.OrderId = o.OrderId
WHERE o.OrderId > 5
ORDER BY o.OrderId, od.ProductId;

Here is the result of the check:

OrderId     OrderDate  Customer ProductId   Quantity
----------- ---------- -------- ----------- -----------
6           2016-07-06 CustE    NULL        NULL
7           2016-07-06 CustF    NULL        NULL

You can see that the order with id 8 does not exist in your data.
Because the insert of the detail row for this order failed, the insert of the order was rolled back as well. Note that in the catch block, the XACT_STATE() function was used to check whether the transaction still exists. If the transaction was rolled back automatically by SQL Server, then the ROLLBACK TRANSACTION statement would produce a new error.

The following code drops the objects (in the correct order, due to object constraints) created for the explanation of the DDL and DML statements, programmatic objects, error handling, and transactions.

DROP FUNCTION dbo.Top2OrderDetails;
DROP VIEW dbo.OrdersWithoutDetails;
DROP PROCEDURE dbo.InsertSimpleOrderDetail;
DROP PROCEDURE dbo.InsertSimpleOrder;
DROP TABLE dbo.SimpleOrderDetails;
DROP TABLE dbo.SimpleOrders;

Beyond Relational

The term "beyond relational" is actually only a marketing term. The relational model, used in relational database management systems, is in no way limited to specific data types or to specific languages only. However, with the term beyond relational, we typically mean specialized and complex data types that might include spatial and temporal data, XML or JSON data, and extending the capabilities of the Transact-SQL language with CLR languages like Visual C#, or statistical languages like R. SQL Server in versions before 2016 already supports some of the features mentioned. Here is a quick review of this support, which includes:
- Spatial data
- CLR support
- XML data

Defining Locations and Shapes with Spatial Data

In modern applications, you often want to show your data on a map, using the physical location. You might also want to show the shape of the objects that your data describes. You can use spatial data for tasks like these. You can represent the objects with points, lines, or polygons. From simple shapes you can create complex geometrical or geographical objects, for example cities and roads. Spatial data appears in many contemporary databases. Acquiring spatial data has become quite simple with the Global Positioning System (GPS) and other technologies. In addition, many software packages and database management systems help you work with spatial data. SQL Server supports two spatial data types, both implemented as .NET common language runtime (CLR) data types, from version 2008:
- The geometry type represents data in a Euclidean (flat) coordinate system.
- The geography type represents data in a round-earth coordinate system.

We need two different spatial data types because of some important differences between them. These differences include units of measurement and orientation.

In the planar, or flat-earth, system, you define the units of measurement. The length of a distance and the surface of an area are given in the same unit of measurement as you use for the coordinates of your coordinate system. You, as the database developer, know what the coordinates mean and what the unit of measure is. In geometry, the distance between the points described with the coordinates (1, 3) and (4, 7) is 5 units, regardless of the units used. You, as the database developer who created the database where you are storing this data, know the context. You know what these 5 units mean – whether it is 5 kilometers, or 5 inches.
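As a quick, minimal sketch of this planar calculation (the variable names are, of course, arbitrary), you can construct the two geometry points from their well-known text representation and measure the distance between them with the STDistance() method:

-- Planar (geometry) distance between the points (1, 3) and (4, 7)
DECLARE @p1 AS GEOMETRY = geometry::STGeomFromText('POINT (1 3)', 0);
DECLARE @p2 AS GEOMETRY = geometry::STGeomFromText('POINT (4 7)', 0);
-- Returns 5; whether that means 5 kilometers or 5 inches is context you define
SELECT @p1.STDistance(@p2) AS PlanarDistance;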
When talking about locations on Earth, coordinates are given in degrees of latitude and longitude. This is the round-earth, or ellipsoidal, system. Lengths and areas are usually measured in the metric system, in meters and square meters. However, the metric system is not used for spatial data everywhere in the world. The spatial reference identifier (SRID) of the geography instance defines the unit of measure. Therefore, whenever measuring some distance or area in the ellipsoidal system, you should always also quote the SRID used, which defines the units.

In the planar system, the ring orientation of a polygon is not an important factor. For example, a polygon described by the points ((0, 0), (10, 0), (0, 5), (0, 0)) is the same as a polygon described by ((0, 0), (5, 0), (0, 10), (0, 0)). You can always rotate the coordinates appropriately to get the same feeling of the orientation. However, in geography, the orientation is needed to completely describe a polygon. Just think of the equator, which divides the Earth into two hemispheres. Is your spatial data describing the northern or the southern hemisphere?

The Wide World Importers data warehouse includes the city location in the Dimension.City table. The following query retrieves it for cities in the main part of the USA.

SELECT City,
  [Sales Territory] AS SalesTerritory,
  Location AS LocationBinary,
  Location.ToString() AS LocationLongLat
FROM Dimension.City
WHERE [City Key] <> 0
  AND [Sales Territory] NOT IN
      (N'External', N'Far West');

Here is the partial result of the query.

City         SalesTerritory  LocationBinary       LocationLongLat
------------ --------------- -------------------- -------------------------------
Carrollton   Mideast         0xE6100000010C70...  POINT (-78.651695 42.1083969)
Carrollton   Southeast       0xE6100000010C88...  POINT (-76.5605078 36.9468152)
Carrollton   Great Lakes     0xE6100000010CDB...  POINT (-90.4070632 39.3022693)

You can see that the location is actually stored as a binary string. When you use the ToString() method of the location, you get the default string representation of the geographical point, which is the degrees of longitude and latitude.

In SSMS, if you send the results of the previous query to a grid, you also get an additional representation for the spatial data in the results pane. Click the Spatial results tab, and you can see the points represented in the longitude–latitude coordinate system, as shown in the following figure.

Figure 2-1: Spatial results showing customers' locations

If you executed the query, you might have noticed that the spatial data representation control in SSMS has some limitations. It can show only 5,000 objects, so the result displays only the first 5,000 locations. Nevertheless, as you can see from the previous figure, this is enough to realize that these points form a contour of the main part of the USA. Therefore, the points represent the locations of customers from the USA.

The following query gives you the details, like location and population, for Denver, Colorado.

SELECT [City Key] AS CityKey, City,
  [State Province] AS State,
  [Latest Recorded Population] AS Population,
  Location.ToString() AS LocationLongLat
FROM Dimension.City
WHERE [City Key] = 114129
  AND [Valid To] = '9999-12-31 23:59:59.9999999';

Spatial data types have many useful methods. For example, the STDistance() method returns the shortest line between two geography instances. This is a close approximation of the geodesic distance, defined as the shortest route between two points on the Earth's surface. The following code calculates this distance between Denver, Colorado, and Seattle, Washington.
DECLARE @g AS GEOGRAPHY;
DECLARE @h AS GEOGRAPHY;
DECLARE @unit AS NVARCHAR(50);
SET @g = (SELECT Location FROM Dimension.City
          WHERE [City Key] = 114129);
SET @h = (SELECT Location FROM Dimension.City
          WHERE [City Key] = 108657);
SET @unit = (SELECT unit_of_measure
             FROM sys.spatial_reference_systems
             WHERE spatial_reference_id = @g.STSrid);
SELECT FORMAT(@g.STDistance(@h), 'N', 'en-us') AS Distance,
  @unit AS Unit;

The result of the previous batch is below.

Distance      Unit
------------- ------
1,643,936.69  metre

Note that the code uses the sys.spatial_reference_systems catalog view to get the unit of measure for the SRID used to store the geographical instances of data. The unit is the meter. You can see that the distance between Denver, Colorado, and Seattle, Washington, is more than 1,600 kilometers.

The following query finds the major cities within a circle of 1,000 km around Denver, Colorado. Major cities are defined as cities with a population larger than 200,000.

DECLARE @g AS GEOGRAPHY;
SET @g = (SELECT Location FROM Dimension.City
          WHERE [City Key] = 114129);
SELECT DISTINCT City,
  [State Province] AS State,
  FORMAT([Latest Recorded Population], '000,000') AS Population,
  FORMAT(@g.STDistance(Location), '000,000.00') AS Distance
FROM Dimension.City
WHERE Location.STIntersects(@g.STBuffer(1000000)) = 1
  AND [Latest Recorded Population] > 200000
  AND [City Key] <> 114129
  AND [Valid To] = '9999-12-31 23:59:59.9999999'
ORDER BY Distance;

Here is the result, abbreviated to the twelve cities closest to Denver, Colorado.

City              State       Population  Distance
----------------- ----------- ----------- -----------
Aurora            Colorado    325,078     013,141.64
Colorado Springs  Colorado    416,427     101,487.28
Albuquerque       New Mexico  545,852     537,221.38
Wichita           Kansas      382,368     702,553.01
Lincoln           Nebraska    258,379     716,934.90
Lubbock           Texas       229,573     738,625.38
Omaha             Nebraska    408,958     784,842.10
Oklahoma City     Oklahoma    579,999     809,747.65
Tulsa             Oklahoma    391,906     882,203.51
El Paso           Texas       649,121     895,789.96
Kansas City       Missouri    459,787     898,397.45
Scottsdale        Arizona     217,385     926,980.71

There are many more useful methods and properties implemented in the two spatial data types. In addition, you can improve the performance of spatial queries with the help of specialized spatial indexes; a short sketch of creating one follows below. Please refer to the MSDN article "Spatial Data (SQL Server)" at https://msdn.microsoft.com/en-us/library/bb933790.aspx for more details on the spatial data types, their methods, and spatial indexes.
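For illustration only, here is a minimal sketch of what creating such a spatial index on the Location column of the Dimension.City table could look like. The index name is an assumption, and the statement presumes the table has a clustered primary key, which a spatial index requires; the sample database may already ship with its own indexes.

-- Hypothetical spatial index on the geography column;
-- the default tessellation scheme for geography columns is used
CREATE SPATIAL INDEX SIX_City_Location
ON Dimension.City (Location);

Whether such an index actually helps depends on the queries; predicates using methods like STIntersects() and STDistance() in the WHERE clause are the typical candidates.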
XML Support in SQL Server

SQL Server in version 2005 started to feature extended support for XML data inside the database engine, although some basic support was already included in version 2000. The support starts with generating XML data from tabular results. You can use the FOR XML clause of the SELECT statement for this task. The following query generates an XML document from the regular tabular result set by using the FOR XML clause with the AUTO option, to generate an element-centric XML instance, with the namespace and inline schema included.

SELECT c.[Customer Key] AS CustomerKey,
  c.[WWI Customer ID] AS CustomerId,
  c.[Customer],
  c.[Buying Group] AS BuyingGroup,
  f.Quantity,
  f.[Total Excluding Tax] AS Amount,
  f.Profit
FROM Dimension.Customer AS c
  INNER JOIN Fact.Sale AS f
    ON c.[Customer Key] = f.[Customer Key]
WHERE c.[Customer Key] IN (127, 128)
FOR XML AUTO, ELEMENTS,
  ROOT('CustomersOrders'),
  XMLSCHEMA('CustomersOrdersSchema');
GO

Here is the partial result of this query. The first part of the result is the inline schema.

<CustomersOrders>
  <xsd:schema targetNamespace="CustomersOrdersSchema" …
    <xsd:import namespace="http://schemas.microsoft.com/sqlserver/2004/sqltypes" …
    <xsd:element name="c">
      <xsd:complexType>
        <xsd:sequence>
          <xsd:element name="CustomerKey" type="sqltypes:int" />
          <xsd:element name="CustomerId" type="sqltypes:int" />
          <xsd:element name="Customer">
            <xsd:simpleType>
              <xsd:restriction base="sqltypes:nvarchar" …
                <xsd:maxLength value="100" />
              </xsd:restriction>
            </xsd:simpleType>
          </xsd:element>
          …
        </xsd:sequence>
      </xsd:complexType>
    </xsd:element>
  </xsd:schema>
  <c >
    <CustomerKey>127</CustomerKey>
    <CustomerId>127</CustomerId>
    <Customer>Tailspin Toys (Point Roberts, WA)</Customer>
    <BuyingGroup>Tailspin Toys</BuyingGroup>
    <f>
      <Quantity>3</Quantity>
      <Amount>48.00</Amount>
      <Profit>31.50</Profit>
    </f>
    <f>
      <Quantity>9</Quantity>
      <Amount>2160.00</Amount>
      <Profit>1363.50</Profit>
    </f>
  </c>
  <c >
    <CustomerKey>128</CustomerKey>
    <CustomerId>128</CustomerId>
    <Customer>Tailspin Toys (East Portal, CO)</Customer>
    <BuyingGroup>Tailspin Toys</BuyingGroup>
    <f>
      <Quantity>84</Quantity>
      <Amount>420.00</Amount>
      <Profit>294.00</Profit>
    </f>
  </c>
  …
</CustomersOrders>

You can also do the opposite process: convert XML to tables. Converting XML to relational tables is known as shredding XML. You can do this by using the nodes() method of the XML data type or with the OPENXML() rowset function.
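As a minimal sketch of shredding (the variable, the XML content, and the column names here are made up purely for illustration), the following code uses the nodes() method to turn each Order element into a row and the value() method to extract the attribute and element values:

DECLARE @doc AS XML = N'
<Orders>
  <Order orderid="1"><orderdate>2016-07-01T00:00:00</orderdate></Order>
  <Order orderid="3"><orderdate>2016-07-03T00:00:00</orderdate></Order>
</Orders>';

-- Each Order element becomes one row; value() extracts scalar values from it
SELECT o.n.value('@orderid', 'INT') AS OrderId,
  o.n.value('(orderdate)[1]', 'DATETIME') AS OrderDate
FROM @doc.nodes('/Orders/Order') AS o(n);

A similar result could be achieved with the OPENXML() rowset function, which works on a document handle prepared with the sp_xml_preparedocument procedure.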
Inside SQL Server, you can also query the XML data from Transact-SQL to find specific elements, attributes, or XML fragments. XQuery is a standard language for browsing XML instances and returning XML, and it is supported inside the XML data type methods.

You can store XML instances inside a SQL Server database in a column of the XML data type. The XML data type includes five methods that accept XQuery as a parameter. The methods support querying (the query() method), retrieving atomic values (the value() method), existence checks (the exist() method), modifying sections within the XML data (the modify() method) as opposed to overriding the whole thing, and shredding XML data into multiple rows in a result set (the nodes() method).

The following code creates a variable of the XML data type to store an XML instance in it. Then it uses the query() method to return XML fragments from the XML instance. This method accepts an XQuery query as a parameter. The XQuery query uses the FLWOR expressions to define and shape the XML returned.

DECLARE @x AS XML;
SET @x = N'
<CustomersOrders>
  <Customer custid="1">
    <!-- Comment 111 -->
    <companyname>CustA</companyname>
    <Order orderid="1">
      <orderdate>2016-07-01T00:00:00</orderdate>
    </Order>
    <Order orderid="9">
      <orderdate>2016-07-03T00:00:00</orderdate>
    </Order>
    <Order orderid="12">
      <orderdate>2016-07-12T00:00:00</orderdate>
    </Order>
  </Customer>
  <Customer custid="2">
    <!-- Comment 222 -->
    <companyname>CustB</companyname>
    <Order orderid="3">
      <orderdate>2016-07-01T00:00:00</orderdate>
    </Order>
    <Order orderid="10">
      <orderdate>2016-07-05T00:00:00</orderdate>
    </Order>
  </Customer>
</CustomersOrders>';
SELECT @x.query('for $i in CustomersOrders/Customer/Order
                 let $j := $i/orderdate
                 where $i/@orderid < 10900
                 order by ($j)[1]
                 return
                 <Order-orderid-element>
                   <orderid>{data($i/@orderid)}</orderid>
                   {$j}
                 </Order-orderid-element>')
       AS [Filtered, sorted and reformatted orders with let clause];

Here is the result of the previous query.

<Order-orderid-element>
  <orderid>1</orderid>
  <orderdate>2016-07-01T00:00:00</orderdate>
</Order-orderid-element>
<Order-orderid-element>
  <orderid>3</orderid>
  <orderdate>2016-07-01T00:00:00</orderdate>
</Order-orderid-element>
<Order-orderid-element>
  <orderid>9</orderid>
  <orderdate>2016-07-03T00:00:00</orderdate>
</Order-orderid-element>
<Order-orderid-element>
  <orderid>10</orderid>
  <orderdate>2016-07-05T00:00:00</orderdate>
</Order-orderid-element>
<Order-orderid-element>
  <orderid>12</orderid>
  <orderdate>2016-07-12T00:00:00</orderdate>
</Order-orderid-element>

Summary

In this article, you got a review of the SQL Server features for developers that already exist in previous versions. You can see that this support goes well beyond basic SQL statements, and also beyond pure Transact-SQL.

Resources for Article:

Further resources on this subject:
- Configuring a MySQL linked server on SQL Server 2008 [article]
- Basic Website using Node.js and MySQL database [article]
- Exception Handling in MySQL for Python [article]