
How-To Tutorials


Using NoSQL Databases

Packt
26 Apr 2016
11 min read
In this article by Valentin Bojinov, the author of the book RESTful Web API Design with Node.js, Second Edition, we will look for a better storage solution, one that can scale easily along with our REST-enabled application. These days, the so-called NoSQL databases are used heavily in cloud environments. They have the following advantages over traditional transactional SQL databases:

- They are schemaless; that is, they work with object representations rather than storing the object state in one or several tables, depending on its complexity.
- They are extendable, because they store actual objects. Data evolution is supported implicitly, so all you need to do is call the operation that stores the object.
- They are designed to be highly distributed and scalable. Nearly all modern NoSQL solutions out there support clustering and can scale further along with the load of your application. Additionally, most of them have REST-enabled interfaces over HTTP, which eases their usage behind a load balancer in high-availability scenarios.
- Classical database drivers are usually not available for traditional client-side languages, such as JavaScript, because they require native libraries or drivers. However, the idea of NoSQL originated from document data stores, so most of them support the JSON format, which is native to JavaScript.
- Last but not least, most NoSQL solutions are open source and available for free, with all the benefits that open source projects offer: community, examples, and freedom!

In this article, we will take a look at two NoSQL solutions: LevelDB and MongoDB. We will see how to design and test our database models, and finally, we will take a brief look at content delivery network (CDN) infrastructures. (For more resources related to this topic, see here.)

Key/value store – LevelDB

The first data store we will look at is LevelDB. It is an open source implementation developed by Google and written in C++. It is supported by a wide range of platforms, including Node.js. LevelDB is a key/value store; both the key and the value are represented as binary data, so their content can vary from simple strings to binary representations of serialized objects in any format, such as JSON or XML. As it is a key/value data store, working with it is similar to working with an associative array: a key identifies an object uniquely within the store. Furthermore, the keys are stored sorted for better performance.

But what makes LevelDB perform better than an arbitrary file storage implementation? It uses a "log-structured merge" topology, which stores all write operations in an in-memory log that is transferred (flushed) regularly to permanent storage called Sorted String Table (SST) files. Read operations first attempt to retrieve entries from a cache containing the most commonly returned results. The size of the read cache and the flush interval of the write log are configurable parameters, which can be adjusted further to match the application load. The following image shows this topology:

The storage is a collection of string-sorted files with a maximum size of about 2 MB. Each file consists of 4 KB segments that are readable by a single read operation. The table files are not sorted in a straightforward manner, but are organized into levels. The log level is on top, before all other levels. It is always flushed to level 0, which consists of at most four SST files. When that level is filled, one SST file is compacted to a lower level, that is, level 1.
The maximum size of level 1 is 10 MB. When it gets filled, a file goes from level 1 to level 2. LevelDB assumes that each lower level is ten times larger than the previous level, so we have the following level structure:

- Log, with a configurable size
- Level 0, consisting of at most four SST files
- Level 1, with a maximum size of 10 MB
- Level 2, with a maximum size of 100 MB
- Level 3, with a maximum size of 1,000 MB
- Level n, with a maximum size ten times that of level n-1

The hierarchical structure of this topology ensures that newer data stays in the top levels, while older data sits somewhere in the lower levels. A read operation always starts by searching for a given key in the cache, and if it is not found there, the operation traverses each level until the entry is found. An entry is considered nonexistent if its key is not found anywhere within all the levels.

LevelDB provides get, put, and delete operations to manipulate data records, as well as a batch operation that can be used to perform multiple data manipulations atomically; that is, either all or none of the operations in the batch are executed successfully. LevelDB can optionally use a compression library to reduce the size of the stored values. This compression is provided by Google's Snappy library, which is highly optimized for fast compression with low performance impact, so don't expect a particularly large compression ratio.

There are two popular libraries that enable LevelDB usage in Node: LevelDOWN and LevelUP. Initially, LevelDOWN acted as the foundation binding, implicitly provided with LevelUP, but after version 0.9 it was extracted and became available as a standalone binding for LevelDB. Currently, LevelUP has no explicit dependency on LevelDOWN defined; it needs to be installed separately, as LevelUP expects it to be available on Node's require() path. LevelDOWN is a pure C++ interface used to bind Node and LevelDB. Though it is slightly faster than LevelUP, it has some state-safety and API considerations that make it less preferable. To be concrete, LevelDOWN does not keep track of the state of the underlying instance, so it is up to the developers themselves not to open a connection more than once or to use a data-manipulation operation against a closed database connection, as this will cause errors. LevelUP provides state-safe operations out of the box and thus prevents out-of-state operations from being sent to its foundation, LevelDOWN. Let's move on to installing LevelUP by executing the following npm command:

```
npm install levelup leveldown
```

Even though the LevelUP module can be installed without LevelDOWN, it will not work at runtime, complaining that it can't find its underlying dependency.

Enough theory! Let's see what the LevelUP API looks like. The following code snippet instantiates LevelDB and inserts a dummy contact record into it. It also exposes a /contacts/:number route so that this very record can be returned as JSON output when queried appropriately.
Let's use it in a new project in the Enide studio, in a file named levelup.js:

```js
var express = require('express')
  , http = require('http')
  , path = require('path')
  , bodyParser = require('body-parser')
  , logger = require('morgan')
  , methodOverride = require('method-override')
  , errorHandler = require('errorhandler')
  , levelup = require('levelup');

var app = express();
var url = require('url');

// all environments
app.set('port', process.env.PORT || 3000);
app.set('views', __dirname + '/views');
app.set('view engine', 'jade');
app.use(methodOverride());
app.use(bodyParser.json());

// development only
if ('development' == app.get('env')) {
  app.use(errorHandler());
}

var db = levelup('./contact', {valueEncoding: 'json'});

db.put('+359777123456', {
  "firstname": "Joe",
  "lastname": "Smith",
  "title": "Mr.",
  "company": "Dev Inc.",
  "jobtitle": "Developer",
  "primarycontactnumber": "+359777123456",
  "othercontactnumbers": ["+359777456789", "+359777112233"],
  "primaryemailaddress": "joe.smith@xyz.com",
  "emailaddresses": ["j.smith@xyz.com"],
  "groups": ["Dev", "Family"]
});

app.get('/contacts/:number', function(request, response) {
  console.log(request.url + ' : querying for ' + request.params.number);
  db.get(request.params.number, function(error, data) {
    if (error) {
      response.writeHead(404, {'Content-Type': 'text/plain'});
      response.end('Not Found');
      return;
    }
    response.setHeader('content-type', 'application/json');
    response.send(data);
  });
});

console.log('Running at port ' + app.get('port'));
http.createServer(app).listen(app.get('port'));
```

As the contact is inserted into LevelDB before the HTTP server is created, the record identified by the +359777123456 key will be available in the database when we execute our first GET request. But before requesting any data, let's take a closer look at the usage of LevelUP. The get() function takes two arguments:

- The first argument is the key to be used in the query.
- The second argument is a callback function used to process the result. It, in turn, has two arguments of its own: an error value, set when something went wrong during the query, and the actual result entity from the database.

Let's start it with node levelup.js and execute some test requests with the REST Client tool against http://localhost:3000/contacts/%2B359777123456. This can be seen in the following screenshot:

As expected, the response is a JSON representation of the contact inserted statically into LevelUP during the initialization of the application. Requesting any other key will result in an "HTTP 404 Not Found" response. This example demonstrates how to bind a LevelUP operation to an HTTP operation and process its results, but it currently lacks support for inserting, editing, and deleting data. We will improve that with the next sample.
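Before extending the routes, it is worth illustrating the batch operation mentioned earlier. The following is a minimal, hedged sketch rather than code from the book: it assumes the same db instance with JSON value encoding, and the keys and contact fields shown are purely illustrative.

```js
// Hedged sketch of LevelUP's batch API: the listed operations are applied
// atomically, so either all of them succeed or nothing is written.
// The keys and values below are illustrative only.
db.batch([
  { type: 'put', key: '+359777999888', value: { firstname: 'Jane', lastname: 'Smith' } },
  { type: 'del', key: '+359777456789' }
], function (error) {
  if (error) {
    return console.log('Batch failed; nothing was written', error);
  }
  console.log('Batch applied atomically');
});
```

With that aside, let's extend the API.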
The next sample binds the HTTP GET, POST, and DELETE operations, exposed via an Express route, /contacts/:number, to LevelDB's get, put, and del handlers:

```js
var express = require('express')
  , http = require('http')
  , path = require('path')
  , bodyParser = require('body-parser')
  , logger = require('morgan')
  , methodOverride = require('method-override')
  , errorHandler = require('errorhandler')
  , levelup = require('levelup');

var app = express();
var url = require('url');

// all environments
app.set('port', process.env.PORT || 3000);
app.set('views', __dirname + '/views');
app.set('view engine', 'jade');
app.use(methodOverride());
app.use(bodyParser.json());

// development only
if ('development' == app.get('env')) {
  app.use(errorHandler());
}

var db = levelup('./contact', {valueEncoding: 'json'});

app.get('/contacts/:number', function(request, response) {
  console.log(request.url + ' : querying for ' + request.params.number);
  db.get(request.params.number, function(error, data) {
    if (error) {
      response.writeHead(404, {'Content-Type': 'text/plain'});
      response.end('Not Found');
      return;
    }
    response.setHeader('content-type', 'application/json');
    response.send(data);
  });
});

app.post('/contacts/:number', function(request, response) {
  console.log('Adding new contact with primary number ' + request.params.number);
  db.put(request.params.number, request.body, function(error) {
    if (error) {
      response.writeHead(500, {'Content-Type': 'text/plain'});
      response.end('Internal server error');
      return;
    }
    response.send(request.params.number + ' successfully inserted');
  });
});

app.del('/contacts/:number', function(request, response) {
  console.log('Deleting contact with primary number ' + request.params.number);
  db.del(request.params.number, function(error) {
    if (error) {
      response.writeHead(500, {'Content-Type': 'text/plain'});
      response.end('Internal server error');
      return;
    }
    response.send(request.params.number + ' successfully deleted');
  });
});

app.get('/contacts', function(request, response) {
  console.log('Listing all contacts');
  var is_first = true;
  response.setHeader('content-type', 'application/json');
  db.createReadStream()
    .on('data', function (data) {
      console.log(data.value);
      if (is_first == true) {
        response.write('[');
      } else {
        response.write(',');
      }
      response.write(JSON.stringify(data.value));
      is_first = false;
    })
    .on('error', function (error) {
      console.log('Error while reading', error);
    })
    .on('close', function () {
      console.log('Closing db stream');
    })
    .on('end', function () {
      console.log('Db stream closed');
      response.end(']');
    });
});

console.log('Running at port ' + app.get('port'));
http.createServer(app).listen(app.get('port'));
```

Perhaps the most interesting part of the preceding sample is the handler of the /contacts route. It writes a JSON array of all the contacts available in the database to the output stream of the HTTP response. LevelUP's createReadStream method exposes a data handler for every key/value pair available. As LevelDB is not aware of the format of its values, we use JSON.stringify to serialize each value into the response, and on top of this we can implement any kind of logic. Let's assume we want a function that flushes to the HTTP response only those contacts whose last name is Smith.
Then we will need to add filtering logic to the data handler (the separator handling is adjusted slightly here so that skipped records do not produce stray commas):

```js
db.createReadStream()
  .on('data', function (data) {
    if (data.value.lastname.toString() == 'Smith') {
      // open the array on the first emitted record, separate subsequent ones with commas
      if (is_first == true) {
        response.write('[');
      } else {
        response.write(',');
      }
      var jsonString = JSON.stringify(data.value);
      console.log('Adding Mr. ' + data.value.lastname + ' to the response');
      response.write(jsonString);
      is_first = false;
    } else {
      console.log('Skipping Mr. ' + data.value.lastname);
    }
  })
  .on('error', function (error) {
    console.log('Error while reading', error);
  })
  .on('close', function () {
    console.log('Closing db stream');
  })
  .on('end', function () {
    console.log('Db stream closed');
    // if nothing matched, we still need the opening bracket
    response.end(is_first ? '[]' : ']');
  });
```

This looks a bit artificial, doesn't it? Well, this is all that LevelDB can possibly offer us, since it can search only by a single key. This makes it an inappropriate option for data that has to be indexed by several different attributes. This is where document stores come into play.

Summary

In this article, we looked at one type of NoSQL database: LevelDB, a key/value data store. We utilized it to implement automated tests for the database layer.

Resources for Article:

Further resources on this subject:
- Node.js Fundamentals and Asynchronous JavaScript [article]
- An Introduction to Node.js Design Patterns [article]
- Making a Web Server in Node.js [article]


Advanced Shell Topics

Packt
26 Apr 2016
10 min read
In this article by Thomas Bitterman, the author of the book Mastering IPython 4.0, we will look at the tools the IPython interactive shell provides. With the split of the Jupyter and IPython projects, the command line provided by IPython will gain importance. This article covers the following topics:

- What is IPython?
- Installing IPython
- Starting out with the terminal
- IPython beyond Python
- Magic commands

(For more resources related to this topic, see here.)

What is IPython?

IPython is an open source platform for interactive and parallel computing. It started with the realization that the standard Python interpreter was too limited for sustained interactive use, especially in the areas of scientific and parallel computing. Overcoming these limitations resulted in a three-part architecture:

- An enhanced, interactive shell
- Separation of the shell from the computational kernel
- A new architecture for parallel computing

This article will provide a brief overview of the architecture before introducing some basic shell commands. Before proceeding further, however, IPython needs to be installed. Readers with experience in parallel and high-performance computing but new to IPython will find the following sections useful for quickly getting up to speed. Those experienced with IPython may skim the next few sections, noting where things have changed now that the notebook is no longer an integral part of development.

Installing IPython

The first step in installing IPython is to install Python. Instructions for the various platforms differ, but the instructions for each can be found on the Python home page at http://www.python.org. IPython requires Python 2.7 or ≥ 3.3; this article will use 3.5. Both Python and IPython are open source software, so downloading and installation are free.

A standard Python installation includes the pip package manager. pip is a handy command-line tool that can be used to download and install various Python libraries. Once Python is installed, IPython can be installed with this command:

```
pip install ipython
```

IPython comes with a test suite called iptest. To run it, simply issue the following command:

```
iptest
```

A series of tests will be run. It is possible (and likely on Windows) that some libraries will be missing, causing the associated tests to fail. Simply use pip to install those libraries and rerun the test until everything passes. It is also possible for all tests to pass without an important library being installed: the readline library (also known as PyReadline). IPython will work without it but will be missing some features that are useful in the IPython terminal, such as command completion and history navigation. To install readline, use pip:

```
pip install readline
pip install gnureadline
```

At this point, issuing the ipython command will start up an IPython interpreter:

```
ipython
```

IPython beyond Python

No one would use IPython if it were not more powerful than the standard terminal. Much of IPython's power comes from two features:

- Shell integration
- Magic commands

Shell integration

Any command starting with ! is passed directly to the operating system to be executed, and the result is returned. By default, the output is then printed out to the terminal. If desired, the result of the system command can be assigned to a variable. The result is treated as a multiline string, and the variable is a list containing one string element per line of output.
For example:

```
In [22]: myDir = !dir

In [23]: myDir
Out[23]:
[' Volume in drive C has no label.',
 ' Volume Serial Number is 1E95-5694',
 '',
 ' Directory of C:\Program Files\Python 3.5',
 '',
 '10/04/2015  08:43 AM    <DIR>          .',
 '10/04/2015  08:43 AM    <DIR>          ..',]
```

While this functionality is not entirely absent in straight Python (the os and subprocess libraries provide similar abilities), the IPython syntax is much cleaner. Additional functionality such as input and output caching, directory history, and automatic parentheses is also included.

History

The previous examples have had lines that were prefixed by elements such as In[23] and Out[15]. In and Out are arrays of strings, where each element is either an input command or the resulting output. They can be referred to using array notation, and "magic" commands can accept the subscript alone.

Magic commands

IPython also accepts commands that control IPython itself. These are called "magic" commands, and they start with % or %%. A complete list of magic commands can be found by typing %lsmagic in the terminal. Magics that start with a single % sign are called "line" magics; they accept the rest of the current line as arguments. Magics that start with %% are called "cell" magics; they accept not only the rest of the current line but also the following lines.

There are too many magic commands to go over in detail, but there are some related families to be aware of:

- OS equivalents: %cd, %env, %pwd
- Working with code: %run, %edit, %save, %load, %load_ext, %%capture
- Logging: %logstart, %logstop, %logon, %logoff, %logstate
- Debugging: %debug, %pdb, %run, %tb
- Documentation: %pdef, %pdoc, %pfile, %pprint, %psource, %pycat, %%writefile
- Profiling: %prun, %time, %run, %timeit
- Working with other languages: %%script, %%html, %%javascript, %%latex, %%perl, %%ruby

With magic commands, IPython becomes a more full-featured development environment. A development session might include the following steps:

1. Set up the OS-level environment with the %cd, %env, and ! commands.
2. Set up the Python environment with %load and %load_ext.
3. Create a program using %edit.
4. Run the program using %run.
5. Log the input/output with %logstart, %logstop, %logon, and %logoff.
6. Debug with %pdb.
7. Create documentation with %pdoc and %pdef.

This is not a tenable workflow for a large project, but for exploratory coding of smaller modules, magic commands provide a lightweight support structure.

Creating custom magic commands

IPython supports the creation of custom magic commands through function decorators. Luckily, one does not have to know how decorators work in order to use them. An example will explain. First, grab the required decorator from the appropriate library:

```
In [1]: from IPython.core.magic import (register_line_magic)
```

Then, prepend the decorator to a standard IPython function definition:

```
In [2]: @register_line_magic
   ...: def getBootDevice(line):
   ...:     sysinfo = !systeminfo
   ...:     for ln in sysinfo:
   ...:         if ln.startswith("Boot Device"):
   ...:             return(ln.split()[2])
   ...:
```

Your new magic is ready to go:

```
In [3]: %getBootDevice
Out[3]: '\Device\HarddiskVolume1'
```

Some observations are in order:

- Note that the function is, for the most part, standard Python. Also note the use of the !systeminfo shell command: one can freely mix both standard Python and IPython in IPython.
- The name of the function will be the name of the line magic.
- The parameter, "line," contains the rest of the line (in case any parameters are passed). A parameter is required, although it need not be used.
- The Out associated with calling this line magic is the return value of the magic. Any print statements executed as part of the magic are displayed on the terminal but are not part of Out (or _).

Cython

We are not limited to writing custom magic commands in Python. Several languages are supported, including R and Octave. We will look at one in particular: Cython.

Cython is a language that can be used to write C extensions for Python. The goal of Cython is to be a superset of Python, with support for optional static type declarations. The driving force behind Cython is efficiency. As a compiled language, there are performance gains to be had from running C code. The downside is that Python is much more productive in terms of programmer hours. Cython can translate Python code into compiled C code, achieving more efficient execution at runtime while retaining the programmer-friendliness of Python.

The idea of turning Python into C is not new to Cython. The default and most widely used interpreter for Python, CPython, is written in C. In some sense, then, running Python code means running C code, just through an interpreter. There are other Python interpreter implementations as well, including those in Java (Jython) and C# (IronPython). CPython has a foreign function interface to C; that is, it is possible to write C language functions that interface with CPython in such a way that data can be exchanged and functions invoked from one to the other. The primary use is to call C code from Python. There are, however, two primary drawbacks: writing code that works with the CPython foreign function interface is difficult in its own right, and doing so requires knowledge of Python, C, and CPython. Cython aims to remedy this problem by doing all the work of turning Python into C and interfacing with CPython internally. The programmer writes Cython code and leaves the rest to the Cython compiler.

Cython is very close to Python. The primary difference is the ability to specify C types for variables using the cdef keyword. Cython then handles type checking and conversion between Python values and C values, scoping issues, marshalling and unmarshalling of Python objects into C structures, and other cross-language issues.

Cython is enabled in IPython by loading an extension. In order to use the Cython extension, do this:

```
In [1]: %load_ext Cython
```

At this point, the cython cell magic can be invoked:

```
In [2]: %%cython
   ...: def sum(int a, int b):
   ...:     cdef int s = a+b
   ...:     return s
```

And the Cython function can now be called just as if it were a standard Python function:

```
In [3]: sum(1, 1)
Out[3]: 2
```

While this may seem like a lot of work for something that could have been written more easily in Python in the first place, that is the price to be paid for efficiency. If, instead of simply summing two numbers, a function is expensive to execute and is called multiple times (perhaps in a tight loop), it can be worth using Cython for a reduction in runtime. There are other languages that have merited the same treatment, GNU Octave and R among them.

Summary

In this article, we covered many of the basics of using IPython for development. We started out by just getting an instance of IPython running. The intrepid developer can perform all the steps by hand, but there are also various all-in-one distributions available that include popular modules upon installation. By default, IPython will use the pip package manager.
Again, the all-in-one distributions provide added value, this time in the form of advanced package management capability. At that point, all that is obviously available is a terminal, much like the standard Python terminal. IPython offers two additional sources of functionality, however: configuration and magic commands.

Magic commands fall into several categories: OS equivalents, working with code, logging, debugging, documentation, profiling, and working with other languages, among others. Add to this the ability to create custom magic commands (in IPython or another language) and the IPython terminal becomes a much more powerful alternative to the standard Python terminal.

Also included in IPython is the debugger, ipdb. It is very similar to the Python pdb debugger, so it should be familiar to Python developers.

All this is supported by the IPython architecture. The basic idea is that of a Read-Eval-Print loop in which the Eval section has been separated out into its own process. This decoupling allows different user interface components and kernels to communicate with each other, making for a flexible system.

This flexibility extends to the development environment. There are IDEs devoted to IPython (for example, Spyder and Canopy) and others that originally targeted Python but also work with IPython (for example, Eclipse). There are too many Python IDEs to list, and many should work with an IPython kernel "dropped in" as a superior replacement to a Python interpreter.

Resources for Article:

Further resources on this subject:
- Python Data Science Up and Running [article]
- Scientific Computing APIs for Python [article]
- Overview of Process Management in Microsoft Visio 2013 [article]


Introducing and Setting Up Go

Packt
26 Apr 2016
9 min read
In this article by Nathan Kozyra, the author of the book Learning Go Web Development, we will introduce Go and set it up for web work. One of the most common things you'll hear said about Go is that it's a systems language. Indeed, one of the earlier descriptions of Go, by the Go team itself, was that the language was built to be a modern systems language. It was constructed to combine the speed and power of languages such as C with the syntactical elegance and thrift of modern interpreted languages such as Python. You can see that this goal has been realized when you look at just a few snippets of Go code. From the Go FAQ, on why Go was created:

"Go was born out of frustration with existing languages and environments for systems programming."

Perhaps the largest part of present-day systems programming is the design of backend servers. Obviously, the Web comprises a huge, but not exclusive, percentage of that world. Go hasn't been considered a web language until recently. Unsurprisingly, it took a few years of developers dabbling, experimenting, and finally embracing the language to start taking it to new avenues. While Go is web-ready out of the box, it lacks a lot of the critical frameworks and tools people so often take for granted with web development now. As the community around Go grew, the scaffolding began to manifest in a lot of new and exciting ways. Combined with existing ancillary tools, Go is now a wholly viable option for end-to-end web development.

However, let's get back to the primary question: why Go? To be fair, it's not right for every web project, but any application that can benefit from high-performance, secure web serving out of the box, with the added benefit of a beautiful concurrency model, would make a good candidate.

We're not going to deal with a lot of low-level aspects of the Go language. For example, we assume that you're familiar with variable and constant declaration, and that you understand control structures. In this article, we will cover the following topics:

- Installing Go
- Structuring a project
- Importing packages

(For more resources related to this topic, see here.)

Installing Go

The most critical first step is, of course, making sure that Go is available and ready to start our first web server. While one of Go's biggest selling points is its cross-platform support (both building and using locally while targeting other operating systems), your life will be much easier on a *nix-compatible platform. If you're on Windows, don't fear. Natively, you may run into incompatible packages, firewall issues when using go run, and some other quirks, but 95% of the Go ecosystem will be available to you. You can also, very easily, run a virtual machine, and in fact that is a great way to simulate a potential production environment.

In-depth installation instructions are available at https://golang.org/doc/install, but we'll talk about a few quirky points here before moving on. For OS X and Windows, Go is provided as part of a binary installation package. For any Linux platform with a package manager, things can be pretty easy. To install via common Linux package managers:

- Ubuntu: sudo apt-get install golang
- CentOS: sudo yum install golang

On both OS X and Linux, you'll need to add a couple of environment variables to your path: PATH and GOPATH. First, you'll have to find the location of your Go binary's installation. This varies from distribution to distribution.
Once you've found that, you can configure PATH and GOPATH as follows:

```
export PATH=$PATH:/usr/local/go/bin
export GOPATH="/usr/share/go"
```

While the path to be used is not defined rigidly, some convention has coalesced around starting at a subdirectory directly under your user's home directory, such as $HOME/go or ~/go. As long as this location is set permanently and doesn't change, you won't run into issues with conflicts or missing packages.

You can test the impact of these changes by running the go env command. If you see any issues with this, it means that your directories are not correct. Note that this may not prevent Go from running, depending on whether the GOBIN directory is properly set, but it will prevent you from installing packages globally across your system.

To test the installation, you can grab any Go package using the go get command and create a Go file somewhere. As a quick example, first get a package at random; we'll use a package from the Gorilla framework:

```
go get github.com/gorilla/mux
```

If this runs without any issue, Go is finding your GOPATH correctly. To make sure that Go is able to access your downloaded packages, draw up a very quick program that attempts to utilize Gorilla's mux package and run it to verify whether the package is found:

```go
package main

import (
	"fmt"
	"github.com/gorilla/mux"
	"net/http"
)

func TestHandler(w http.ResponseWriter, r *http.Request) {
}

func main() {
	router := mux.NewRouter()
	router.HandleFunc("/test", TestHandler)
	http.Handle("/", router)
	fmt.Println("Everything is set up!")
}
```

Run go run test.go in the command line. It won't do much, but it will deliver the good news, as shown in the following screenshot:

Structuring a project

When you're first getting started and mostly playing around, there's no real problem with setting up your application lazily. For example, to get started as quickly as possible, you can create a simple hello.go file anywhere you like and compile it without any issue. But when you get into environments that require multiple or distinct packages (more on that shortly) or have more explicit cross-platform requirements, it makes sense to design your projects in a way that facilitates the use of the Go build tool.

The value of setting up your code in this manner lies in the way the Go build tool works. If you have packages local to your project, the build tool will look in the src directory first and then in your GOPATH. When you're building for other platforms, go build will utilize the local bin folder to organize the binaries. While building packages that are intended for mass use, you may also find that either starting your application under your GOPATH directory and then symbolically linking it to another directory, or doing the opposite, will allow you to develop without the need to subsequently go get your own code.

Code conventions

As with any language, being a part of the Go community means perpetual consideration of the way others create their code. Particularly if you're going to work in open source repositories, you'll want to generate your code the way others do, to reduce the amount of friction when people get or include your code. One incredibly helpful piece of tooling that the Go team has included is go fmt. fmt here, of course, means format, and that's exactly what this tool does: it automatically formats your code according to the designed conventions.
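As a quick, hedged illustration (the file name below is made up for this example), the formatter can be invoked from the command line in a few common ways:

```
gofmt -l .         # list the files whose formatting differs from gofmt's style
gofmt -w main.go   # rewrite main.go in place with standard formatting
go fmt ./...       # run gofmt on every package below the current directory
```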
By enforcing style conventions, the Go team has helped to mitigate one of the most common and pervasive debates that exists in a lot of other languages. While language communities tend to drive coding conventions, there are always little idiosyncrasies in the way individuals write programs. Let's use one of the most common examples around: where to put the opening brace. Some programmers like it on the same line as the statement:

```
for (int i = 0; i < 100; i++) {
  // do something
}
```

While others prefer it on the subsequent line:

```
for (int i = 0; i < 100; i++)
{
  // do something
}
```

These types of minor differences spark major, near-religious debates. The gofmt tool helps alleviate this by allowing you to yield to Go's directives. In fact, Go bypasses this obvious source of contention at the compiler level: the opening brace must sit on the same line as the statement, and if you use the latter style, the compiler will complain and all you'll get is a fatal error. However, other style choices have some flexibility, and these are enforced when you use the tool to format. Here, for example, is a piece of Go code before go fmt:

```go
func Double(n int) int {
if (n == 0) {
return 0
} else {
return n * 2
}
}
```

Arbitrary whitespace can be the bane of a team's existence when it comes to sharing and reading code, particularly when every team member is not on the same IDE. By running go fmt, we clean this up, thereby translating our whitespace according to Go's conventions:

```go
func Double(n int) int {
	if n == 0 {
		return 0
	} else {
		return n * 2
	}
}
```

Long story short: always run go fmt before shipping or pushing your code.

Importing packages

Beyond the most trivial application, one that cannot even produce a Hello World output, you must have some imported packages in a Go application. To say Hello World, for example, we'd need some sort of way to generate output. Unlike in many other languages, even the core language library is accessible via namespaced packages. In Go, namespaces are handled by repository endpoint URLs, for example github.com/nkozyra/gotest, which can be opened directly on GitHub (or any other public location) for review.

Handling private repositories

The go get tool easily handles packages hosted at repositories such as GitHub, Bitbucket, and Google Code (as well as a few others). You can also host your own projects, ideally as git projects, elsewhere, although that might introduce some dependencies and sources of errors, which you'd probably like to avoid. But what about private repos? While go get is a wonderful tool, without some additional configuration (SSH agent forwarding and so on) you'll find yourself looking at an error. You can work around this in a couple of ways, but one very simple method is to clone the repository locally, using your version control software directly.

Summary

This article serves as an introduction to the most basic concepts of Go and of producing for the Web in Go, but these points are critical foundational elements for being productive in the language and in the community. We've looked at coding conventions and package design and organization. Obviously, we're a long way from a real, mature application for the Web, but the building blocks are essential to getting there.

Resources for Article:

Further resources on this subject:
- ASP.NET 3.5 CMS: Adding Security and Membership (Part 2) [article]
- Working with Drupal Audio in Flash (part 2) [article]
- Posting on Your WordPress Blog [article]


Mobile Forensics and Its Challenges

Packt
25 Apr 2016
10 min read
In this article by Heather Mahalik and Rohit Tamma, authors of the book Practical Mobile Forensics, Second Edition, we will cover the following topics:

- Introduction to mobile forensics
- Challenges in mobile forensics

(For more resources related to this topic, see here.)

Why do we need mobile forensics?

In 2015, there were more than 7 billion mobile cellular subscriptions worldwide, up from less than 1 billion in 2000, according to the International Telecommunication Union (ITU). The world is witnessing technology and user migration from desktops to mobile phones. The following figure, sourced from statista.com, shows the actual and estimated growth of smartphones from the year 2009 to 2018 (in million units).

Gartner Inc. reports that global mobile data traffic reached 52 million terabytes (TB) in 2015, an increase of 59 percent from 2014, and the rapid growth is set to continue through 2018, when mobile data levels are estimated to reach 173 million TB.

Smartphones of today, such as the Apple iPhone, the Samsung Galaxy series, and BlackBerry phones, are compact forms of computers with high performance, huge storage, and enhanced functionality. Mobile phones are the most personal electronic devices that users access. They are used to perform simple communication tasks, such as calling and texting, while still providing support for Internet browsing, e-mail, taking photos and videos, creating and storing documents, identifying locations with GPS services, and managing business tasks. As new features and applications are incorporated into mobile phones, the amount of information stored on the devices is continuously growing. Mobile phones have become portable data carriers, and they keep track of all your moves. With the increasing prevalence of mobile phones in people's daily lives and in crime, data acquired from phones has become an invaluable source of evidence for investigations relating to criminal, civil, and even high-profile cases. It is rare to conduct a digital forensic investigation that does not include a phone. Mobile device call logs and GPS data were used to help solve the attempted bombing in Times Square, New York, in 2010. The details of the case can be found at http://www.forensicon.com/forensics-blotter/cell-phone-email-forensics-investigation-cracks-nyc-times-square-car-bombing-case/.

The science behind recovering digital evidence from mobile phones is called mobile forensics. Digital evidence is defined as information and data that is stored on, received, or transmitted by an electronic device and that is used in investigations. It encompasses any and all digital data that can be used as evidence in a case.

Mobile forensics

Digital forensics is a branch of forensic science focusing on the recovery and investigation of raw data residing in electronic or digital devices. The goal of the process is to extract and recover any information from a digital device without altering the data present on the device. Over the years, digital forensics has grown along with the rapid growth of computers and various other digital devices. There are various branches of digital forensics based on the type of digital device involved, such as computer forensics, network forensics, mobile forensics, and so on. Mobile forensics is the branch of digital forensics related to the recovery of digital evidence from mobile devices.
Forensically sound is a term used extensively in the digital forensics community to qualify and justify the use of a particular forensic technology or methodology. The main principle for a sound forensic examination of digital evidence is that the original evidence must not be modified. This is extremely difficult with mobile devices. Some forensic tools require a communication vector with the mobile device, so a standard write protection will not work during forensic acquisition. Other forensic acquisition methods may involve removing a chip or installing a bootloader on the mobile device prior to extracting data for forensic examination. In cases where the examination or data acquisition is not possible without changing the configuration of the device, the procedure and the changes must be tested, validated, and documented. Following proper methodology and guidelines is crucial in examining mobile devices, as it yields the most valuable data. As with any evidence gathering, not following the proper procedure during the examination can result in loss or damage of evidence or render it inadmissible in court.

The mobile forensics process is broken into three main categories: seizure, acquisition, and examination/analysis. Forensic examiners face some challenges while seizing the mobile device as a source of evidence. At the crime scene, if the mobile device is found switched off, the examiner should place the device in a Faraday bag to prevent changes should the device automatically power on. As shown in the following figure, Faraday bags are specifically designed to isolate the phone from the network.

A Faraday bag (Image courtesy: http://www.amazon.com/Black-Hole-Faraday-Bag-Isolation/dp/B0091WILY0)

If the phone is found switched on, switching it off has a lot of concerns attached to it. If the phone is locked by a PIN or password or is encrypted, the examiner will be required to bypass the lock or determine the PIN to access the device. Mobile phones are networked devices and can send and receive data through different sources, such as telecommunication systems, Wi-Fi access points, and Bluetooth. So, if the phone is in a running state, a criminal can securely erase the data stored on it by executing a remote wipe command. When a phone is switched on, it should be placed in a Faraday bag. If possible, prior to placing the mobile device in the Faraday bag, disconnect it from the network to protect the evidence by enabling flight mode and disabling all network connections (Wi-Fi, GPS, hotspots, and so on). This will also preserve the battery, which will drain while in a Faraday bag, and protect against leaks in the Faraday bag.

Once the mobile device is seized properly, the examiner may need several forensic tools to acquire and analyze the data stored on the phone. Mobile phones are dynamic systems that present a lot of challenges to the examiner in extracting and analyzing digital evidence. The rapid increase in the number of different kinds of mobile phones from different manufacturers makes it difficult to develop a single process or tool to examine all types of devices. Mobile phones are continuously evolving as existing technologies progress and new technologies are introduced. Furthermore, each mobile is designed with a variety of embedded operating systems. Hence, special knowledge and skills are required from forensic experts to acquire and analyze the devices.
Challenges in mobile forensics

One of the biggest forensic challenges when it comes to the mobile platform is the fact that data can be accessed, stored, and synchronized across multiple devices. As the data is volatile and can be quickly transformed or deleted remotely, more effort is required for the preservation of this data. Mobile forensics is different from computer forensics and presents unique challenges to forensic examiners. Law enforcement and forensic examiners often struggle to obtain digital evidence from mobile devices. The following are some of the reasons:

- Hardware differences: The market is flooded with different models of mobile phones from different manufacturers. Forensic examiners may come across different types of mobile models, which differ in size, hardware, features, and operating system. Also, with a short product development cycle, new models emerge very frequently. As the mobile landscape changes with each passing day, it is critical for the examiner to adapt to all the challenges and remain up to date on mobile device forensic techniques across various devices.
- Mobile operating systems: Unlike personal computers, where Windows has dominated the market for years, mobile devices widely use many operating systems, including Apple's iOS, Google's Android, RIM's BlackBerry OS, Microsoft's Windows Mobile, HP's webOS, Nokia's Symbian OS, and many others. Even within these operating systems, there are several versions, which makes the task of the forensic investigator even more difficult.
- Mobile platform security features: Modern mobile platforms contain built-in security features to protect user data and privacy. These features act as a hurdle during forensic acquisition and examination. For example, modern mobile devices come with default encryption mechanisms from the hardware layer to the software layer. The examiner might need to break through these encryption mechanisms to extract data from the devices.
- Lack of resources: As mentioned earlier, with the growing number of mobile phones, the tools required by a forensic examiner also increase. Forensic acquisition accessories, such as USB cables, batteries, and chargers for different mobile phones, have to be maintained in order to acquire those devices.
- Preventing data modification: One of the fundamental rules in forensics is to make sure that data on the device is not modified. In other words, any attempt to extract data from the device should not alter the data present on that device. But this is practically not possible with mobiles, because just switching on a device can change the data on it. Even if a device appears to be in an off state, background processes may still run. For example, in most mobiles, the alarm clock still works even when the phone is switched off. A sudden transition from one state to another may result in the loss or modification of data.
- Anti-forensic techniques: Anti-forensic techniques, such as data hiding, data obfuscation, data forgery, and secure wiping, make investigations on digital media more difficult.
- Dynamic nature of evidence: Digital evidence may be easily altered either intentionally or unintentionally. For example, browsing an application on the phone might alter the data stored by that application on the device.
- Accidental reset: Mobile phones provide features to reset everything. Resetting the device accidentally while examining it may result in the loss of data.
- Device alteration: The possible ways to alter devices may range from moving application data and renaming files to modifying the manufacturer's operating system. In this case, the expertise of the suspect should be taken into account.
- Passcode recovery: If the device is protected with a passcode, the forensic examiner needs to gain access to the device without damaging the data on it. While there are techniques to bypass the screen lock, they may not always work on all versions.
- Communication shielding: Mobile devices communicate over cellular networks, Wi-Fi networks, Bluetooth, and infrared. As device communication might alter the device data, the possibility of further communication should be eliminated after seizing the device.
- Lack of availability of tools: There is a wide range of mobile devices. A single tool may not support all of them or perform all the necessary functions, so a combination of tools needs to be used. Choosing the right tool for a particular phone might be difficult.
- Malicious programs: The device might contain malicious software or malware, such as a virus or a Trojan. Such malicious programs may attempt to spread to other devices over either a wired or a wireless interface.
- Legal issues: Mobile devices might be involved in crimes that cross geographical boundaries. In order to tackle these multijurisdictional issues, the forensic examiner should be aware of the nature of the crime and the regional laws.

Summary

Mobile devices store a wide range of information, such as SMS, call logs, browser history, chat messages, location details, and so on. Mobile device forensics includes many approaches and concepts that fall outside the boundaries of traditional digital forensics. Extreme care should be taken while handling the device, right from the evidence intake phase to the archiving phase. Examiners responsible for mobile devices must understand the different acquisition methods and the complexities of handling the data during analysis. Extracting data from a mobile device is half the battle. The operating system, security features, and type of smartphone will determine the amount of access you have to the data. It is important to follow sound forensic practices and make sure that the evidence is unaltered during the investigation.

Resources for Article:

Further resources on this subject:
- Forensics Recovery [article]
- Mobile Phone Forensics – A First Step into Android Forensics [article]
- Mobility [article]


Features of Sitecore

Packt
25 Apr 2016
17 min read
In this article by Yogesh Patel, the author of the book Sitecore Cookbook for Developers, we will discuss the importance of Sitecore and its good features. (For more resources related to this topic, see here.)

Why Sitecore?

Sitecore Experience Platform (XP) is not only an enterprise-level content management system (CMS), but rather a web framework or web platform, which is the global leader in experience management. It continues to be very popular because of its highly scalable and robust architecture, continuous innovation, and ease of implementation compared to other CMSs available. It also provides easier integration with many external platforms, such as customer relationship management (CRM), e-commerce, and so on.

Sitecore's architecture is built with the Microsoft .NET framework and provides great depth of APIs, flexibility, scalability, performance, and power to developers. It has great out-of-the-box capabilities, but one of its great strengths is the ease of extending these capabilities; hence, developers love Sitecore!

Sitecore provides many features and functionalities out of the box to help content owners and marketing teams. These features can be extended and highly customized to meet the needs of your unique business rules. Sitecore provides these features through different user-friendly interfaces that help content owners manage content and media easily and quickly. Sitecore user interfaces are supported on almost every modern browser. In addition, fully customized web applications can be layered in and integrated with other modules and tools using Sitecore as the core platform. It helps marketers to optimize the flow of content continuously for better results and more valuable outcomes. It also provides in-depth analytics, a personalized experience for end users, and marketing automation tools, which play a significant role for marketing teams. The following are a few of the many features of Sitecore.

CMS based on the .NET Framework

Sitecore provides building components on ASP.NET Web Forms as well as ASP.NET Model-View-Controller (MVC) frameworks, so developers can choose either approach to match the required architecture. Sitecore provides web controls and sublayouts when working with ASP.NET Web Forms, and view renderings, controller renderings, models, and item renderings when working with the ASP.NET MVC framework. Sitecore also provides two frameworks for preparing user interface (UI) applications for Sitecore clients: Sheer UI and SPEAK. Sheer UI applications are prepared using Extensible Application Markup Language (XAML), and most of the Sitecore applications are prepared using Sheer UI. Sitecore Process Enablement and Accelerator Kit (SPEAK) is the latest framework for developing Sitecore applications with a consistent interface quickly and easily. SPEAK gives you a predefined set of page layouts and components.

Component-based architecture

Sitecore is built on a component-based architecture, which provides us with loosely coupled, independent components. The main advantage of these components is their reusability and loosely coupled, independent behaviour. It aims to provide reusability of components at the page level, site level, and Sitecore instance level to support multisite or multitenant sites. Components in Sitecore are built with the normal layered approach, where the components are split into layers such as presentation, business logic, data, and so on.
Sitecore provides different presentation components, including layouts, sublayouts, web control renderings, MVC renderings, and placeholders. Sitecore manages different components in logical groupings by their templates, layouts, sublayouts, renderings, devices, media, content items, and so on.

Layout engine

The Sitecore layout engine extends the ASP.NET web application server to merge content with presentation logic dynamically when web clients request resources. A layout can be a web form page (.aspx) or an MVC view (.cshtml) file. A layout can have multiple placeholders to place content at predefined places, where the controls are placed. Controls can be HTML markup controls, such as a sublayout (.ascx) file or MVC view (.cshtml) file, or other renderings, such as web controls or controller renderings, which can contain business logic. Once the request criteria, such as item, language, and device, are resolved by the layout engine, it creates a platform to render the different controls and assemble their output into the relevant placeholders on the layout. The layout engine provides both static and dynamic binding, so, with dynamic binding, we can have clean HTML markup and reusability of all the controls or components. Binding of controls, layouts, and devices can be applied on Sitecore content items themselves, as shown in the following screenshot:

Once the layout engine renders the page, you can see how the controls are bound to the layout, as shown in the following image:

The layout engine in Sitecore is responsible for layout rendering, device detection, the rule engine, and personalization.

Multilingual support

In Sitecore, content can be maintained in any number of languages. It provides easier integration with external translation providers for seamless translation and also supports the dynamic creation of multilingual web pages. Sitecore also supports the language fallback feature at the field, item, and template level, which makes life easier for content owners and developers. It also supports chained fallback.

Multi-device support

Devices represent different types of web clients that connect to the Internet and place HTTP requests. Each device represents a different type of web client, and each can have unique markup requirements. As we saw, the layout engine applies the presentation components specified for the context device to the layout details of the context item. In the same way, developers can use devices to format the context item output using different collections of presentation components for various types of web clients. Dynamically assembled content can be transformed to conform to virtually any output format, such as mobile, tablet, desktop, print, or RSS. Sitecore also supports the device fallback feature, so that any web page not supported for the requesting device can still be served through the fallback device. It also supports chained fallback for devices.

Multi-site capabilities

There are many ways to manage multiple sites on a single Sitecore installation. For example, you can host multiple regional domains with different regional languages as the default language for a single site. For example, http://www.sitecorecookbook.com will serve English content, http://www.sitecorecookbook.de will serve German content of the same website, and so on. Another way is to create multiple websites for different subsidiaries or franchises of a company.
In this approach, you can share some common resources across all the sites, such as templates, renderings, user interface elements, and other content or media items, but have unique content and pages so that each website has a separate existence in Sitecore. Sitecore has security capabilities so that each franchise or subsidiary can manage their own website independently without affecting other websites. Developers have full flexibility to re-architect Sitecore's multisite architecture as per business needs. Sitecore also supports a multitenant multisite architecture so that each website can work as an individual physical website.

Caching

Caching plays a very important role in website performance. Sitecore contains multiple levels of caching such as prefetch cache, data cache, item cache, and HTML cache. Apart from this, Sitecore maintains other caches such as the standard values cache, filtered item cache, registry cache, media cache, user cache, proxy cache, AccessResult cache, and so on. This makes understanding all the Sitecore caches really important. Sitecore caching is a very vast topic to cover; you can read more about it at http://sitecoreblog.patelyogesh.in/2013/06/how-sitecore-caching-work.html.

Configuration factory

Sitecore is configured using IIS's configuration file, Web.config. The Sitecore configuration factory allows you to configure pipelines, events, scheduling agents, commands, settings, properties, and configuration nodes in Web.config files, which can be defined in the /configuration/sitecore path. Configurations inside this path can be spread out between multiple files to make it scalable. This process is often called config patching. Instead of touching the Web.config file, Sitecore provides the Sitecore.config file in the App_Config\Include directory, which contains all the important Sitecore configurations. Functionality-specific configurations are split into a number of .config files, which you can find in its subdirectories. These .config files are merged into a single configuration file at runtime, which you can evaluate using http://<domain>/sitecore/admin/showconfig.aspx. Thus, developers create custom .config files in the App_Config\Include directory to introduce, override, or delete settings, properties, configuration nodes, and attributes without touching Sitecore's default .config files. This makes managing .config files very easy from development to deployment. You can learn more about file patching from https://sdn.sitecore.net/upload/sitecore6/60/include_file_patching_facilities_sc6orlater-a4.pdf.

Dependency injection in .NET has become very common nowadays. If you want to build generic and reusable functionality, you will most likely reach for an inversion of control (IoC) framework. Fortunately, Sitecore provides a solution that allows you to easily use different IoC frameworks between projects. Using patch files, Sitecore allows you to define objects that will be available at runtime. These nodes are defined under /configuration/sitecore and can be retrieved using the Sitecore API. We can define types, constructors, methods, properties, and their input parameters in logical nodes inside the nodes of pipelines, events, scheduling agents, and so on. You can find more examples at http://sitecore-community.github.io/docs/documentation/Sitecore%20Fundamentals/Sitecore%20Configuration%20Factory/.
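To illustrate the config patching just described, here is a minimal sketch of a custom patch file that could be dropped into App_Config\Include; the setting name and value are purely illustrative assumptions, not taken from the article:

<?xml version="1.0"?>
<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <settings>
      <!-- Hypothetical override: patches a single setting's value
           without touching Sitecore's default .config files -->
      <setting name="MaxWorkerThreads">
        <patch:attribute name="value">100</patch:attribute>
      </setting>
    </settings>
  </sitecore>
</configuration>

At runtime, a fragment like this is merged under /configuration/sitecore, and the result can be verified on the showconfig.aspx page mentioned above.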
Pipelines An operation to be performed in multiple steps can be carried out using the pipeline system, where each individual step is defined as a processor. Data processed from one processor is then carried to the next processor in arguments. The flow of the pipeline can be defined in XML format in the .config files. You can find default pipelines in the Sitecore.config file or patch file under the <pipelines> node (which are system processes) and the <processors> node (which are UI processes). The following image visualizes the pipeline and processors concept: Each processor in a pipeline contains a method named Process() that accepts a single argument, Sitecore.Pipelines.PipelineArgs, to get different argument values and returns void. A processor can abort the pipeline, preventing Sitecore from invoking subsequent processors. A page request traverses through different pipelines such as <preProcessRequest>, <httpRequestBegin>, <renderLayout>, <httpRequestEnd>, and so on. The <httpRequestBegin> pipeline is the heart of the Sitecore HTTP request execution process. It defines different processors to resolve the site, device, language, item, layout, and so on sequentially, which you can find in Sitecore.config as follows: <httpRequestBegin>   ...   <processor type="Sitecore.Pipelines.HttpRequest.SiteResolver,     Sitecore.Kernel"/>   <processor type="Sitecore.Pipelines.HttpRequest.UserResolver,     Sitecore.Kernel"/>   <processor type="     Sitecore.Pipelines.HttpRequest.DatabaseResolver,     Sitecore.Kernel"/>   <processor type="     Sitecore.Pipelines.HttpRequest.BeginDiagnostics,     Sitecore.Kernel"/>   <processor type="     Sitecore.Pipelines.HttpRequest.DeviceResolver,     Sitecore.Kernel"/>   <processor type="     Sitecore.Pipelines.HttpRequest.LanguageResolver,     Sitecore.Kernel"/>   ... </httpRequestBegin> There are more than a hundred pipelines, and the list goes on increasing after every new version release. Sitecore also allows us to create our own pipelines and processors. Background jobs When you need to do some long-running operations such as importing data from external services, sending e-mails to subscribers, resetting content item layout details, and so on, we can use Sitecore jobs, which are asynchronous operations in the backend that you can monitor in a foreground thread (Job Viewer) of Sitecore Rocks or by creating a custom Sitecore application. The jobs can be invoked from the user interface by users or can be scheduled. Sitecore provides APIs to invoke jobs with many different options available. You can simply create and start a job using the following code: public void Run() {   JobOptions options = new JobOptions("Job Name", "Job Category",     "Site Name", "current object", "Task Method to Invoke", new     object[] { rootItem })   {     EnableSecurity = true,     ContextUser = Sitecore.Context.User,     Priority = ThreadPriority.AboveNormal   };   JobManager.Start(options); } You can schedule tasks or jobs by creating scheduling agents in the Sitecore.config file. You can also set their execution frequency. 
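Before looking at a concrete agent configuration, here is a minimal sketch of what a custom pipeline processor might look like; the namespace, class name, and abort condition are hypothetical and only illustrate the Process() contract described in the Pipelines section above:

namespace Website.Pipelines
{
    using Sitecore.Pipelines;

    public class CheckMaintenanceMode
    {
        public void Process(PipelineArgs args)
        {
            // Placeholder for a real check (for example, a custom setting
            // or a flag item); the actual logic is up to the implementer.
            bool siteInMaintenance = false;

            if (siteInMaintenance)
            {
                // Aborting prevents Sitecore from invoking the
                // remaining processors in the pipeline.
                args.AbortPipeline();
            }
        }
    }
}

Such a processor would then be registered in the relevant pipeline through a configuration patch, as shown in the Configuration factory section.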
The following example shows you how Sitecore has configured PublishAgent, which publishes a site every 12 hours and simply executes the Run() method of the Sitecore.Tasks.PublishAgent class: <scheduling>   <agent type="Sitecore.Tasks.PublishAgent" method="Run"     interval="12:00:00">     <param desc="source database">master</param>     <param desc="target database">web</param>     <param desc="mode (full or smart or       incremental)">incremental</param>     <param desc="languages">en, da</param>   </agent> </scheduling> Apart from this, Sitecore also provides you with the facility to define scheduled tasks in the database, which has a great advantage of storing tasks in the database, so that we can handle its start and end date and time. We can use it once or make it recurring as well. Workflow and publishing Workflows are essential to the content author experience. Workflows ensure that items move through a predefined set of states before they become publishable. It is necessary to ensure that content receives the appropriate reviews and approvals before publication to the live website. Apart from workflow, Sitecore provides highly configurable security features, access permissions, and versioning. Sitecore also provides full workflow history like when and by whom the content was edited, reviewed, or approved. It also allows you to restrict publishing as well as identify when it is ready to be published. Publishing is an essential part of working in Sitecore. Every time you edit or create new content, you have to publish it to see it on your live website. When publishing happens, the item is copied from the master database to the web database. So, the content of the web database will be shown on the website. When multiple users are working on different content pages or media items, publishing restrictions and workflows play a vital role to make releases, embargoed, or go-live successful. There are three types of publishing available in Sitecore: Republish: This publishes every item even though items are already published. Smart Publish: Sitecore compares the internal revision identifier of the item in the master and web databases. If both identifiers are different, it means that the item is changed in the master database, hence Sitecore will publish the item or skip the item if identifiers are the same. Incremental Publish: Every modified item is added to the publish queue. Once incremental publishing is done, Sitecore will publish all the items found in the publish queue and clear it. Sitecore also supports the publishing of subitems as well as related items (such as publishing a content item will also publish related media items). Search Sitecore comes with out-of-the-box Lucene support. You can also switch your Sitecore search to Solr, which just needs to install Solr and enable Solr configurations already available. Sitecore by default indexes Sitecore content in Lucene index files. The Sitecore search engine lets you search through millions of items of the content tree quickly with the help of different types of queries with Lucene or Solr indexes. Sitecore provides you with the following functionalities for content search: We can search content items and documents such as PDF, Word, and so on. It allows you to search content items based on preconfigured fields. It provides APIs to create and search composite fields as per business needs. It provides content search APIs to sort, filter, and page search results. We can apply wildcards to search complex results and autosuggest. 
We can apply boosting to influence search results or elevate results by giving them more priority. We can create custom dictionaries and index files, using which we can offer "did you mean" suggestions to users. We can apply facets to refine search results, as we can see on e-commerce sites. We can apply different analyzers to find MoreLikeThis or similar results. We can tag content or media items to categorize them so that we can use features such as a tag cloud. It provides a scalable user interface to search content items and apply filters and operations to selected search results. It provides different indexing strategies to create transparent and diverse models for index maintenance. In short, Sitecore allows us to implement searching techniques similar to those available in Google and other search engines, which is especially valuable because content authors often struggle when working with a large number of items. You can read more about Sitecore search at https://doc.sitecore.net/sitecore_experience_platform/content_authoring/searching/searching.

Security model

Sitecore has the reputation of making it very easy to set up the security of users, roles, access rights, and so on. Sitecore follows the .NET security model, so we get all the basic features of .NET membership in Sitecore, which offers several advantages:

A variety of plug-and-play features provided directly by Microsoft
The option to replace or extend the default configuration with custom providers
It is also possible to store the accounts in different storage areas using several providers simultaneously
Sitecore provides item-level and field-level rights and an option to create custom rights as well
Dynamic user profile structure and role management is possible just through the user interface, which is simpler and easier compared to pure ASP.NET solutions
It provides easier implementation for integration with external systems
Even after having an extended wrapper on the .NET solution, we get the same performance as a pure ASP.NET solution

Experience analytics and personalization

Sitecore contains the state-of-the-art Analysis, Insights, Decisions, Automation (AIDA) framework, which is the heart of its marketing programs. It provides comprehensive analytics data and reports, insights from every website interaction with rules, behavior-based personalization, and marketing automation. Sitecore collects all the visitor interactions in a real-time, big data repository—the Experience Database (xDB)—to increase the availability, scalability, and performance of the website. The Sitecore Marketing Foundation provides the following features:

Sitecore uses MongoDB, a big marketing data repository that collects all customer interactions. It provides real-time data to marketers to automate interactions across all channels.
It provides a unified 360-degree view of individual website visitors and in-depth analytics reports.
It provides fundamental analytics measurement components such as goals and events to evaluate the effectiveness of online business and marketing campaigns.
It provides comprehensive conditions and actions to achieve conditional and behavioral or predictive personalization, which helps show customers what they are looking for instead of forcing them to see what we want to show.
Sitecore collects, evaluates, and processes omnichannel visitor behavioral patterns, which helps in planning more effective marketing campaigns and improving the user experience.
Sitecore provides an engagement plan to control how your website interacts with visitors.
It helps nurture relationships with your visitors by adapting personalized communication based on the engagement state they are in.
Sitecore provides an in-depth geolocation service, helpful in optimizing campaigns through segmentation, personalization, and profiling strategies.
The Sitecore Device Detection service is helpful in personalizing the user experience or promotions based on the device visitors use.
It provides different dimensions and reports to reflect data on the full taxonomy provided in the Marketing Control Panel.
It provides different charting controls to get smart reports.
It gives developers full flexibility to customize or extend all these features.

High performance and scalability

Sitecore supports heavy content management and content delivery usage with a large volume of data. Sitecore is architected for high performance and unlimited scalability. The Sitecore cache engine provides caching on the raw data as well as rendered output data, which gives a high-performance platform. Sitecore uses the event queue concept for scalability. Theoretically, it makes Sitecore scalable to any number of instances under a load balancer.

Summary

In this article, we discussed the importance of Sitecore and its key features. We also saw that Sitecore XP is not only an enterprise-level CMS, but also a web platform, which is the global leader in experience management.

Resources for Article:

Further resources on this subject:
Building a Recommendation Engine with Spark [article]
Configuring a MySQL linked server on SQL Server 2008 [article]
Features and utilities in SQL Developer Data Modeler [article]

Getting Started with Apache Hadoop and Apache Spark

Packt
22 Apr 2016
12 min read
In this article by Venkat Ankam, author of the book Big Data Analytics with Spark and Hadoop, we will understand the features of Hadoop and Spark and how we can combine them. (For more resources related to this topic, see here.) This article is divided into the following subtopics: Introducing Apache Spark and Why Hadoop + Spark?

Introducing Apache Spark

Hadoop and MapReduce have been around for 10 years and have proven to be the best solution to process massive data with high performance. However, MapReduce lacked performance in iterative computing, where the output between multiple MapReduce jobs had to be written to Hadoop Distributed File System (HDFS). In a single MapReduce job as well, it lacked performance because of the drawbacks of the MapReduce framework.

Let's take a look at the history of computing trends to understand how computing paradigms have changed over the last two decades. The trend was to reference the URI when the network was cheaper (in 1990), replicate when storage became cheaper (in 2000), and recompute when memory became cheaper (in 2010), as shown in Figure 1 (Trends of computing). So, what really changed over a period of time? Over a period of time, tape is dead, disk has become tape, and SSD has almost become disk. Now, caching data in RAM is the current trend.

Let's understand why memory-based computing is important and how it provides significant performance benefits. Figure 2 (Why memory?) indicates the data transfer rates from various mediums to the CPU. Disk to CPU is 100 MB/s, SSD to CPU is 600 MB/s, and over a network to CPU is 1 MB to 1 GB/s. However, the RAM to CPU transfer speed is astonishingly fast, at 10 GB/s. So, the idea is to cache all or part of the data in memory so that higher performance can be achieved.

Spark history

Spark started in 2009 as a research project in the UC Berkeley RAD Lab, which later became AMPLab. The researchers in the lab had previously been working on Hadoop MapReduce and observed that MapReduce was inefficient for iterative and interactive computing jobs. Thus, from the beginning, Spark was designed to be fast for interactive queries and iterative algorithms, bringing in ideas such as support for in-memory storage and efficient fault recovery. In 2011, AMPLab started to develop high-level components in Spark, such as Shark and Spark Streaming. These components are sometimes referred to as the Berkeley Data Analytics Stack (BDAS). Spark was first open sourced in March 2010 and transferred to the Apache Software Foundation in June 2013; in February 2014, it became a top-level Apache project. Spark has since become one of the largest open source communities in big data. Now, over 250+ contributors in 50+ organizations are contributing to Spark development, and the user base has increased tremendously, from small companies to Fortune 500 companies. Figure 3 shows the history of Apache Spark.

What is Apache Spark?

Let's understand what Apache Spark is and what makes it a force to reckon with in big data analytics: Apache Spark is a fast, enterprise-grade, large-scale data processing engine, which is interoperable with Apache Hadoop. It is written in Scala, which is both an object-oriented and functional programming language that runs in a JVM. Spark enables applications to distribute data in-memory reliably during processing.
This is the key to Spark's performance as it allows applications to avoid expensive disk access and performs computations at memory speeds. It is suitable for iterative algorithms by having every iteration access data through memory. Spark programs perform 100 times faster than MapReduce in-memory or 10 times faster on the disk (http://spark.apache.org/). It provides native support for Java, Scala, Python, and R languages with interactive shells for Scala, Python, and R. Applications can be developed easily and often 2 to 10 times less code is needed. Spark powers a stack of libraries including Spark SQL and DataFrames for interactive analytics, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for real-time analytics. You can combine these features seamlessly in the same application. Spark runs on Hadoop, Mesos, standalone resource managers, on-premise hardware, or in the cloud. What Apache Spark is not Hadoop provides us with HDFS for storage and MapReduce for compute. However, Spark does not provide any specific storage medium. Spark is mainly a compute engine, but you can store data in-memory or on Tachyon to process it. Spark has the ability to create distributed datasets from any file stored in the HDFS or other storage systems supported by Hadoop APIs (including your local filesystem, Amazon S3, Cassandra, Hive, HBase, Elasticsearch, and so on). It's important to note that Spark is not Hadoop and does not require Hadoop to run. It simply has support for storage systems implementing Hadoop APIs. Spark supports text files, SequenceFiles, Avro, Parquet, and any other Hadoop InputFormat. Can Spark replace Hadoop? Spark is designed to interoperate with Hadoop. It's not a replacement for Hadoop but for the MapReduce framework on Hadoop. All Hadoop processing frameworks (Sqoop, Hive, Pig, Mahout, Cascading, Crunch, and so on) using MapReduce as the engine now use Spark as an additional processing engine. MapReduce issues MapReduce developers faced challenges with respect to performance and converting every business problem to a MapReduce problem. Let's understand the issues related to MapReduce and how they are addressed in Apache Spark: MapReduce (MR) creates separate JVMs for every Mapper and Reducer. Launching JVMs takes time. MR code requires a significant amount of boilerplate coding. The programmer needs to think and design every business problem in terms of Map and Reduce, which makes it a very difficult program. One MR job can rarely do a full computation. You need multiple MR jobs to finish the complete task and the programmer needs to design and keep track of optimizations at all levels. An MR job writes the data to the disk between each job and hence is not suitable for iterative processing. A higher level of abstraction, such as Cascading and Scalding, provides better programming of MR jobs, but it does not provide any additional performance benefits. MR does not provide great APIs either. MapReduce is slow because every job in a MapReduce job flow stores data on the disk. Multiple queries on the same dataset will read the data separately and create a high disk I/O, as shown in Figure 4: Figure 4: MapReduce versus Apache Spark Spark takes the concept of MapReduce to the next level to store the intermediate data in-memory and reuse it, as needed, multiple times. This provides high performance at memory speeds, as shown in Figure 4. If I have only one MapReduce job, does it perform the same as Spark? 
No, the performance of the Spark job is superior to the MapReduce job because of in-memory computations and shuffle improvements. The performance of Spark is superior to MapReduce even when the memory cache is disabled. A new shuffle implementation (sort-based shuffle instead of hash-based shuffle), a new network module (based on netty instead of using block manager to send shuffle data), and a new external shuffle service make Spark perform the fastest petabyte sort (on 190 nodes with 46TB RAM) and terabyte sort. Spark sorted 100 TB of data using 206 EC2 i2.8x large machines in 23 minutes. The previous world record was 72 minutes, set by a Hadoop MapReduce cluster of 2,100 nodes. This means that Spark sorted the same data 3x faster using 10x less machines. All the sorting took place on the disk (HDFS) without using Spark's in-memory cache (https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html). To summarize, here are the differences between MapReduce and Spark: MapReduce Spark Ease of use Not easy to code and use Spark provides a good API and is easy to code and use Performance Performance is relatively poor when compared with Spark In-memory performance Iterative processing Every MR job writes the data to the disk and the next iteration reads from the disk Spark caches data in-memory Fault Tolerance Its achieved by replicating the data in HDFS Spark achieves fault tolerance by resilient distributed dataset (RDD) lineage Runtime Architecture Every Mapper and Reducer runs in a separate JVM Tasks are run in a preallocated executor JVM Shuffle Stores data on the disk Stores data in-memory and on the disk Operations Map and Reduce Map, Reduce, Join, Cogroup, and many more Execution Model Batch Batch, Interactive, and Streaming Natively supported Programming Languages Java Java, Scala, Python, and R Spark's stack Spark's stack components are Spark Core, Spark SQL and DataFrames, Spark Streaming, MLlib, and Graphx, as shown in Figure 5: Figure 5: The Apache Spark ecosystem Here is a comparison of Spark components versus Hadoop components: Spark Hadoop Spark Core MapReduce Apache Tez Spark SQL and DataFrames Apache Hive Impala Apache Tez Apache Drill Spark Streaming Apache Storm Spark MLlib Apache Mahout Spark GraphX Apache Giraph To understand the framework at a higher level, let's take a look at these core components of Spark and their integrations: Feature Details Programming languages Java, Scala, Python, and R. Scala, Python, and R shell for quick development. Core execution engine Spark Core: Spark Core is the underlying general execution engine for the Spark platform and all the other functionality is built on top of it. It provides Java, Scala, Python, and R APIs for the ease of development. Tungsten: This provides memory management and binary processing, cache-aware computation and code generation. Frameworks Spark SQL and DataFrames: Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine. Spark Streaming: Spark Streaming enables us to build scalable and fault-tolerant streaming applications. It integrates with a wide variety of data sources, including filesystems, HDFS, Flume, Kafka, and Twitter. MLlib: MLlib is a machine learning library to create data products or extract deep meaning from the data. MLlib provides a high performance because of in-memory caching of data. Graphx: GraphX is a graph computation engine with graph algorithms to build graph applications. 
Off-heap storage Tachyon: This provides reliable data sharing at memory speed within and across cluster frameworks/jobs. Spark's default OFF_HEAP (experimental) storage is Tachyon. Cluster resource managers Standalone: By default, applications are submitted to the standalone mode cluster and each application will try to use all the available nodes and resources. YARN: YARN controls the resource allocation and provides dynamic resource allocation capabilities. Mesos: Mesos has two modes, Coarse-grained and Fine-grained. The coarse-grained approach has a static number of resources just like the standalone resource manager. The fine-grained approach has dynamic resource allocation just like YARN. Storage HDFS, S3, and other filesystems with the support of Hadoop InputFormat. Database integrations HBase, Cassandra, Mongo DB, Neo4J, and RDBMS databases. Integrations with streaming sources Flume, Kafka and Kinesis, Twitter, Zero MQ, and File Streams. Packages http://spark-packages.org/ provides a list of third-party data source APIs and packages. Distributions Distributions from Cloudera, Hortonworks, MapR, and DataStax. The Spark ecosystem is a unified stack that provides you with the power of combining SQL, streaming, and machine learning in one program. The advantages of unification are as follows: No need of copying or ETL of data between systems Combines processing types in one program Code reuse One system to learn One system to maintain An example of unification is shown in Figure 6: Figure 6: Unification of the Apache Spark ecosystem Why Hadoop + Spark? Apache Spark shines better when it is combined with Hadoop. To understand this, let's take a look at Hadoop and Spark features. Hadoop features The Hadoop features are described as follows: Feature Details Unlimited scalability Stores unlimited data by scaling out HDFS Effectively manages the cluster resources with YARN Runs multiple applications along with Spark Thousands of simultaneous users Enterprise grade Provides security with Kerberos authentication and ACLs authorization Data encryption High reliability and integrity Multitenancy Wide range of applications Files: Strucutured, semi-structured, or unstructured Streaming sources: Flume and Kafka Databases: Any RDBMS and NoSQL database Spark features The Spark features are described as follows: Feature Details Easy development No boilerplate coding Multiple native APIs: Java, Scala, Python, and R REPL for Scala, Python, and R In-memory performance RDDs Direct Acyclic Graph (DAG) to unify processing Unification Batch, SQL, machine learning, streaming, and graph processing When both frameworks are combined, we get the power of enterprise-grade applications with in-memory performance, as shown in Figure 7: Figure 7: Spark applications on the Hadoop platform Frequently asked questions about Spark The following are the frequent questions that practitioners raise about Spark: My dataset does not fit in-memory. How can I use Spark? Spark's operators spill data to the disk if it does not fit in-memory, allowing it to run well on data of any size. Likewise, cached datasets that do not fit in-memory are either spilled to the disk or recomputed on the fly when needed, as determined by the RDD's storage level. By default, Spark will recompute the partitions that don't fit in-memory. The storage level can be changed to MEMORY_AND_DISK to spill partitions to the disk. 
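As a rough illustration of changing the storage level just mentioned, here is a minimal Scala sketch; the input path and filter condition are made-up assumptions, not part of the original example:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object StorageLevelDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("StorageLevelDemo"))

    // Hypothetical input; any Hadoop-supported storage system would work here
    val errors = sc.textFile("hdfs:///logs/app.log").filter(_.contains("ERROR"))

    // Partitions that do not fit in memory are spilled to local disk
    // instead of being recomputed, as with the default MEMORY_ONLY level
    errors.persist(StorageLevel.MEMORY_AND_DISK)

    println("Error lines: " + errors.count())
    sc.stop()
  }
}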
Figure 8 shows the performance difference between fully cached and on-disk data (Figure 8: Spark performance: fully cached versus on the disk).

How does fault recovery work in Spark? Spark's in-built fault tolerance, based on RDD lineage, will automatically recover from failures. Figure 9 shows the performance over a failure in the 6th iteration of a k-means algorithm (Figure 9: Fault recovery performance).

Summary

In this article, we saw an introduction to Apache Spark and the features of Hadoop and Spark, and discussed how we can combine them.

Resources for Article:

Further resources on this subject:
Adding a Spark to R [article]
Big Data Analytics [article]
Big Data Analysis (R and Hadoop) [article]
Introducing Dynamics CRM

Packt
21 Apr 2016
13 min read
In this article by Nicolae Tarla, the author of Microsoft Dynamics CRM 2016 Customization, you will learn about the Customer Relationship Management (CRM) market and the huge uptake it has seen in the last few years. Some of the drivers for this market are the need to enhance customer experience, provide faster and better services, and adapting to the customer’s growing digital presence. CRM systems, in general, are taking a central place in the new organizational initiatives. (For more resources related to this topic, see here.) Dynamics CRM is Microsoft’s response to a growing trend. The newest version is Dynamics CRM 2016. It is being offered in a variety of deployment scenarios. From the standard on-premise deployment to a private cloud or an online cloud offering from Microsoft, the choice depends on each customer, their type of project, and a large number of requirements, policies, and legal restrictions. We’ll first look at what environment we need to complete the examples presented. We will create a new environment based on a Microsoft Dynamics CRM Online trial. This approach will give us 30-day trial to experiment with an environment for free. The following topics will be covered: Introducing Dynamics CRM Dynamics CRM features Deployment models Global datacenter locations Customization requirements Getting setup Dynamics CRM 2016 is the current version of the popular Customer Relationship Management platform offered by Microsoft. This platform offers users the ability to integrate and connect data across their sales, marketing, and customer service activities, and to give staff an overall 360-degree view of all interactions and activities as they relate to a specific customer. Along with the standard platform functionality provided, we have a wide range of customization options, allowing us to extend and further customize solutions to solve a majority of other business requirements. In addition, we can integrate this platform with other applications, and create a seamless solution. While by no means the only available CRM platform on the market today, Microsoft Dynamics CRM 2016, is one of the fastest growing, gaining large acceptance at all levels from small to mid-size and enterprise level organizations. This is due to a multitude of reasons, some of which include the variety of deployment options, the scalability, the extensibility, the ease of integration with other systems, and the ease of use. Microsoft Dynamics CRM can be deployed in a variety of options. Starting with the offering from Microsoft, you can get CRM Online. Once we have a 30-day trial active, this can be easily turned into a full production environment by providing payment information and keeping the environment active. The data will live in the cloud, on one of the data centers provided by Microsoft. Alternatively, you can obtain hosting with a third-party provider. The whole environment can be hosted by a third party, and the service can be offered either as a SaaS solution, or a fully hosted environment. Usually, there is a difference in the way payment is processed, with a SaaS solution in most cases being offered in a monthly subscription model. Another option is to have the environment hosted in-house. This option is called on premise deployment and carries the highest up-front cost but gives you the ability to customize the system extensively. 
In addition to the higher up-front cost, the cost to maintain the environment, the hardware, and skilled people required to constantly administer the environment can easily add-up. As of recently, we now have the ability to host a virtual CRM environment in Azure. This offloads the cost of maintaining the local infrastructure in a fashion similar to a third-party-hosted solution but takes advantage of the scalability and performance of a large cloud solution maintained and supported fully by Microsoft. The following white paper released by Microsoft describes the deployment model using Azure Virtual Machines: http://www.microsoft.com/en-us/download/details.aspx?id=49193 Features of Dynamics CRM Some of the most notable features of the Dynamics CRM platform include: Scalability Extensibility Ability to integrate with other systems Ease of use Let’s look at each of the features in more detail. Scalability Dynamics CRM can scale over a wide range of deployment options. From a single box deployment, used mostly for development, all the way to a cloud offering that can span over a large number of servers, and host a large number of environments, the same base solution can handle all the scenarios in between with ease. Extensibility Dynamics CRM is a platform in which the base offering comes with prepackaged functionality for Sales, Service, and Marketing; a large variety of solutions can be built on top of Dynamics CRM. The extensibility model is called xRM and allows power users, non-developers, and developers alike to build custom solutions to handle various other business scenarios or integrate with other third-party platforms. The Dynamics CRM Marketplace is a great example of such solutions that are built to extend the core platform, and are offered for sale by various companies. These companies are called Independent Software Vendors (ISVs) and play a very important role in the ecosystem created by Microsoft. In time and with enough experience, some of them become the go-to partners for various implementations. If nothing else, the Dynamics Marketplace is a cool place to look at some of the solutions created, and search for specific applications. The idea of the marketplace became public sometime around 2010 and was integrated into Dynamics CRM 2011. At launch, it was designed as a searchable repository of solutions. It is a win-win for both solution providers and customers alike. Solutions can also be rated, thus giving customers better community feedback before committing to purchasing and implementing a foreign solution into their organization. The Dynamics Marketplace is hosted on Pinpoint, Microsoft’s online directory of software applications and professional services. On this platform, independent companies and certified partners offer their products and services. At the time of this writing, Pinpoint hosts a few marketplaces, including Office, Azure, Dynamics, and Cloud, and is available at the following location: https://pinpoint.microsoft.com/en-CA/Home/Dynamics Navigating to the Dynamics page you are presented with a search option as seen in the following screenshot: You now have the option to filter your results by Solution providers, Services, or Apps (Applications). 
In addition, you can further filter your results by distance to a geo-location derived from an address or postal code, as well as other categories as illustrated in the following screenshot: When searching for a solution provider, the results provide a high-level view of the organization, with a logo and a high-level description. The Ratings and Competencies count are displayed for easy visibility as shown here: Drilling down into the partner profile page, you can find additional details on the organization, the industries focus, details on the competencies, as well as a way to connect with the organization. Navigation to additional details, including Reviews and Locations is available on the profile page. The Dynamics Marketplace is also available, starting with Dynamics CRM 2011, as a part of the organization. A user with necessary permission can navigate to Settings | Dynamics Marketplace. This presents the user with a view by solutions available. Options for sorting and filtering include Popular, Newest, and Featured. Community rating is clearly visible and provides the necessary feedback to consider when evaluating new solutions. Ability to integrate with other systems There is a large variety of integration options available when working with Dynamics CRM. In addition, various deployment options offer more or fewer integration features. With CRM Online, you tend to get more integration options into cloud services; whereas, the on-premise solution has a limited number of configurable integration options, but can provide more integration using various third-party tools. The base solution comes with the ability to configure integration with the following common services: SharePoint for document management Yammer for social features In addition, you can use specific connectors provided by either Microsoft or other third-party providers for integration with specific solutions. When the preceding options are not available, you can still integrate with other solutions using a third-party integration tool. This allows real-time integration into legacy systems. Some of the most popular tools used for integration include, but are not limited to: Kingsway Software (https://www.kingswaysoft.com/) Scribe (http://www.scribesoft.com/) BizTalk (http://www.microsoft.com/en-us/server-cloud/products/biztalk/) Ease of use Dynamics CRM offers users a variety of options to interact with the system. You can access Dynamics CRM either through a browser, with support for all recent versions of the major browsers. The following browsers and versions are supported: Internet Explorer—versions 10 and above Edge—latest version Chrome—latest version on Windows 7 and above Firefox—latest version on Windows 7 and above Safari on Mac—using the latest publicly released version on OS x 10.8 and above In addition, a user can interact with the system directly from the very familiar interface of Outlook. The Dynamics CRM connector for Outlook allows users to get access to all the system data and features from within Outlook. In addition, a set of functions built specifically for Outlook allows users to track and interact with e-mails, tasks, and events from within Outlook. Further to the features provided through the Outlook integration, users of CRM for Outlook have the ability to work offline. Data can be taken offline, work can be done when disconnected, and can be synchronized back into the system when connectivity resumes. For mobile users, Dynamics CRM can be accessed from mobile devices and tablets. 
Dynamics CRM provides a standard web-based interface for most mobile devices, as well as specific applications for various platforms including Windows-based tablets, iPads, and Android tablets. With these apps, you can also take a limited sub-set of cached data offline, as well as have the ability to create new records and synchronize them back to CRM next time you go online. The quality of these mobile offerings has increased exponentially over the last few versions, and new features are being added with each new release. In addition, third-party providers have also built mobile solutions for Dynamics CRM. A quick search in the application markets for each platform will reveal several options for each platform. Global Data Centre Locations for Dynamics CRM Online Dynamics CRM Online is hosted at various locations in the world. Preview organizations can be created in all available locations, but features are sometimes rolled out on a schedule, in some locations faster than others. The format of the Dynamics CRM Online Organization URL describes the data center location. As such, the standard format is as follows: https://OrganizationName.crm[x].dynamics.com The OrganizationName is the name you have selected for your online organization. This is customizable, and is validated for uniqueness within the respective data center. The [x] represents a number. As of this writing, this number can be anywhere between 2, 4, 5, 6, 7, 9, or no number at all. This describes the global data center used to host your organization. The following table maps the data center to the URL format: URL Format: crm[x].dynamics.com Global Data Centre Location crm.dynamics.com NAM crm2.dynamics.com SAM crm4.dynamics.com EMEA crm5.dynamics.com APAC crm6.dynamics.com OCE crm7.dynamics.com JPN crm9.dynamics.com GCC Out of these global locations, usually the following get a preview of the new features first: Organization Global Location crm.dynamics.com North America crm4.dynamics.com Europe, the Middle East and Africa crm5.dynamics.com Asia-Pacific New data centers are being added on a regular basis. As of this writing, new data centers are being added in Europe and Canada, with others to follow as needed. Some of the drivers behind adding these new data centers revolve around not only performance improvements, as a data center located closer to a customer will provide theoretically better performance, but also a need for privacy and localization of data. Strict legislation around data residency has a great impact on the selection of the deployment model by customers who are bound to store all data local to the country of the operation. Overall, by the end of 2016, the plan is to have Dynamics CRM Online available in 105 markets. These markets (countries) will be served by data centers spread across five generic global regions. These data centers share services between Dynamics CRM Online and other services such as Azure and Office 365. Advantages of choosing Dynamics CRM online Choosing one of the available hosting models for Dynamics CRM is now not only a matter of preference. The decision can be driven by multiple factors. During the last few years, there has been a huge push for the cloud. Microsoft has been very focused on enhancing their online offering, and has continued to push more functionality and more resources in supporting the cloud model. As such, Dynamics CRM Online has become a force to reckon with. It is hosted on a very modern and high performing infrastructure. 
Microsoft has pushed literally billions of dollars in new data centers and infrastructure. This allows new customers to forego the necessary expenses on infrastructure associated with an on-premise deployment. Along with investments on infrastructure, the SLA (service level agreement) offered by Dynamics CRM Online is financially backed by Microsoft. Depending on the service selected, the uptime is guaranteed and backed financially. Application and Infrastructure are automatically handled for you by Microsoft so you don’t have to. This translates in much lower upfront costs, as well as reduced costs around ongoing maintenance and upgrades. The Dynamics CRM Online offering is also compliant with various regulatory requirements, and backed and verified through various third-party tests. Various rules, regulations, and policies in various locales are validated and certified by various organizations. Some of the various compliance policies evaluated include but are not limited to: Data Privacy and Confidentiality Policies Data Classification Information Security Privacy Data Stewardship Secure Infrastructure Identity and Access Control All these compliance requirements are in conformance with regulations stipulated by the International Standard Organization and other international and local standards. Independent auditors validate standards compliance. Microsoft is ISO 27001 certified. The Microsoft Trust Center website located at http://www.microsoft.com/en-us/trustcenter/CloudServices/Dynamics provides additional information on compliance, responsibilities, and warranties. Further to the aforementioned benefits, choosing cloud over a standard on-premise deployment offers other advantages around scalability, faster time to market, and higher value proposition. In addition to the standard benefits of an online deployment, one other great advantage is the ability to spin-up a 30-day trial instance of Dynamics CRM Online and convert it to a paid instance only when ready to go to production. This allows customizers and companies to get started and customize their solution in a free environment, with no additional costs attached. The 30-day trial instance gives us a 25-license instance, which allows us to not only customize the organization, but also test various roles and restrictions. Summary We learned to create a new environment based on a Microsoft Dynamics CRM Online trial Resources for Article: Further resources on this subject: Customization in Microsoft Dynamics CRM[article] Introduction to Reporting in Microsoft Dynamics CRM[article] Using Processes in Microsoft Dynamics CRM 2011[article]
Reusability patterns

Packt
21 Apr 2016
17 min read
In this article by Jaime Soriano Pastor and Alessandro Franceschi, the authors of Extending Puppet - Second Edition, you will learn that module reusability is a topic that has received more and more attention in the past few years; as more people started to use Puppet, the need for common, shared code to manage common things became more evident. The main characteristics of reusable modules are:

They can be used by different people without the need to modify their content
They support different OSes, and allow easy extension to new ones
They allow users to override the default files provided by the module
They might have an opinionated approach to the managed resources but don't force it
They follow a single responsibility principle and should manage only the application they are made for

Reusability, we must underline, is not an all-or-nothing feature; we might have different levels of reusability to fulfill the needs of a varying percentage of users. For example, a module might support Red Hat and Debian derivatives, but not Solaris or AIX; is it reusable? If we use the latter OSes, definitely not; if we don't use them, then yes, for us it is reusable. I am personally a bit extreme about reusability, and in my opinion a module should also:

Allow users to provide alternative classes for eventual dependencies from other modules, to ease interoperability
Allow any kind of treatment of the managed configuration files, be that file- or setting-based
Allow alternative installation methods
Allow users to provide their own classes for users or other resources, which could be managed in custom and alternative ways
Allow users to modify the default settings (calculated inside the module according to the underlying OS) for package and service names, file paths, and other more or less internal variables that are not always exposed as parameters
Expose parameters that allow removal of the resources provided by the module (this is a functionality feature more than a reusability one)
Abstract monitoring and firewalling features so that they are not directly tied to specific modules or applications

Managing files

Everything is a file in UNIX, and most of the time Puppet manages files. A module can expose parameters that allow its users to manipulate configuration files, and it can follow one or both of the file/setting approaches, as they are not mutually exclusive and can coexist. To manage the contents of a file, Puppet provides different alternative solutions:

Use templates, populated with variables that come from parameters, facts, or any scope (as the content argument for the file type: content => template('modulename/path/templatefile.erb'))
Use static files, served by the Puppet server
Manage the file content via concat (https://github.com/puppetlabs/puppetlabs-concat), a module that provides resources to build a file by joining different fragments
Manage the file contents via augeas, a native type that interfaces with the Augeas configuration editing tool (http://augeas.net/) Manage the contents with alternative in-file line editing tools For the first two cases, we can expose parameters that allow to define the module's main configuration file either directly via the source and content arguments, or by specifying the name of the template to be passed to the template() function: class redis (   $config_file           = $redis::params::file,   $config_file_source    = undef,   $config_file_template  = undef,   $config_file_content   = undef,   ) { Manage the configuration file arguments with: $managed_config_file_content = $config_file_content ? {     undef   => $config_file_template ? {       undef   => undef,       default => template($config_file_template),     },     default => $config_file_content,   } The $managed_config_file_content variable computed here takes the value of the $config_file_content, if present; otherwise, it uses the template defined with $config_file_template. If also this parameter is unset, the value is undef: if $redis::config_file {     file { 'redis.conf':       path    => $redis::config_file,       source  => $redis::config_file_source,       content => $redis::managed_config_file_content,     }   } } In this way, users can populate redis.conf via a custom template (placed in the site module): class { 'redis':   config_file_template => 'site/redis/redis.conf.erb', } Otherwise, they can also provide the content attribute directly: class { 'redis':   config_file_content => template('site/redis/redis.conf.erb'), } Finally, they can also provide a fileserver source path: class { 'redis':   config_file_source => 'puppet:///modules/site/redis/redis.conf', } In case users prefer to manage the file in other ways (augeas, concat, and so on), they can just include the main class, which, by default, does not manage the configuration file's contents and uses whatever method to alter them: class { 'redis': } A good module could also provide custom defines that allow easy and direct ways to alter configuration files' single lines, either using Augeas or other in-file line management tools. Managing configuration hash patterns If we want a full infrastructure as data setup and be able to manage all our configuration settings as data, we can follow two approaches, regarding the number, name, and kind of parameters to expose: Provide a parameter for each configuration entry we want to manage Provide a single parameter that expects a hash where any configuration entry may be defined The first approach requires a substantial and ongoing effort, as we have to keep our module's classes updated with all the current and future configuration settings our application may have. Its benefit is that it allows us to manage them as plain and easily readable data on, for example, Hiera YAML files. Such an approach is followed, for example, by the OpenStack modules (https://github.com/stackforge) where the configurations of the single components of OpenStack are managed on a settings-based approach, which is fed by the parameters of various classes and subclasses. For example, the Nova module (https://github.com/stackforge/puppet-nova) has many subclasses where the parameters that map to Nova's configuration entries are exposed and are applied via the nova_config native type, which is a basically a line editing tool that works line by line. 
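To make the first approach (one class parameter per configuration entry) more concrete, here is a minimal sketch under stated assumptions: the myapp module, its configuration file path, and its settings are hypothetical, and the ini_setting type is assumed to come from the puppetlabs-inifile module:

class myapp::config (
  $port         = '8080',
  $bind_address = '0.0.0.0',
) {

  # Each class parameter maps to exactly one setting in the
  # application's configuration file
  ini_setting { 'myapp port':
    ensure  => present,
    path    => '/etc/myapp/myapp.conf',
    section => 'main',
    setting => 'port',
    value   => $port,
  }

  ini_setting { 'myapp bind_address':
    ensure  => present,
    path    => '/etc/myapp/myapp.conf',
    section => 'main',
    setting => 'bind_address',
    value   => $bind_address,
  }
}

The benefit is that every setting remains plain, typed data that can be looked up from Hiera; the cost is that the class has to be kept in sync with the application's configuration options.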
An alternative and quicker approach is to just define a single parameter, like config_file_options_hash that accepts any settings as a hash: class openssh (   $config_file_options_hash   = { }, } Then, manage in a custom template the hash, either via a custom function, like the hash_lookup() provided by the stdmod shared module (https://github.com/stdmod/stdmod): # File Managed by Puppet [...]   Port <%= scope.function_hash_lookup(['Port','22']) %>   PermitRootLogin <%= scope.function_hash_lookup(['PermitRootLogin','yes']) %>   UsePAM <%= scope.function_hash_lookup(['UsePAM','yes']) %> [...] Otherwise, refer directly to a specific key of the config_file_options_hash parameter:  Port <%= scope.lookupvar('openssh::config_file_options_hash')['Port'] ||= '22' %>   PermitRootLogin <%= scope.lookupvar('openssh::config_file_options_hash')['PermitRootLogin'] ||= 'yes' %>   UsePAM <%= scope.lookupvar('openssh::config_file_options_hash')['UsePAM'] ||= 'yes' %> [...] Needless to say that Hiera is a good place to define these parameters; on a YAML-based backend, we can set these parameters with:  --- openssh::config_file_template: 'site/openssh/sshd_config.erb' 

openssh::config_file_options_hash:   Port: '22222'   PermitRootLogin: 'no' Otherwise, if we prefer to use an explicit parameterized class declaration: class { 'openssh':   config_file_template     => 'site/openssh/sshd_config.erb'   config_file_options_hash => {     Port            => '22222',     PermitRootLogin => 'no',   } } Managing multiple configuration files An application may have different configuration files and our module should provide ways to manage them. In these cases, we may have various options to implement in a reusable module: Expose parameters that let us configure the whole configuration directory Expose parameters that let us configure specific extra files Provide a general purpose define that eases management of configuration files To manage the whole configuration directory these parameters should be enough: class redis (   $config_dir_path            = $redis::params::config_dir,   $config_dir_source          = undef,   $config_dir_purge           = false,   $config_dir_recurse         = true,   ) {   $config_dir_ensure = $ensure ? {     'absent'  => 'absent',     'present' => 'directory',   }   if $redis::config_dir_source {     file { 'redis.dir':       ensure  => $redis::config_dir_ensure,       path    => $redis::config_dir_path,       source  => $redis::config_dir_source,       recurse => $redis::config_dir_recurse,       purge   => $redis::config_dir_purge,       force   => $redis::config_dir_purge,     }   } } Such a code would allow providing a custom location, on the Puppet Master, to use as source for the whole configuration directory:  class { 'redis':   config_dir_source => 'puppet:///modules/site/redis/conf/', } Provide a custom source for the whole config_dir_path and purge any unmanaged config file; all the destination files not present on the source directory would be deleted. Use this option only when we want to have complete control on the contents of a directory:  class { 'redis':   config_dir_source => [                   "puppet:///modules/site/redis/conf--${::fqdn}/", 
                  "puppet:///modules/site/redis/conf-${::role}/",                   'puppet:///modules/site/redis/conf/' ],   config_dir_purge  => true, } Consider that the source files, in this example, placed in the site module according to a naming hierarchy that allows overrides per node or role name, can only be static and not templates. If we want to provide parameters that allow direct management of alternative extra files, we can add parameters such as the following (stdmod compliant): class postgresql (   $hba_file_path             = $postgresql::params::hba_file_path,   $hba_file_template         = undef,   $hba_file_content          = undef,   $hba_file_options_hash     = { } ,   ) { […] } Finally, we can place in our module a general purpose define that allows users to provide the content for any file in the configuration directory. Here is an example https://github.com/example42/puppet-pacemaker/blob/master/manifests/conf.pp The usage is as easy as: pacemaker::conf { 'authkey':   source => 'site/pacemaker/authkey', } Managing users and dependencies Sometimes a module has to create a user or have some prerequisite packages installed in order to have its application running correctly. These are the kind of "extra" resources that can create conflicts among modules, as we may have them already defined somewhere else in the catalog via other modules. For example, we may want to manage users in our own way and don't want them to be created by an application module, or we may already have classes that manage the module's prerequisite. There's not a universally defined way to cope with these cases in Puppet, if not the principle of single point of responsibility, which might conflict with the need to have a full working module, when it requires external prerequisites. My personal approach, which I've not seen being used around, is to let the users define the name of alternative classes, if any, where such resources can be managed. On the code side, the implementation is quite easy: class elasticsearch (   $user_class          = 'elasticsearch::user',   ) { [...]   if $elasticsearch::user_class {     require $elasticsearch::user_class   } Also, of course, in elasticsearch/manifests/user.pp, we can define the module's default elasticsearch::user class. Module users can provide custom classes with: class { 'elasticsearch':   user_class => 'site::users::elasticsearch', } Otherwise, they decide to manage users in other ways and unset any class name: class { 'elasticsearch':   user_class => '', } Something similar can be done for a dependency class or other classes. In an outburst of a reusability spree, in some cases, I added parameters to let users define alternative classes for the typical module classes: class postgresql (   $install_class             = 'postgresql::install',   $config_class              = 'postgresql::config',   $setup_class               = 'postgresql::setup',   $service_class             = 'postgresql::service',   [… ] ) { […] } Maybe this is really too much, but, for example, giving users the option to define the install class to use, and have it integrated in the module's own relationships logic, may be useful for cases where we want to manage the installation in a custom way. Managing installation options Generally, it is recommended to always install applications via packages, eventually to be created onsite when we can't find fitting public repositories. 
Still, sometimes, we might need to, have to, or want to install an application in other ways; for example just downloading its archive, extracting it, and eventually compiling it. It may not be a best practice, but still it can be done, and people do it. Another reusability feature a module may provide is alternative methods to manage the installation of an application. Implementation may be as easy as: class elasticsearch (   $install_class       = 'elasticsearch::install',   $install             = 'package',   $install_base_url    = $elasticsearch::params::install_base_url,   $install_destination = '/opt',   ) { These options expose both the install method to be used, the name of the installation class (so that it can be overridden), the URL from where to retrieve the archive, and the destination at which to install it. In init.pp, we can include the install class using the parameter that sets its name: include $install_class In the default install class file (here install.pp) manage the install parameter with a case switch: class elasticsearch::install {   case $elasticsearch::install {     package: {       package { $elasticsearch::package:         ensure   => $elasticsearch::managed_package_ensure,         provider => $elasticsearch::package_provider,       }     }     upstream: {       puppi::netinstall { 'netinstall_elasticsearch':         url             => $elasticsearch::base_url,         destination_dir => $elasticsearch::install_destination,         owner           => $elasticsearch::user,         group           => $elasticsearch::user,       }     }     default: { fail('No valid install method defined') }   } } The puppi::netinstall defined in the preceding code comes from a module of mine (https://github.com/example42/puppi) and it's used to download, extract, and eventually execute custom commands on any kind of archive. Users can therefore define which installation method to use with the install parameter and they can even provide another class to manage in a custom way the installation of the application. Managing extra resources Many times, we have in our environment some customizations that cannot be managed just by setting different parameters or names. Sometimes, we have to create extra resources, which no public module may provide as they are too custom and specific. While we can place these extra resources in any class, we may include in our nodes; it may be useful to link this extra class directly to our module, providing a parameter that lets us specify the name of an extra custom class, which, if present, is included (and contained) by the module: class elasticsearch (   $my_class            = undef,   ) { [...]   if $elasticsearch::my_class {     include $elasticsearch::my_class     Class[$elasticsearch::my_class] ->      Anchor['elasticsearch::end']   } } Another method to let users create extra resources by passing a parameter to a class is based on the create_resources function. We have already seen it; it creates all the resources of a given type from a nested hash where their names and arguments can be defined. Here is an example from https://github.com/example42/puppet-network: class network (   $interfaces_hash           = undef,   […] ) { […]   if $interfaces_hash {     create_resources('network::interface', $interfaces_hash)   } } In this case, the type used is network::interface (provided by the same module) and it can be fed with a hash. 
On Hiera, with the YAML backend, it could look like this:

---
network::interfaces_hash:
  eth0:
    method: manual
    bond_master: 'bond3'
    allow_hotplug: 'bond3 eth0 eth1 eth2 eth3'
  eth1:
    method: manual
    bond_master: 'bond3'
  bond3:
    ipaddress: '10.10.10.3'
    netmask: '255.255.255.0'
    gateway: '10.10.10.1'
    dns_nameservers: '8.8.8.8 8.8.4.4'
    bond_mode: 'balance-alb'
    bond_miimon: '100'
    bond_slaves: 'none'

Summary

As we can imagine, the usage patterns that such a function allows are quite wide and interesting. Being able to base all the information we need to create a resource on pure data can definitely shift most of the logic and implementation that would otherwise be done with Puppet code and normal resources to the data backend.
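As a closing illustration of this data-driven approach, the same hash could also be passed to the class directly in Puppet code instead of through Hiera. The following is only a sketch for comparison, reusing a subset of the illustrative interface names and values from the example above; it is not taken from the module's documentation:

# Explicit class declaration equivalent to the Hiera data above (illustrative values)
class { 'network':
  interfaces_hash => {
    'eth0'  => {
      'method'      => 'manual',
      'bond_master' => 'bond3',
    },
    'bond3' => {
      'ipaddress' => '10.10.10.3',
      'netmask'   => '255.255.255.0',
      'gateway'   => '10.10.10.1',
      'bond_mode' => 'balance-alb',
    },
  },
}

Whichever way the data is supplied, create_resources('network::interface', $interfaces_hash) turns each key into a network::interface resource, with the nested hash providing its arguments.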
Hello World Program

Packt
20 Apr 2016
12 min read
In this article by Manoj Kumar, author of the book Learning Sinatra, we will write an application. Make sure that you have Ruby installed. We will get a basic skeleton app up and running and see how to structure the application. (For more resources related to this topic, see here.)

In this article, we will discuss the following topics:

- A project that will be used to understand Sinatra
- The Bundler gem
- The file structure of the application
- The responsibilities of each file

Before we begin writing our application, let's write the Hello World application.

Getting started

The Hello World program is as follows:

require 'sinatra'

get '/' do
  return 'Hello World!'
end

Save this code in a file named helloworld.rb and run it from the command line:

ruby helloworld.rb

Executing this will run the application and the server will listen on port 4567. If we point our browser to http://localhost:4567/, the output will be as shown in the following screenshot:

The application

To understand how to write a Sinatra application, we will take a small project and discuss every part of the program in detail.

The idea

We will make a ToDo app and use Sinatra along with a lot of other libraries. The features of the app will be as follows:

- Each user can have multiple to-do lists
- Each to-do list will have multiple items
- To-do lists can be private, public, or shared with a group
- Items in each to-do list can be assigned to a user or group

The modules that we will build are as follows:

- Users: This will manage the users and groups
- List: This will manage the to-do lists
- Items: This will manage the items for all the to-do lists

Before we start writing the code, let's see what the file structure will be like, understand why each of these files is required, and learn about some new files.

The file structure

It is always better to keep certain files in certain folders for better readability. We could dump all the files in the home folder; however, that would make it difficult for us to manage the code:

The app.rb file

This file is the base file that loads all the other files (such as models, libs, and so on) and starts the application. We can configure various settings of Sinatra here according to the various deployment environments.

The config.ru file

The config.ru file is generally used when we need to deploy our application with different application servers, such as Passenger, Unicorn, or Heroku. It also makes it easy to maintain the different deployment environments using config.ru.

Gemfile

This is one of the more interesting things that we can do with Ruby applications. As we know, we can use a variety of gems for different purposes. The gems are just pieces of code and are constantly updated. Therefore, sometimes, we need to use specific versions of gems to maintain the stability of our application. We list all the gems that we are going to use for our application with their versions. Before we discuss how to use this Gemfile, we will talk about the gem bundler.

Bundler

The gem bundler manages the installation of all the gems and their dependencies. Of course, we would need to install the gem bundler manually:

gem install bundler

This will install the latest stable version of the bundler gem. Once we are done with this, we need to create a new file with the name Gemfile (yes, with a capital G) and add the gems that we will use. It is not necessary to add all the gems to Gemfile before starting to write the application.
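To make this concrete, a first cut of the ToDo app's Gemfile might contain nothing more than a source and the sinatra gem; the version constraint below is only a placeholder, and the exact versions used in this article appear in the Gemfile section near the end:

source 'https://rubygems.org'

# Placeholder constraint; we will pin exact versions later
gem 'sinatra', '~> 1.4'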
We can add and remove gems as we require; however, after every change, we need to run the following: bundle install This will make sure that all the required gems and their dependencies are installed. It will also create a 'Gemfile.lock' file. Make sure that we do not edit this file. It contains all the gems and their dependencies information. Therefore, we now know why we should use Gemfile. This is the lib/routes.rb path for folder containing the routes file. What is a route? A route is the URL path for which the application serves a web page when requested. For example, when we type http://www.example.com/, the URL path is / and when we type http://www.example.com/something/, /something/ is the URL path. Now, we need to explicitly define all the routes for which we will be serving requests so that our application knows what to return. It is not important to have this file in the lib folder or to even have it at all. We can also write the routes in the app.rb file. Consider the following examples: get '/' do # code end post '/something' do # code end Both of the preceding routes are valid. The get and post method are the HTTP methods. The first code block will be executed when a GET request is made on / and the second one will be executed when a POST request is made on /something. The only reason we are writing the routes in a separate file is to maintain clean code. The responsibility of each file will be clearly understood in the following: models/: This folder contains all the files that define model of the application. When we write the models for our application, we will save them in this folder. public/: This folder contains all our CSS, JavaScript, and image files. views/: This folder will contain all the files that define the views, such as HTML, HAML, and ERB files. The code Now, we know what we want to build. You also have a rough idea about what our file structure would be. When we run the application, the rackup file that we load will be config.ru. This file tells the server what environment to use and which file is the main application to load. Before running the server, we need to write a minimum code. It includes writing three files, as follows: app.rb config.ru Gemfile We can, of course, write these files in any order we want; however, we need to make sure that all three files have sufficient code for the application to work. Let's start with the app.rb file. The app.rb file This is the file that config.ru loads when the application is executed. This file, in turn, loads all the other files that help it to understand the available routes and the underlying model: 1 require 'sinatra' 2 3 class Todo < Sinatra::Base 4 set :environment, ENV['RACK_ENV'] 5 6 configure do 7 end 8 9 Dir[File.join(File.dirname(__FILE__),'models','*.rb')].each { |model| require model } 10 Dir[File.join(File.dirname(__FILE__),'lib','*.rb')].each { |lib| load lib } 11 12 end What does this code do? Let's see what this code does in the following: 1 require 'sinatra' //This loads the sinatra gem into memory. 3 class Todo < Sinatra::Base 4 set :environment, ENV['RACK_ENV'] 5 6 configure do 7 end 8 9 Dir[File.join(File.dirname(__FILE__),'models','*.rb')].each { |model| require model } 10 Dir[File.join(File.dirname(__FILE__),'lib','*.rb')].each { |lib| load lib } 11 12 end This defines our main application's class. This skeleton is enough to start the basic application. We inherit the Base class of the Sinatra module. 
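Since app.rb loads every file in lib/, including lib/routes.rb, it may help to see roughly what that routes file could contain for our ToDo app. The following is only an illustrative sketch based on the structure described so far; the paths and response bodies are placeholders rather than code from the book:

# lib/routes.rb - illustrative sketch only
class Todo < Sinatra::Base
  # List all to-do lists
  get '/lists' do
    'All to-do lists'
  end

  # Show the items of a single list
  get '/lists/:id' do
    "Items for list #{params[:id]}"
  end

  # Create a new list
  post '/lists' do
    'Created a new to-do list'
  end
end

Keeping the routes in lib/ like this leaves app.rb focused on configuration and loading, which is what the rest of this section walks through.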
Before starting the application, we may want to change some basic configuration settings such as logging, error display, user sessions, and so on. We handle all these configurations through the configure blocks. Also, we might need different configurations for different environments. For example, in development mode, we might want to see all the errors; however, in production we don’t want the end user to see the error dump. Therefore, we can define the configurations for different environments. The first step would be to set the application environment to the concerned one, as follows: 4 set :environment, ENV['RACK_ENV'] We will later see that we can have multiple configure blocks for multiple environments. This line reads the system environment RACK_ENV variable and sets the same environment for the application. When we discuss config.ru, we will see how to set RACK_ENV in the first place: 6 configure do 7 end The following is how we define a configure block. Note that here we have not informed the application that to which environment do these configurations need to be applied. In such cases, this becomes the generic configuration for all the environments and this is generally the last configuration block. All the environment-specific configurations should be written before this block in order to avoid code overriding: 9 Dir[File.join(File.dirname(__FILE__),'models','*.rb')].each { |model| require model } If we see the file structure discussed earlier, we can see that models/ is a directory that contains the model files. We need to import all these files in the application. We have kept all our model files in the models/ folder: Dir[File.join(File.dirname(__FILE__),'models','*.rb')] This would return an array of files having the .rb extension in the models folder. Doing this, avoids writing one require line for each file and modifying this file again: 10 Dir[File.join(File.dirname(__FILE__),'lib','*.rb')].each { |lib| load lib } Similarly, we will import all the files in the lib/ folder. Therefore, in short, the app.rb configures our application according to the deployment environment and imports the model files and the other library files before starting the application. Now, let's proceed to write our next file. The config.ru file The config.ru is the rackup file of the application. This loads all the gems and app.rb. We generally pass this file as a parameter to the server, as follows: 1 require 'sinatra' 2 require 'bundler/setup' 3 Bundler.require 4 5 ENV["RACK_ENV"] = "development" 6 7 require File.join(File.dirname(__FILE__), 'app.rb') 8 9 Todo .start! W Working of the code Let's go through each of the lines, as follows: 1 require 'sinatra' 2 require 'bundler/setup' The first two lines import the gems. This is exactly what we do in other languages. The gem 'sinatra' command will include all the Sinatra classes and help in listening to requests, while the bundler gem will manage all the other gems. As we have discussed earlier, we will always use bundler to manage our gems. 3 Bundler.require This line of the code will check Gemfile and make sure that all the gems available match the version and all the dependencies are met. This does not import all the gems as all gems may not be needed in the memory at all times: 5 ENV["RACK_ENV"] = "development" This code will set the system environment RACK_ENV variable to development. This will help the server know which configurations does it need to use. 
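As noted above, we can have more than one configure block, one per environment, and the RACK_ENV value decides which of them apply. A minimal sketch of how that could look in app.rb follows; the individual settings (logging and exception display) are common Sinatra options picked purely for illustration, not settings prescribed by the book:

# Illustrative environment-specific configuration inside the Todo class
configure :development do
  enable :logging             # verbose output while developing
  set :show_exceptions, true  # show full error pages in the browser
end

configure :production do
  set :show_exceptions, false # do not leak stack traces to end users
end

configure do
  # generic settings shared by all environments go last
end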
We will later see how to manage a single configuration file with different settings for different environments and use one particular set of configurations for the given environment. If we use version control for our application, config.ru is not version controlled. It has to be customized on whether our environment is development, staging, testing, or production. We may version control a sample config.ru. We will discuss this when we talk about deploying our application. Next, we will require the main application file, as follows: 7 require File.join(File.dirname(__FILE__), 'app.rb') We see here that we have used the File class to include app.rb: File.dirname(__FILE__) It is a convention to keep config.ru and app.rb in the same folder. It is good practice to give the complete file path whenever we require a file in order to avoid breaking the code. Therefore, this part of the code will return the path of the folder containing config.ru. Now, we know that our main application file is in the same folder as config.ru, therefore, we do the following: File.join(File.dirname(__FILE__), 'app.rb') This would return the complete file path of app.rb and the line 7 will load the main application file in the memory. Now, all we need to do is execute app.rb to start the application, as follows: 9 Todo .start! We see that the start! method is not defined by us in the Todo class in app.rb. This is inherited from the Sinatra::Base class. It starts the application and listens to incoming requests. In short, config.ru checks the availability of all the gems and their dependencies, sets the environment variables, and starts the application. The easiest file to write is Gemfile. It has no complex code and logic. It just contains a list of gems and their version details. Gemfile In Gemfile, we need to specify the source from where the gems will be downloaded and the list of the gems. Therefore, let's write a Gemfile with the following lines: 1 source 'https://rubygems.org' 2 gem 'bundler', '1.6.0' 3 gem 'sinatra', '1.4.4' The first line specifies the source. The https://rubygems.org website is a trusted place to download gems. It has a large collection of gems hosted. We can view this page, search for gems that we want to use, read the documentation, and select the exact version for our application. Generally, the latest stable version of bundler is used. Therefore, we search the site for bundler and find out its version. We do the same for the Sinatra gem. Summary In this article, you learned how to build a Hello World program using Sinatra. Resources for Article: Further resources on this subject: Getting Ready for RubyMotion[article] Quick start - your first Sinatra application[article] Building tiny Web-applications in Ruby using Sinatra[article]
Working With Compliance

Packt
19 Apr 2016
10 min read
In this article by Abhijeet Shriram Janwalkar, the author of VMware vRealize Configuration Manager Cookbook, we will discuss how to check compliance, create exceptions so that we don't get any false positives, and finally, create some Alert Rules that will alert us when non-compliant rules are found. (For more resources related to this topic, see here.) Checking the compliance of the infrastructure After creating all the rules, rule groups, and templates, we need to check the compliance of the infrastructure. We will learn how to check how compliant we are against internal standards, or we can directly use standard compliance packs we have already downloaded and imported. We will use a standard imported template for this recipe. Getting ready All the heavy lifting should have been done on the VCM server: it should be ready with the templates and at least one machine group, which will have the machines for whom we need to check the compliance, or we can use the default machine groups available. Using our own machine group is preferable. How to do it... As mentioned earlier, we will use an imported standard template, International Organization for Standardization 27001-27002- Windows 2008 R2 Mbr Server Controls, and we will run this against the default machine group, All Machines. Follow these steps to check the compliance of the Windows servers: Once logged in to VCM, go to Compliance | Templates Make sure the correct machine group is selected from the top; this is how VCM decides which machines to apply the template to measure the compliance. If you want to change the machine group, click on the Machine Group, and from the popup, select the correct machine group. Select the required template from the right-hand side and click on Run.   Depending upon the organization policies, decide to enforce or not enforce the compliance. Not all rules are enforceable; also, we can cause issues such as breaking a working application; for example, if the print spooler service is required to be disabled and we disable the service when we enforce the compliance, this will create an issue on the printer farm as it will stop functioning. So it is better that we first learn what is non-compliant and then make necessary exceptions. We can then enforce compliance from VCM or can ask the respective server owners to take the necessary action.   In a few minutes, depending on how many machines there are to check in the machine group, the compliance run will finish. Click on Close.   The compliance status can be viewed by navigating to the template on the left-hand side and selecting the correct machine group from the top.   In our case, our support team needs to work a lot as we are non-compliant. How it works... When we ask VCM to check compliance, it first applies the filters available in the rule groups, and then, only the machines that pass those filters are considered. The compliance checks are performed on the data collected by VCM and are available in the database, unlike some other tools that perform the checks at the client end, after which the client submits the data to VCM. The process followed by VCM is better as this can be performed on servers that are offline at that time, and when we check the result, we get the value because of which the machine is non-compliant for a rule. Again, this has some issues as well: first, we need to make sure our VCM is clean. 
By this, I mean whether a machine is purged from VCM when it is decommissioned, or else we will have details of machines that are not present in the infrastructure, and that could affect our final compliance score. The second issue is that it does not give us live details as it works on the data in its database; again, this can produce false positives. To counter this issue, we can schedule a compliance check after a full data collection for that machine group, in which way we will not have stale data to process. Once the compliance has been checked, and if we have chosen to enforce the compliance, it will create jobs to enforce them and will start executing on the managed machines; for example, if we have rules to check the status of a service and expect certain services in the running state, then VCM will start those services. Creating compliance exceptions As you know, every rule has an exception, and this is applicable to compliance as well: you create a rule for blocking the SMTP port on all the servers, and then, you have mail servers that need this port active. Now, we can't block the port, but at the same time, we know this is a known and accepted deviation; hence, we don't want our compliance score to suffer a hit because of this. To solve this, what we can do is add an exception so that this will not create issues while checking compliance. Getting ready Our organization has a policy to disable unwanted services on servers, and the print spooler is considered an unwanted service, so it must be disabled on all the servers but, of course, the exceptions are the print servers. We will create an exception for the print server machine group to be excused from this mandate. We will need rules created in VCM along with a machine group that will include all the print servers. How to do it... Let's create an exception for our print servers by following these steps: Log in to VCM and go to Compliance | Machine Group Compliance | Exceptions. Click on Add.   Provide a descriptive Name and Description, and click on Next.   Select the template for which you want this machine group to be excluded. In this case, we are selecting the one created by our organization rules. Click on Next.   Select the machine group created for this exception; in our case, it is named Print Servers. Click on Next. Select Override non-compliant results to compliant.  I really don't know why there is another option, but there must be a use case that I am not aware of.   We want this exception only for our rule for the print spooler server, called Service_Print_Spooler; so select that rule. Depending upon you requirement, you can have the exception for a complete rule group as well. But having exception for a single rule is sufficient in our case. Click on Finish.   You can enable/disable this exception as per requirement.   How it works... Exceptions are considered when we do a compliance check, and a final score is calculated. By creating an exception, we make sure that we don't get a bad score just because we need to have some things non-compliant. Also, this helps when we are enforcing compliance like in the earlier case, where we enforced Service Status to be disabled then VCM disabled the print spooler service on all the servers including the print servers, and that would have affected productivity. So, creating compliance exception is a win-win situation for both teams: the security team has a nice compliant environment and the printer admin team has a working print farm. 
Creating compliance alert rules Nobody likes to wait and nobody likes to work on Excel, so what if we get a ticket in our ticketing tool if a managed machine is non-complaint. We can create alerts and then maybe integrate them with a ticketing tool that can create a ticket for us, or VCM can send an e-mail to configured e-mail IDs. Getting ready We will need a working VCM server that is configured to check compliance. How to do it... This is a two-step process; first, we need to create an alert rule and then associate the created alert with the machine group. So, let's begin creating a compliance alert: Log in to VCM and go to Administration | Alerts | Rules. Click on Add.   In the wizard, give descriptive name and add a description. Click on Next. Select Compliance Results Data as the Data Type and click on Next.   As we want to create an alert for the non-compliancy of the rules created for our organization standards, select the appropriate compliance template. If we want an alert for the ISO 27001-27002 standard, we should have opted for that template.   On the next page, accept the newly create rule by clicking on Finish (the button is not in the screenshot).   The next process is to associate this Rule to correct Machine group. So we will continue to step 6. Now move to Administration | AlertsàMachine Group Configuration, and select the Machine group for which you would like the alert to be generated and click Add.   Select the alert we created and click on Next.   Select Severity and click on Next (not shown in the screenshot).   Select the actions that need to be done when the alert is created: we can send an e-mail, we can send an SNMP trap to a monitoring system or VCO that will create an alert in the organization ticketing system, or we can write the log to the Windows event log, and then, from there it will be picked up by the monitoring system to create a ticket. We are choosing to send an e-mail to the concerned people or teams.   Provide the details of who should receive the e-mail, the sender's e-mail ID, the SMTP server, e-mail subject, and modify the message body.   If required, you can check for alerts and click on Finish to close the wizard (the button is not shown in the screenshot).   The alerts can be seen at Console| Alerts.   How it works... We can't just depend on reports for checking compliance, even though that is a good way to check the status, but getting alerts for a non-compliant machine can be more proactive than going through a report. When we check or schedule a check for compliance, the result can be stored in the VCM database and is fetched when we visit the Compliance tab on the VCM console, Alerts provide a more proactive approach: they tell you that there is something wrong and you need to check it, so after every compliance check, if that machine group has something non-compliant, an alert will be created and that will take configured actions like sending an e-mail, sending an SNMP trap, or writing an event to the Windows logs. Those can be proactively worked upon rather than going to the console and checking the reports.   Summary In this article, we learned how to check compliance using a standard imported template. We also learned how to create exceptions for our compliance rules so that standard services can be run without causing our score to go down. Finally, we looked at alerts and ticketing systems. Resources for Article: Further resources on this subject: VM, It Is Not What You Think! 
[article] vRealize Automation and the Deconstruction of Components [article] Deploying New Hosts with vCenter [article]
Introducing the Swift Programming Language

Packt
19 Apr 2016
25 min read
In this article by Steven Daniel, the author of Apple Watch App Development, we will introduce ourselves to Apple's Swift programming language. At WWDC 2014, Apple introduced a brand new programming language called Swift. The Swift programming language brings concise syntax, type safety, and modern programming language features to Mac and iOS developers. (For more resources related to this topic, see here.) Since its release, the Apple developer community has responded with great excitement, and developers are rapidly starting to adopt this new language within their own applications. The Swift language is the future of developing on Apple's platforms. This article includes the following topics: Learning how to register as an Apple developer Learning how to download and install Xcode development tools Introduction to the Swift programming language Learning how to work with Xcode playgrounds Introduction to the newest additions in Swift 2.0 Registering as an Apple developer Before you can begin building iOS applications for your iOS devices, you must first join as a registered user of Apple Developer Program in order to download all of the necessary components to your computer. The registration process is free and provides you with access to the iOS SDK and other developer resources that are really useful to get you started. The following short list outlines some of the things that you will be able to access once you become a registered member of Apple Developer Program: It provides helpful “Getting Started” guides to help you get up and running quickly It gives you helpful tips that show you how to submit your apps to App Store It provides the ability to download the current releases of iOS software It provides the ability to beta test the releases of iOS and the iOS SDK It provides access to Apple Developer Forums Whether you develop applications for the iPhone or iPad, these use the same OS and iOS SDK that allows you to create universal apps that will work with each of these devices. On the other hand, Apple Watch uses an entirely different OS called watchOS. To prepare your computer for iOS development, you need to register as an Apple developer. This free process gives you access to the basic levels of development that allow you to test your app using iOS Simulator without the ability to sell your app on Apple App Store. The steps are as follows: To sign up to Apple Developer Program, you will need to go to https://developer.apple.com/programs/ and then click on the Enroll button to proceed, as shown in the following screenshot: Next, click on the Start Your Enrollment button, as shown in the following screenshot: Once you sign up, you will then be able to download the iOS SDK and proceed with installing it onto your computer. You will then become an official member of Apple Developer Program. You will then be able to download beta software so that you can test them on your actual device hardware as well as having the freedom to distribute your apps to your end users. In the next section, we will look at how to download and install Xcode development tools. Getting and installing Xcode development tools In this section, we will take a look at what Integrated Development Environments (IDEs) and Software Development Kits (SDKs) are needed to develop applications for the iOS platform, which is Apple's operating system for mobile devices. 
We will explain the importance of each tool's role in the development cycle and the tools required to develop applications for the iOS platform, which are as follows: An Intel-based Mac computer running OS X Yosemite (10.10.2) or later with the latest point release and security patches installed is required. This is so that you can install the latest version of the Xcode development tool. Xcode 6.4 or later is required. Xcode is the main development tool for iOS. You need Xcode 6.4 minimum as this version includes Swift 1.2, and you must be registered as an Apple developer. The iOS SDK consists of the following components: Component Description Xcode This is the main IDE that enables you to develop, edit, and debug your native applications for the iOS and Mac platforms using the Objective-C or Swift programming languages. iOS Simulator This is a Cocoa-based application that enables you to debug your iOS applications on your computer without the need of having an iOS device. There are many iOS features that simply won't work within Simulator, so a device is required if an application uses features such as the Core Location and MapKit frameworks. Instruments These are the analysis tools that help you optimize your applications and monitor memory leaks during the execution of your application in real time. Dashcode This enables you to develop web-based iOS applications and dashboard widgets. Once you are registered, you will need to download and install Xcode developer tools by performing the following steps: Begin by downloading and installing Xcode from Mac App Store at https://itunes.apple.com/au/app/xcode/id497799835?mt=12. Select either the Free or Install button on the App Store page. Once it completes the installation process, you will be able to launch Xcode.app from your Applications folder. You can find additional development tools from the Apple developer website at https://developer.apple.com/. In the next section, we will be look at what exactly Xcode playgrounds are and how you can use them to experiment with designing code algorithms prior to incorporating the code into your project. So, let's get started. Introduction to Xcode playgrounds A playground is basically an interactive Swift coding environment that displays the results of each statement as updates are made without having the need to compile and run a project. You can use playgrounds to learn and explore Swift, prototype parts of your app, and create learning environments for others. The interactive Swift code environment lets you experiment with algorithms, explore system APIs, and even create your very own custom views without the need of having to create a project. Once you perfect your code in the playground, simply move this code into your project. Given that playgrounds are highly interactive, they are a wonderful vehicle for distributing code samples with instructive documentation and can even be used as an alternative medium for presentations. With the new Xcode 7 IDE, you can incorporate rich text comments with bold, italic, and bulleted lists with the addition of having the ability to embed images and links. You can even embed resources and support Swift source code in the playground to make the experience incredibly powerful and engaging, while the visible code remains simple. 
Playgrounds provide you with the ability to do the following: Share curriculum to teach programming with beautiful text and interactive code Design a new algorithm and watch its results every step of the way Create new tests and verify that they work before promoting them into your test suite Experiment with new APIs to hone your Swift coding skills Turn your experiments into documentation with example code that runs directly within the playground Let's begin by opening the Xcode IDE and explore how to create a new playground file for the first time. Perform the following steps: Open the Xcode.app application either using the finder in your Applications directory or using Apple's Launchpad. If you've never created or opened an Xcode project before, you will be presented with the following screen: In the Welcome to Xcode dialog, select the Get started with a playground option. If this dialog doesn't appear, you can navigate to File | New | Playground… or simply press Shift + Option + Command + N. Next, enter SwiftLanguageBasics as the name of your playground. Then, ensure that you choose iOS as the platform that we will target. Click on the Next button to proceed to the next step in the wizard. Specify the location where you would like to save your project. Then, click on the Create button to save your playground at the specified location. Once your project is created, you will be presented with the default playground template, as shown in the following screenshot: In the next section, you will begin learning about some of the Swift language basics, start adding lines of code within this playground file, and see the results that we get when they are executed. Introduction to the Swift language In this section, we will introduce some of the new and exciting features of the Swift programming language. So, let's get started. Variables, constants, strings, and semicolons Our next step is to familiarize ourselves with the differences between variables, constants, strings, and semicolons in a bit more detail. We will work with and use Xcode playgrounds to put each of these into practice. Variables A variable is a value that can change. Every variable contains a name, called the variable name, and must contain a data type. The data type indicates what sort of value the variable represents, such as whether it is an integer, a floating point number, or a string. Let's take a look at how we can put this into practice and create a variable in Swift. First, let's start by revealing the console output window by navigating to View | Debug Area | Show Debug Area. Next, clear the contents of the playground template and replace them with the following code snippet: /*: # Swift Language Basics – Variables : Created by Steven F. Daniel : Copyright © 2015 GENIESOFT STUDIOS. All Rights Reserved. */ var myGreeting = "Welcome to Learning the basics of Swift Programming" print(myGreeting, terminator: "") As you begin to type in the code, you should immediately see the Welcome to Learning the basics of Swift text magically appear in the right-hand pane in which the assignment takes place and it appears once more for the print statement. The right-hand pane is great to show you smaller output, but for longer debugging output, you would normally take a look at the Xcode console. Constants A constant is basically a value that cannot be changed. Creating these constant variables prevents you from performing accidental assignments and can even improve performance. 
Let's take a look at how we can put this into practice and create a constant variable in Swift. Clear the contents of the playground template and replace them with the following code snippet: /*: # Swift Language Basics – Constants : Created by Steven F. Daniel : Copyright © 2015 GENIESOFT STUDIOS. All Rights Reserved. */ let myGreeting = "Welcome to Learning the basics" print(myGreeting, terminator: "") myGreeting += " of Swift Programming" Take a look at the screenshot now: As you begin to type in the code, you will immediately receive an error message stating that you cannot assign myGreeting to our let value because the object is not mutable. In Swift, you can control the mutability of the built-in Swift types by using either the let or var keywords during declaration. Strings A string is basically an ordered collection of characters—for example "hello, world". In Swift, strings are represented by the String data type, which represents a collection of values of the char data type. You can use strings to insert constants, variables, literals, and expressions into longer strings in a process known as string interpolation, which we will cover later on in this article. This makes it easy to create custom string values for display, storage, and printing. Let's take a look at how we can put this into practice, create a String variable in Swift, and utilize some of the string methods. Clear the contents of the playground template and replace them with the following code snippet: /*: # Swift Language Basics – Strings : Created by Steven F. Daniel : Copyright © 2015 GENIESOFT STUDIOS. All Rights Reserved. */ import Foundation let myGreeting = "Welcome to Swift Language Basics, working with Strings" // Make our String uppercase print(myGreeting.uppercaseString) // Append exclamation mark at the end of the string var newGreeting = myGreeting.stringByAppendingString("!!!") print(newGreeting) Take a look at the screenshot now: As you can note in the preceding code snippet, we began by importing the Foundation framework class, which contains several APIs to deal with objects such as strings and dates. Next, we declared our myGreeting constant variable and then assigned a default string. We then used the uppercaseString method of the string object to perform a function to make all of the characters within our string uppercase. In our next step, we will declare a new variable called newGreeting and call the stringByAppendingString method to append additional characters at the end of our string. For more information on using the String class, you can consult the Swift programming language documentation at https://developer.apple.com/library/ios/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html. Semicolons As you would have probably noticed so far, the code you wrote doesn't contain any semicolons. This is because in Swift, these are only required if you want to write multiple statements on a single line. Let's take a look at a code example to see how we can put this into practice. Delete the contents of the playground template and replace them with the following code snippet: /*: # Swift Language Basics – Semicolons : Created by Steven F. Daniel : Copyright © 2015 GENIESOFT STUDIOS. All Rights Reserved. 
*/ import Foundation var myGreeting = "Welcome to Swift Language" let newString = myGreeting + " Basics, ".uppercaseString + "working with semicolons"; print(newString) Take a look the following screenshot now: As you can note in the preceding code snippet, we began by declaring our myGreeting variable and then assigned a default string. In our next step, we declared a new variable called newString, concatenated the details from our myGreeting string, and used the uppercaseString method, which cycles through each character within our string, making our characters uppercase. Next, we appended the additional working with semicolons string to the end of our string and finally used the print statement to output the contents of our newString variable to the console window. As you must have noticed, we included a semicolon to the end of the statement; this is because in Swift, you are required to include semicolons if you want to write multiple statements on a single line. Numeric types and conversion In this section, we will take a look at how we can perform arithmetic operations on our Swift variables. In this example, we will look at how to calculate the area of a triangle, given a base and height value. Let's take a look at a code example to see how we can put this into practice. Clear the contents of the playground template and replace them with the following code snippet: /*: # Swift Language Basics - Numeric Types and Conversion : Created by Steven F. Daniel : Copyright © 2015 GENIESOFT STUDIOS. All Rights Reserved. */ import Foundation // method to calculate the area of a triangle func calcTriangleArea(triBase: Double, triHeight: Double) -> Double { return (triBase * triHeight) / 2 } // Declare our base and height of our triangle let base = 20.0 let height = 120.0 // Calculate and display the area of the triangle and print ("The calculated Area is: " + String(calcTriangleArea(base, triHeight: height))); Take a look at the following screenshot now: As you can note in the preceding code snippet, we started by creating our calcTriangleArea function method, which accepts a base and a height parameter value in order to calculate the area of the triangle. In our next step, we declared two variables, base and height, which contain the assigned values that will be used to calculate the base and the height of our triangle. Next, we made a call to our calcTriangleArea method, passing in the values for our base and height before finally using the print statement to output the calculated area of our triangle to the console window. An important feature of the Swift programming language is that all numeric data type conversions must be explicit, regardless of whether you want to convert to a data type containing more of less precision. Booleans, tuples, and string interpolation In this section, we will look at the various features that come with the Swift programming language. We will look at the improvements that Swift has over Objective-C when it comes to using Booleans and string interpolation before finally discussing how we can use tuples to access elements from a string. Booleans Boolean variables in Swift are basically defined using the Bool data type. This data type can only hold values containing either true or false. Let's take a look at a code example to see how we can put this into practice. Clear the contents of the playground template and replace them with the following code snippet: /*: # Swift Language Basics – Booleans : Created by Steven F. 
Daniel : Copyright © 2015 GENIESOFT STUDIOS. All Rights Reserved. */ import Foundation let displaySettings : Bool = true print("Display Settings is: " + (displaySettings ? "ON" : "OFF")) Take a look at the following screenshot now: As you can note from the preceding code snippet, we started by declaring our constant variable called displaySettings and assigned it a default Boolean value of true. Next, we performed a check to see whether the value of our displaySettings variable is set to true and called our print statement to output the Display Settings is: ON value to the console window. In Objective-C, you would assign a value of 1 and 0 to denote true and false; this is no longer the case with Swift because Swift doesn't treat 1 as true and 0 as false. You need to explicitly use the actual Boolean values to stay within Swift's data type system. Let's replace the existing playground code with the following code snippet to take a look at what would happen if we changed our value from true to 1: /*: # Swift Language Basics – Booleans : Created by Steven F. Daniel : Copyright © 2015 GENIESOFT STUDIOS. All Rights Reserved. */ import Foundation let displaySettings : Bool = 1 // This will cause an error!!! print("Display Settings is: " + (displaySettings ? "ON" : "OFF")) Take a look at the following screenshot now: As you can note from the previous screenshot, Swift detected that we were assigning an integer value to our Boolean data type and threw an error message. Tuples Tuples provide you with the ability to group multiple values into a single compound value. The values contained within a tuple can be any data type, and therefore are not required to be of the same type. Let's take a look at a code example, to see how we can put this into practice. Clear the contents of the playground template and replace them with the following code snippet: /*: # Swift Language Basics – Tuples : Created by Steven F. Daniel : Copyright © 2015 GENIESOFT STUDIOS. All Rights Reserved. */ import Foundation // Define our Address Details var addressDetails = ("Apple Inc.", "1 Infinite Loop", "Cupertino, California", "United States"); print(addressDetails.0) // Get the Name print(addressDetails.1) // Address print(addressDetails.2) // State print(addressDetails.3) // Country Take a look at the following screenshot now: As you can note from the preceding code snippet, we started by declaring a tuple variable called addressDetails that contains a combination of strings. Next, we accessed each of the tuple elements by referencing their index values and displayed each of these elements in the console window. Let's say that you want to modify the contents of the first element within your tuple. Add the following code snippet after your var addressDetails variable: // Modify the element within our String addressDetails.0 = "Apple Computers, Inc." Take a look at the following screenshot now: As you can note from the preceding screenshot, we modified our first component within our tuple to the Apple Computers, Inc value. If you do not want modifications to be made to your variable, you can just change the var keyword to let, and the assignment would result in a compilation error. You can also express your tuples by referencing them using their named elements. This makes it really useful as you can ensure that your users know exactly what the element refers to. 
If you express your tuples using their named elements, you will still be able to access your elements using their index notation, as can be seen in the following highlighted code snippet: /*: # Swift Language Basics – Tuples : Created by Steven F. Daniel : Copyright © 2015 GENIESOFT STUDIOS. All Rights Reserved. */ import Foundation // Define our Address Details var addressDetails = (company:"Apple Inc.", Address:"1 Infinite Loop", City:"Cupertino, California", Country:"United States"); // Accessing our Tuple by using their NAMES print("Accessing our Tuple using their NAMESn") print(addressDetails.company) // Get the Name print(addressDetails.Address) // Address print(addressDetails.City) // State print(addressDetails.Country) // Country // Accessing our Tuple by using their index notation print("nAccess our Tuple using their index notation:n") print(addressDetails.0) // Get the Name print(addressDetails.1) // Address print(addressDetails.2) // State print(addressDetails.3) // Country Take a look at the following screenshot now: As you can note from what we covered so far about tuples, these are really cool and are basically just like any other data type in Swift; they can be really powerful to use within your own programs. String interpolation String interpolation means embedding constants, variables, as well as expressions within your string literals. In this section, we will take a look at an example of how you can use this. Clear the contents of the playground template and replace them with the following code snippet: /*: # Swift Language Basics - String Interpolation : Created by Steven F. Daniel : Copyright © 2015 GENIESOFT STUDIOS. All Rights Reserved. */ import Foundation // Define our Address Details var addressDetails = (company:"Apple Inc.", Address:"1 Infinite Loop", City:"Cupertino, California", Country:"United States"); // Use String Interpolation to format output print("Apple Headquarters are located at: nn" + addressDetails.company + ",n" + addressDetails.Address + "n" + addressDetails.City + "n" + addressDetails.Country); Take a look at the following screenshot now: As you can note from the preceding code snippet, we started by declaring a tuple variable called addressDetails that contains a combination of strings. Next, we performed a string concatenation to generate our output in the format that we want by accessing each of the tuple elements using their index values and displaying each of these elements in the console window. Let's take this a step further and use string interpolation to place our address detail information into string variables. The result will still be the same, but I just want to show you the power of using tuples with the Swift programming language. Clear the contents of the playground template and replace them with the following highlighted code snippet: /*: # Swift Language Basics - String Interpolation : Created by Steven F. Daniel : Copyright © 2015 GENIESOFT STUDIOS. All Rights Reserved. 
*/ import Foundation // Use String Interpolation to place elements into string initializers var addressDetails = ("Apple Inc.", "1 Infinite Loop", "Cupertino, California", "United States"); let (Company, Address, City, Country) = addressDetails print("Apple Headquarters are located at: nn" + Company + ",n" + Address + "n" + City + "n" + Country); Take a look at the following screenshot now: As you can note from the preceding code snippet, we removed the named types from our addressDetails string contents, created a new type using the let keyword, and assigned placeholders for each of our tuple elements. This is very handy as it not only makes your code a lot more readable but you can also continue to create additional placeholders for the additional fields that you create. Controlling the flow In this section, we will take a look at how to use the for…in loop to iterate over a set of statements within the body of the loop until a specific condition is met. The for…in loops The for…in loops basically perform a set of statements over a certain number of times until a specific condition is met, which is typically handled by incrementing a counter each time until the loop ends. Let's take a look at a code example to see how we can put this into practice. Clear the contents of the playground template and replace them with the following code snippet: /*: # Swift Language Basics - Control Flow : Created by Steven F. Daniel : Copyright © 2015 GENIESOFT STUDIOS. All Rights Reserved. */ import Foundation // Perform a fibonacci loop using the For-In Loop var fibonacci = 0 var iTemp = 1 var jTemp = 0 for iterator in 0...19 { jTemp = fibonacci fibonacci += iTemp iTemp = jTemp print("Fibonacci: " + String(fibonacci), terminator: "n") } Take a look at the following screenshot now: The preceding code demonstrates the for…in loop and the closed range operator (...). These are often used together, but they are entirely independent. As you can note from the preceding code snippet, we declared the exact same variables: fibonacci, iTemp, and jTemp. Next, we used the for…in loop to iterate over our range, which is from 0 to 19, while displaying the current Fibonacci value in the console window. What's new in Swift 2.0 In this section, we will take a look at some of the new features that come as part of the Swift 2.0 programming language. Error handling Error handling is defined as the process of responding to and recovering from error conditions within your program. The Swift language provides first-class support for throwing, catching, propagating, and manipulating recoverable errors at runtime. In Swift, these are referred to as throwing functions and throwing methods. In Swift 2.0, error handling has vastly improved and adds an additional layer of safety to error checking. You can use the throws keyword to specify which functions and method are most likely to cause an error. You can implement and use the do, try, and catch keywords to handle something that could likely throw an error. Let's take a look at a code example to see how we can put this into practice. Clear the contents of the playground template and replace them with the following code snippet: /*: # Swift Language Basics - What's new in Swift 2.0 : Created by Steven F. Daniel : Copyright © 2015 GENIESOFT STUDIOS. All Rights Reserved. 
*/ import Foundation enum EncryptionError: ErrorType { case Empty case Short } // Method to handle the Encryption func encryptString(str: String, withPassword password: String) throws -> String { if password.characters.count > 0 { // Password is valid } else { throw EncryptionError.Empty } if password.characters.count >= 5 { // Password is valid } else { throw EncryptionError.Short } // Begin constructing our encrypted string let encrypted = password + str + password return String(encrypted.characters.reverse()) } // Call our method to encrypt our string do { let encrypted = try encryptString("Encrypted String Goes Here", withPassword: "123") print(encrypted) } catch EncryptionError.Empty { print("You must provide a password.") } catch EncryptionError.Short { print("Passwords must be at least five characters.") } catch { print("An error occurred!") } Take a look at the following screenshot now: As you can note in the preceding code, we began by creating an enum object that derives from the ErrorType class so that we could create and throw an error. Next, we created a method called encryptString that takes two parameters: str and password. This method performed a check to ensure that we didn't pass an empty password. If our method determines that we did not specify a valid password, we will automatically throw an error using EncryptionError.Empty and exit from this method. Alternatively, if we provide a valid password and string to encrypt, our string will be encrypted. Binding Binding in Swift is something new and provides a means of checking whether a variable contains a valid value prior to continuing and exiting from the method otherwise. Fortunately, Swift 2.0 provides you with exactly this, and it is called the guard keyword. Let's go back to our previous code snippet and take a look at how we can implement the guard statement to our conditional checking within our encryptedString method. Modify the contents of the playground template and replace them with the following highlighted sections: // Method to handle the Encryption func encryptString(str: String, withPassword password: String) throws -> String { guard password.characters.count > 0 else { throw EncryptionError.Empty } guard password.characters.count >= 5 else { throw EncryptionError.Short } // Begin constructing our encrypted string let encrypted = password + str + password return String(encrypted.characters.reverse()) } As you can note in the preceding code snippet, using the guard keyword, you can provide a code block to perform a conditional check within the else statement that will run if the condition fails. This will make your code cleaner as the guard statement lets you trap invalid parameters from being passed to a method. Any conditions you would have checked using if before you can now check using guard. Protocol extensions In Swift 2.0, you have the ability to extend protocols and add additional implementations for properties and methods. For example, you can choose to add additional methods to the String or Array classes, as follows: /* # What's new in Swift 2.0 - Protocol Extensions The first content line displayed in this block of rich text. */ import Foundation let greeting = "Working with Swift Rocks!" // Extend the String class to include additional methods extension CustomStringConvertible { var uCaseString: String { return "(self.description.uppercaseString)!!!" 
} } print(greeting.uCaseString) Take a look at the following screenshot now: As you can see in the preceding code, we extended the String class through the CustomStringConvertible protocol, which most Foundation class objects conform to. Protocol extensions give you a wide variety of ways to extend the base classes so that you can add and implement your very own custom functionality.
Summary
In this article, we explored how to download and install the Xcode development tools, and then moved on to using playgrounds to write Swift code and get to grips with some of the Swift programming language's features. Next, we looked at some of the newest features that come as part of the Swift 2.0 language.
Resources for Article:
Further resources on this subject: Exploring Swift [article] Playing with Swift [article] Introduction to WatchKit [article]

Creating Your Own Node Module

Soham Kamani
18 Apr 2016
6 min read
Node.js has a great community and one of the best package managers I have ever seen. One of the reasons npm is so great is because it encourages you to make small composable modules, which usually have just one responsibility. Many of the larger, more complex node modules are built by composing smaller node modules. As of this writing, npm has over 219,897 packages. One of the reasons this community is so vibrant is because it is ridiculously easy to make your own node module. This post will go through the steps to create your own node module, as well as some of the best practices to follow while doing so. Prerequisites and Installation node and npm are a given. Additionally, you should also configure your npm author details: npm set init.author.name "My Name" npm set init.author.email "your@email.com" npm set init.author.url "http://your-website.com" npm adduser These are the details that would show up on npmjs.org once you publish. Hello World The reason that I say creating a node module is ridiculously easy is because you only need two files to create the most basic version of a node module. First up, create a package.json file inside of a new folder by running the npm init command. This will ask you to choose a name. Of course, the name you are thinking of might already exist in the npm registry, so to check for this run the command npm ls owner module_name , where module_name is replaced by the namespace you want to check. If it exists, you will get information about the authors: $ npm owner ls forever indexzero <charlie.robbins@gmail.com> bradleymeck <bradley.meck@gmail.com> julianduque <julianduquej@gmail.com> jeffsu <me@jeffsu.com> jcrugzz <jcrugzz@gmail.com> If your namespace is free you would get an error message. Something similar to : $ npm owner ls does_not_exist npm ERR! owner ls Couldnt get owner data does_not_exist npm ERR! Darwin 14.5.0 npm ERR! argv "node" "/usr/local/bin/npm" "owner" "ls" "does_not_exist" npm ERR! node v0.12.4 npm ERR! npm v2.10.1 npm ERR! code E404 npm ERR! 404 Registry returned 404 GET on https://registry.npmjs.org/does_not_exist npm ERR! 404 npm ERR! 404 'does_not_exist' is not in the npm registry. npm ERR! 404 You should bug the author to publish it (or use the name yourself!) npm ERR! 404 npm ERR! 404 Note that you can also install from a npm ERR! 404 tarball, folder, http url, or git url. npm ERR! Please include the following file with any support request: npm ERR! /Users/sohamchetan/Documents/jekyll-blog/npm-debug.log After setting up package.json, add a JavaScript file: module.exports = function(){ return 'Hello World!'; } And that's it! Now execute npm publish . and your node module will be published to npmjs.org. Also, anyone can now install your node module by running npm install --save module_name, where module name is the "name" property contained in package.json. Now anyone can use your module like this : var someModule = require('module_name'); console.log(someModule()); // This will output "Hello World!" Dependencies As stated before, rarely will you find large scale node modules that do not depend on other smaller modules. This is because npm encourages modularity and composability. To add dependancies to your own module, simply install them. For example, one of the most depended upon packages is lodash, a utility library. To add this, run the command : npm install --save lodash Now you can use lodash everywhere in your module by "requiring" it, and when someone else downloads your module, they get lodash bundled along with it as well. 
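To make the dependency example concrete, here is a minimal sketch of what an index.js that uses lodash might look like. The module behaviour and the function name uniqueSorted are invented for illustration; the only real APIs used are require, module.exports, and lodash's uniq and sortBy.
// index.js - a minimal sketch of a module that depends on lodash
var _ = require('lodash');

// Return a sorted copy of the input array with duplicates removed.
module.exports = function uniqueSorted(items) {
  return _.sortBy(_.uniq(items));
};
A consumer of the published package would then use it in the usual way:
var uniqueSorted = require('module_name'); // the "name" from your package.json
console.log(uniqueSorted([3, 1, 3, 2]));   // [ 1, 2, 3 ]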
Additionally you would want to have some modules purely for development and not for distribution. These are dev-dependencies, and can be installed with the npm install --save-dev command. Dev dependencies will not install when someone else installs your node module. Configuring package.json The package.json file is what contains all the metadata for your node_module. A few fields are filled out automatically (like dependencies or devDependencies during npm installs). There are a few more fields in package.json that you should consider filling out so that your node module is best fitted to its purpose. "main": The relative path of the entry point of your module. Whatever is assigned to module.exports in this file is exported when someone "requires" your module. By default this is the index.js file. "keywords": It’s an array of keywords describing your module. Quite helpful when others from the community are searching for something that your module happens to solve. "license": I normally publish all my packages with an "MIT" licence because of its openness and popularity in the open source community. "version": This is pretty crucial because you cannot publish a node module with the same version twice. Normally, semver versioning should be followed. If you want to know more about the different properties you can set in package.json there's a great interactive guide you can check out. Using Yeoman Generators Although it's really simple to make a basic node module, it can be quite a task to make something substantial using just index.js nd package.json file. In these cases, there's a lot more to do, such as: Writing and running tests. Setting up a CI tool like Travis. Measuring code coverage. Installing standard dev dependencies for testing. Fortunately, there are many Yeoman generators to help you bootstrap your project. Check out generator-nm for setting up a basic project structure for a simple node module. If writing in ES6 is more your style, you can take a look at generator-nm-es6. These generators get your project structure, complete with a testing framework and CI integration so that you don't have to spend all your time writing boilerplate code. About the Author Soham Kamani is a full-stack web developer and electronics hobbyist.  He is especially interested in JavaScript, Python, and IoT.


Setting up a Build Chain with Grunt

Packt
18 Apr 2016
24 min read
In this article by Bass Jobsen, author of the book Sass and Compass Designer's Cookbook you will learn the following topics: Installing Grunt Installing Grunt plugins Utilizing the Gruntfile.js file Adding a configuration definition for a plugin Adding the Sass compiler task (For more resources related to this topic, see here.) This article introduces you to the Grunt Task Runner and the features it offers to make your development workflow a delight. Grunt is a JavaScript Task Runner that is installed and managed via npm, the Node.js package manager. You will learn how to take advantage of its plugins to set up your own flexible and productive workflow, which will enable you to compile your Sass code. Although there are many applications available for compiling Sass, Grunt is a more flexible, versatile, and cross-platform tool that will allow you to automate many development tasks, including Sass compilation. It can not only automate the Sass compilation tasks, but also wrap any other mundane jobs, such as linting and minifying and cleaning your code, into tasks and run them automatically for you. By the end of this article, you will be comfortable using Grunt and its plugins to establish a flexible workflow when working with Sass. Using Grunt in your workflow is vital. You will then be shown how to combine Grunt's plugins to establish a workflow for compiling Sass in real time. Grunt becomes a tool to automate integration testing, deployments, builds, and development in which you can use. Finally, by understanding the automation process, you will also learn how to use alternative tools, such as Gulp. Gulp is a JavaScript task runner for node.js and relatively new in comparison to Grunt, so Grunt has more plugins and a wider community support. Currently, the Gulp community is growing fast. The biggest difference between Grunt and Gulp is that Gulp does not save intermediary files, but pipes these files' content in memory to the next stream. A stream enables you to pass some data through a function, which will modify the data and then pass the modified data to the next function. In many situations, Gulp requires less configuration settings, so some people find Gulp more intuitive and easier to learn. In this article, Grunt has been chosen to demonstrate how to run a task runner; this choice does not mean that you will have to prefer the usage of Grunt in your own project. Both the task runners can run all the tasks described in this article. Simply choose the task runner that suits you best. This recipe demonstrates shortly how to compile your Sass code with Gulp. In this article, you should enter your commands in the command prompt. Linux users should open a terminal, while Mac users should run Terminal.app and Window users should use the cmd command for command line usage. Installing Grunt Grunt is essentially a Node.js module; therefore, it requires Node.js to be installed. The goal of this recipe is to show you how to install Grunt on your system and set up your project. Getting ready Installing Grunt requires both Node.js and npm. Node.js is a platform built on Chrome's JavaScript runtime for easily building fast, scalable network applications, and npm is a package manager for Node.js. You can download the Node.js source code or a prebuilt installer for your platform at https://nodejs.org/en/download/. Notice that npm is bundled with node. Also, read the instructions at https://github.com/npm/npm#super-easy-install. How to do it... 
After installing Node.js and npm, installing Grunt is as simple as running a single command, regardless of the operating system that you are using. Just open the command line or the Terminal and execute the following command: npm install -g grunt-cli That's it! This command will install Grunt globally and make it accessible anywhere on your system. Run the grunt --version command in the command prompt in order to confirm that Grunt has been successfully installed. If the installation is successful, you should see the version of Grunt in the Terminal's output: grunt --version grunt-cli v0.1.11 After installing Grunt, the next step is to set it up for your project: Make a folder on your desktop and call it workflow. Then, navigate to it and run the npm init command to initialize the setup process: mkdir workflow && cd $_ && npm init Press Enter for all the questions and accept the defaults. You can change these settings later. This should create a file called package.json that will contain some information about the project and the project's dependencies. In order to add Grunt as a dependency, install the Grunt package as follows: npm install grunt --save-dev Now, if you look at the package.json file, you should see that Grunt is added to the list of dependencies: ..."devDependencies": {"grunt": "~0.4.5" } In addition, you should see an extra folder created. Called node_modules, it will contain Grunt and other modules that you will install later in this article. How it works... In the preceding section, you installed Grunt (grunt-cli) with the -g option. The -g option installs Grunt globally on your system. Global installation requires superuser or administrator rights on most systems. You need to run only the globally installed packages from the command line. Everything that you will use with the require() function in your programs should be installed locally in the root of your project. Local installation makes it possible to solve your project's specific dependencies. More information about global versus local installation of npm modules can be found at https://www.npmjs.org/doc/faq.html. There's more... Node package managers are available for a wide range of operation systems, including Windows, OSX, Linux, SunOS, and FreeBSD. A complete list of package managers can be found at https://github.com/joyent/node/wiki/Installing-Node.js-via-package-manager. Notice that these package managers are not maintained by the Node.js core team. Instead, each package manager has its own maintainer. See also The npm Registry is a public collection of packages of open source code for Node.js, frontend web apps, mobile apps, robots, routers, and countless other needs of the JavaScript community. You can find the npm Registry at https://www.npmjs.org/. Also, notice that you do not have to use Task Runners to create build chains. Keith Cirkel wrote about how to use npm as a build tool at http://blog.keithcirkel.co.uk/how-to-use-npm-as-a-build-tool/. Installing Grunt plugins Grunt plugins are the heart of Grunt. Every plugin serves a specific purpose and can also work together with other plugins. In order to use Grunt to set up your Sass workflow, you need to install several plugins. You can find more information about these plugins in this recipe's How it works... section. Getting ready Before you install the plugins, you should first create some basic files and folders for the project. You should install Grunt and create a package.json file for your project. 
Also, create an index.html file to inspect the results in your browser. Two empty folders should be created too. The scss folder contains your Sass code and the css folder contains the compiled CSS code. Navigate to the root of the project, repeat the steps from the Installing Grunt recipe of this article, and create some additional files and directories that you are going to work with throughout the article. In the end, you should end up with the following folder and file structure: How to do it... Grunt plugins are essentially Node.js modules that can be installed and added to the package.json file in the list of dependencies using npm. To do this, follow the ensuing steps: Navigate to the root of the project and run the following command, as described in the Installing Grunt recipe of this article: npm init Install the modules using npm, as follows: npm install grunt-contrib-sass load-grunt-tasks grunt-postcss --save-dev Notice the single space before the backslash in each line. For example, on the second line, grunt-contrib-sass , there is a space before the backslash at the end of the line. The space characters are necessary because they act as separators. The backslash at the end is used to continue the commands on the next line. The npm install command will download all the plugins and place them in the node_modules folder in addition to including them in the package.json file. The next step is to include these plugins in the Gruntfile.js file. How it works... Grunt plugins can be installed and added to the package.json file using the npm install command followed by the name of the plugins separated by a space, and the --save-dev flag: npm install nameOfPlugin1 nameOfPlugin2 --save-dev The --save-dev flag adds the plugin names and a tilde version range to the list of dependencies in the package.json file so that the next time you need to install the plugins, all you need to do is run the npm install command. This command looks for the package.json file in the directory from which it was called, and will automatically download all the specified plugins. This makes porting workflows very easy; all it takes is copying the package.json file and running the npm install command. Finally, the package.json file contains a JSON object with metadata. It is also worth explaining the long command that you have used to install the plugins in this recipe. This command installs the plugins that are continued on to the next line by the backslash. It is essentially equivalent to the following: npm install grunt-contrib-sass –-save-dev npm install load-grunt-tasks –-save-dev npm install grunt-postcss –-save-dev As you can see, it is very repetitive. However, both yield the same results; it is up to you to choose the one that you feel more comfortable with. The node_modules folder contains all the plugins that you install with npm. Every time you run npm install name-of-plugin, the plugin is downloaded and placed in the folder. If you need to port your workflow, you do not need to copy all the contents of the folder. In addition, if you are using a version control system, such as Git, you should add the node_modules folder to the .gitignore file so that the folder and its subdirectories are ignored. There's more... Each Grunt plugin also has its own metadata set in a package.json file, so plugins can have different dependencies. 
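As a point of reference before looking at the plugins' own metadata, the install command above will also have added a devDependencies block to your project's package.json. The fragment below is purely illustrative; the exact version ranges depend on the releases that are current when you run the command:
{
  "devDependencies": {
    "grunt": "~0.4.5",
    "grunt-contrib-sass": "~0.9.0",
    "grunt-postcss": "~0.7.0",
    "load-grunt-tasks": "~3.4.0"
  }
}
Returning to the plugins themselves, each one declares its metadata in the same format.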
For instance, the grunt-contrib-sass plugin, as described in the Adding the Sass compiler task recipe, has set its dependencies as follows: "dependencies": {     "async": "^0.9.0",     "chalk": "^0.5.1",     "cross-spawn": "^0.2.3",     "dargs": "^4.0.0",     "which": "^1.0.5"   } Besides the dependencies described previously, this task also requires you to have Ruby and Sass installed. In the following list, you will find the plugins used in this article, followed by a brief description: load-grunt-tasks: This loads all the plugins listed in the package.json file grunt-contrib-sass: This compiles Sass files into CSS code grunt-postcss: This enables you to apply one or more postprocessors to your compiled CSS code CSS postprocessors enable you to change your CSS code after compilation. In addition to installing plugins, you can remove them as well. You can remove a plugin using the npm uninstall name-of-plugin command, where name-of-plugin is the name of the plugin that you wish to remove. For example, if a line in the list of dependencies of your package.json file contains grunt-concurrent": "~0.4.2",, then you can remove it using the following command: npm uninstall grunt-concurrent Then, you just need to make sure to remove the name of the plugin from your package.json file so that it is not loaded by the load-grunt-tasks plugin the next time you run a Grunt task. Running the npm prune command after removing the items from the package.json file will also remove the plugins. The prune command removes extraneous packages that are not listed in the parent package's dependencies list. See also More information on the npm version's syntax can be found at https://www. npmjs.org/doc/misc/semver.html  Also, see http://caniuse.com/ for more information on the Can I Use database Utilizing the Gruntfile.js file The Gruntfile.js file is the main configuration file for Grunt that handles all the tasks and task configurations. All the tasks and plugins are loaded using this file. In this recipe, you will create this file and will learn how to load Grunt plugins using it. Getting ready First, you need to install Node and Grunt, as described in the Installing Grunt recipe of this article. You will also have to install some Grunt plugins, as described in the Installing Grunt plugins recipe of this article. How to do it... Once you have installed Node and Grunt, follow these steps: In your Grunt project directory (the folder that contains the package.json file), create a new file, save it as Gruntfile.js, and add the following lines to it: module.exports = function(grunt) {   grunt.initConfig({     pkg: grunt.file.readJSON('package.json'),       //Add the Tasks configurations here.   }); // Define Tasks here }; This is the simplest form of the Gruntfile.js file that only contains two information variables. The next step is to load the plugins that you installed in the Installing Grunt plugins recipe. Add the following lines at the end of your Gruntfile.js file: grunt.loadNpmTasks('grunt-sass'); In the preceding line of code, grunt-sass is the name of the plugin you want to load. That is all it takes to load all the necessary plugins. The next step is to add the configurations for each task to the Gruntfile.js file. How it works... Any Grunt plugin can be loaded by adding a line of JavaScript to the Gruntfile.js file, as follows: grunt.loadNpmTasks('name-of-module'); This line should be added every time a new plugin is installed so that Grunt can access the plugin's functions. 
However, it is tedious to load every single plugin that you install. In addition, you will soon notice that, as your project grows, the number of configuration lines will increase as well. The Gruntfile.js file should be written in JavaScript or CoffeeScript. Grunt tasks rely on configuration data defined in a JSON object passed to the grunt.initConfig method. JavaScript Object Notation (JSON) is an alternative for XML and used for data exchange. JSON describes name-value pairs written as "name": "value". All the JSON data is separated by commas with JSON objects written inside curly brackets and JSON arrays inside square brackets. Each object can hold more than one name/value pair with each array holding one or more objects. You can also group tasks into one task. Your alias groups of tasks using the following line of code: grunt.registerTask('alias',['task1', 'task2']); There's more... Instead of loading all the required Grunt plugins one by one, you can load them automatically with the load-grunt-tasks plugin. You can install this by using the following command in the root of your project: npm install load-grunt-tasks --save-dev Then, add the following line at the very beginning of your Gruntfile.js file after module.exports: require('load-grunt-tasks')(grunt); Now, your Gruntfile.js file should look like this: module.exports = function(grunt) {   require('load-grunt-tasks')(grunt);   grunt.initConfig({     pkg: grunt.file.readJSON('package.json'),       //Add the Tasks configurations here.   }); // Define Tasks here }; The load-grunt-tasks plugin loads all the plugins specified in the package.json file. It simply loads the plugins that begin with the grunt- prefix or any pattern that you specify. This plugin will also read dependencies, devDependencies, and peerDependencies in your package.json file and load the Grunt tasks that match the provided patterns. A pattern to load specifically chosen plugins can be added as a second parameter. You can load, for instance, all the grunt-contrib tasks with the following code in your Gruntfile.js file: require('load-grunt-tasks')(grunt, {pattern: 'grunt-contrib-*'}); See also Read more about the load-grunt-tasks module at https://github.com/sindresorhus/load-grunt-task Adding a configuration definition for a plugin Any Grunt task needs a configuration definition. The configuration definitions are usually added to the Gruntfile.js file itself and are very easy to set up. In addition, it is very convenient to define and work with them because they are all written in the JSON format. This makes it very easy to spot the configurations in the plugin's documentation examples and add them to your Gruntfile.js file. In this recipe, you will learn how to add the configuration for a Grunt task. Getting ready For this recipe, you will first need to create a basic Gruntfile.js file and install the plugin you want to configure. If you want to install the grunt-example plugin, you can install it using the following command in the root of your project: npm install grunt-example --save-dev How to do it... Once you have created the basic Gruntfile.js file (also refer to the Utilizing the Gruntfile.js file recipe of this article), follow this step: A simple form of the task configuration is shown in the following code. Start by adding it to your Gruntfile.js file wrapped inside grunt.initConfig{}: example: {   subtask: {    files: {      "stylesheets/main.css":      "sass/main.scss"     }   } } How it works... 
If you look closely at the task configuration, you will notice the files field that specifies what files are going to be operated on. The files field is a very standard field that appears in almost all the Grunt plugins simply due to the fact that many tasks require some or many file manipulations. There's more... The Don't Repeat Yourself (DRY) principle can be applied to your Grunt configuration too. First, define the name and the path added to the beginning of the Gruntfile.js file as follows: app {  dev : "app/dev" } Using the templates is a key in order to avoid hard coded values and inflexible configurations. In addition, you should have noticed that the template has been used using the <%= %> delimiter to expand the value of the development directory: "<%= app.dev %>/css/main.css": "<%= app.dev %>/scss/main.scss"   The <%= %> delimiter essentially executes inline JavaScript and replaces values, as you can see in the following code:   "app/dev/css/main.css": "app/dev/scss/main.scss" So, put simply, the value defined in the app object at the top of the Gruntfile.js file is evaluated and replaced. If you decide to change the name of your development directory, for example, all you need to do is change the app's variable that is defined at the top of your Gruntfile.js file. Finally, it is also worth mentioning that the value for the template does not necessarily have to be a string and can be a JavaScript literal. See also You can read more about templates in the Templates section of Grunt's documentation at http://gruntjs.com/configuring- tasks#templates Adding the Sass compiler task The Sass tasks are the core task that you will need for your Sass development. It has several features and options, but at the heart of it is the Sass compiler that can compile your Sass files into CSS. By the end of this recipe, you will have a good understanding of this plugin, how to add it to your Gruntfile.js file, and how to take advantage of it. In this recipe, the grunt-contrib-sass plugin will be used. This plugin compiles your Sass code by using Ruby Sass. You should use the grunt-sass plugin to compile Sass into CSS with node-sass (LibSass). Getting ready The only requirement for this recipe is to have the grunt-contrib-sass plugin installed and loaded in your Gruntfile.js file. If you have not installed this plugin in the Installing Grunt Plugins recipe of this article, you can do this using the following command in the root of your project: npm install grunt-contrib-sass --save-dev You should also install grunt local by running the following command: npm install grunt --save-dev Finally, your project should have the file and directory, as describe in the Installing Grunt plugins recipe of this article. How to do it... An example of the Sass task configuration is shown in the following code. Start by adding it to your Gruntfile.js file wrapped inside the grunt.initConfig({}) code. Now, your Gruntfile.js file should look as follows: module.exports = function(grunt) {   grunt.initConfig({     //Add the Tasks configurations here.     
sass: {                                            dist: {                                            options: {                                       style: 'expanded'         },         files: {                                         'stylesheets/main.css': 'sass/main.scss'  'source'         }       }     }   });     grunt.loadNpmTasks('grunt-contrib-sass');     // Define Tasks here    grunt.registerTask('default', ['sass']);  } Then, run the following command in your console: grunt sass The preceding command will create a new stylesheets/main.css file. Also, notice that the stylesheets/main.css.map file has also been automatically created. The Sass compiler task creates CSS sourcemaps to debug your code by default. How it works... In addition to setting up the task configuration, you should run the Grunt command to test the Sass task. When you run the grunt sass command, Grunt will look for a configuration called Sass in the Gruntfile.js file. Once it finds it, it will run the task with some default options if they are not explicitly defined. Successful tasks will end with the following message: Done, without errors. There's more... There are several other options that you can include in the Sass task. An option can also be set at the global Sass task level, so the option will be applied in all the subtasks of Sass. In addition to options, Grunt also provides targets for every task to allow you to set different configurations for the same task. In other words, if, for example, you need to have two different versions of the Sass task with different source and destination folders, you could easily use two different targets. Adding and executing targets are very easy. Adding more builds just follows the JSON notation, as shown here:    sass: {                                      // Task       dev: {                                    // Target         options: {                               // Target options           style: 'expanded'         },         files: {                                 // Dictionary of files         'stylesheets/main.css': 'sass/main.scss' // 'destination': 'source'         }       },       dist: {                               options: {                        style: 'expanded',           sourcemap: 'none'                  },         files: {                                      'stylesheets/main.min.css': 'sass/main.scss'         }       }     } In the preceding example, two builds are defined. The first one is named dev and the second is called dist. Each of these targets belongs to the Sass task, but they use different options and different folders for the source and the compiled Sass code. Moreover, you can run a particular target using grunt sass:nameOfTarget, where nameOfTarge is the name of the target that you are trying to use. So, for example, if you need to run the dist target, you will have to run the grunt sass:dist command in your console. However, if you need to run both the targets, you could simply run grunt sass and it would run both the targets sequentially. As already mentioned, the grunt-contrib-sass plugin compiles your Sass code by using Ruby Sass, and you should use the grunt-sass plugin to compile Sass to CSS with node-sass (LibSass). 
To switch to the grunt-sass plugin, you will have to install it locally first by running the following command in your console: npm install grunt-sass Then, replace grunt.loadNpmTasks('grunt-contrib-sass'); with grunt.loadNpmTasks('grunt-sass'); in the Gruntfile.js file; the basic options for grunt-contrib-sass and grunt-sass are very similar, so you have to change the options for the Sass task when switching to grunt-sass. Finally, notice that grunt-contrib-sass also has an option to turn Compass on. See also Please refer to Grunt's documentation for a full list of options, which is available at https://gruntjs/grunt-contrib-sass#options Also, read Grunt's documentation for more details about configuring your tasks and targets at http://gruntjs.com/configuring-tasks#task-configuration-and-targets github.com/ Summary In this article you studied about installing Grunt, installing Grunt plugins, utilizing the Gruntfile.js file, adding a configuration definition for a plugin and adding the Sass compiler task. Resources for Article: Further resources on this subject: Meeting SAP Lumira [article] Security in Microsoft Azure [article] Basic Concepts of Machine Learning and Logistic Regression Example in Mahout [article]

Configuring Redmine

Packt
18 Apr 2016
15 min read
In this article by Andriy Lesyuk, author of Mastering Redmine, whentalking about the web interface (that is, not system files), all of the global configuration of Redmine can be done on the Settings page of the Administration menu. This is actually the page that this articleis based around. Some settings on this page, however, depend on special system files or third-party tools that need to be installed. And these are the other things that we will discuss. You might expect to see detailed explanations for all the administration settings here, but instead, we will review in detail only a few of them, as I believe that the others do not need to be explained or can easily be tested. So generally, we will focus on hard-to-understand settings and thosesettings that need to be configured additionally in some special way or have some obscurities. So, why should you read this articleif you are not an administrator? Some features of Redmine are available only if they have been configured, so by reading this article, you will learn what extra features exist and get an idea of how to enable them. In this article, we will cover the following topics: The first thing to fix The general settings Authentication (For more resources related to this topic, see here.) The first thing to fix A fresh Redmine installation has only one user account, which has administrator privileges. You can see it in the following screenshot: This account is exactly the same by default on all Redmine installations. That's why it is extremely important to change its credentials immediately after you complete the installation, especially for Redmine instances that can be accessed publicly. The administrator credentials can be changed on the Users page of the Administration menu. To do this, click on the admin link. You will see this screen: In this form, you should specify a new password in the Password and Confirmation fields. Also, it's recommended that you change the login to something different. Additionally, consider specifying your e-mail instead of admin@example.net (at least), changing the First name and Last name. The general settings Everything that is possible to configure at the global level (the opposite is the project level) can be found under the Administration link in the top-left menu. Of course, this link is available only for administrators If you click on the Administrationlink, you will get the list of available administration pages on the sidebar to the right. Most of them are for managing Redmine objects, such as projects and trackers. We will be discussing only general, system-wide configuration. Most of the settings that we are going to review are compiled on the Settings page, as shown in the following screenshot: As all of these settings can't fit on a single page, Redmine organizes them into tabs. We will discuss the Authentication, Email notifications, Incoming emails, and Repositories tabs in the next sections. The General tab So let's start with the General tab, which can be seen in the previous screenshot. 
Settings in this tab control the general behavior of Redmine, thus Application title is the name of the website that is shown at the top of non-project pages, Welcome text is displayed on the start page of Redmine, Objects per page options specifies how many objects users will be able to see on a page, such settings as Search results per page and Days displayed on project activity allow to control the number of objects that are shown on search results and activity pages correspondingly, the Protocol setting specifies the preferred protocol that will be used in links to the website, Wiki history compression controls whether the history of Wiki changes should be compressed to save the space, and finally Maximum number of items in Atom feeds sets the limit for the amount of items that are returned in the Atom feed. Additionally, the General tab contains settings, which I want to discuss in detail. The Cache formatted text setting Redmine supports text formatting through the lightweight markup language Textile or Markdown. While conversion of text from such a language to HTML is quite fast, in some circumstances, you may want to cache the resulting HTML. If that is the case, the Cache formatted text checkbox is what you need. When this setting is enabled, all Textile or Markdown content that is larger than 2 KB will be cached. The cached HTML will be refreshed only when any changes are made to the source text, so you should take this into account if you are using a Wiki extension that generates the dynamic content (such as my WikiNG plugin). Unless performance is extremely critical for you, you should leave this checkbox unchecked. Other settings tips Here are some other tips for the General tab: The value of the Host name and path setting will be used to generate URLs in the e-mail messages that will be sent to users, so it's important to specify a proper value here. For the Text formatting, select the markup language that is best for you. It's also possible to select none here, but I would not recommend to do this. The Display tab As it comes from the name, this tab contains settings related to the look and feel of Redmine. Its settings can be seen in the following screenshot: Using the Theme setting users can choose a theme for the Redmine interface. The Default language setting allows to specify which language will be used for the interface, if Redmine fails to determine the language of the user. Thus, for not logged-in users it will attempt to use the preferred language of the user's browser, what can be disabled by the Force default language for anonymous users setting, and for logged-in users it will use the language that is chosen by users in their profiles, what can be disabled by the Force default language for logged-in users setting. By default the user's language also affects the start day of the week, and date and time formats, what can also be changed by the Start calendars on, Date format and Time format settings correspondingly. The display format of the user name is controlled by the Users display format setting. Finally, the Thumbnails size (in pixels) setting specifies the size of thumbnail images in pixels. Now let's check what the rest of settings mean. The Use Gravatar user icons setting Once I used a WordPress form to leave a comment on someone's blog. That form asked me to specify the first name, the last name, my e-mail address, and the text. After submitting it, I was surprised to see my photo near the comment. That's what Gravatar does. 
Gravatar stands for Globally Recognized Avatar. It's a web service that allows you to assign an image for each user's e-mail. Then, third-party sites can fetch the corresponding image by supplying a hash of the user's e-mail address. The Use Gravatar user icons setting enables this behavior for Redmine. Having this option checked is a good idea (unless potential users of your Redmine installation can be unable to access Internet because, for example, Redmine is going to be used in an isolated Intranet. The Default Gravatar image setting What happens if a Gravatar is not available for the user's e-mail? In such cases, the Gravatar service returns a default image, which depends on the Default Gravatar image setting. The following table shows the six available themes of the default avatar image: Theme Sample image Description None The default image, which is shown if no other theme is selected Wavatars A generated face with differing features and background Identicons A geometric pattern Monster IDs A generated monster image with different colors, face, and so on Retro A generated 8-bit, arcade-style pixelated face Mystery man A simple, cartoon-style silhouetted outline of a person   For all of these themes, except Mystery manandnone, Gravatar generates an avatar image that is based on the hash of the user's e-mail and is therefore unique to it. The Redmine Local Avatars plugin Consider installing the Redmine Local Avatars plugin by Andrew Chaika, Luca Pireddu, and Ricardo Santos, if you preferwant users to upload their avatars directly onto Redmine: https://github.com/thorin/redmine_local_avatars This plugin will also let your users take their pictures with web cameras. The Display attachment thumbnails setting If the Display attachment thumbnails setting is enabled, all image attachments—no matter what object (for example, Wiki or issue) they are attached to—will be also seen under the attachment list as clickable thumbnails. If the user clicks on such a thumbnail, the full-size image will be opened. The Redmine Lightbox 2 plugin In pure Redmine, full-size images are opened in the same browser window. To open them in a lightbox, you can use the Lightbox 2 plugin that was created by Genki Zhang and Tobias Fischer: https://github.com/paginagmbh/redmine_lightbox2 Note that in order for this setting to work, you must have the ImageMagick's convert tool installed. The API tab In addition to the web interface that is intended for human Redmine comes with a special REST application programming interface (API) that is intended for third-party applications. Thus, Redmine REST API is used by Redmine Mylyn Connector for Eclipse and RedmineApp for iPhone. This interface can be enabled and configured under the API tab of the Settings page which is shown in the following screenshot: Let's check what these settings mean: If you need to support integration of third-party tools, you should turn on Redmine REST API using the Enable REST web service checkbox. But it is safe to keep this setting disabled, if you are not using any external Redmine tools. Redmine API can also be used via JavaScript in the web browser, but not if the API client (that is, a website, that runs JavaScript) is on different domain. In such cases to bypass the browser's same-origin policy the API client may use the technique called JSONP. As this technique is considered to be insecure it should be explicitly enabled using the Enable JSONP support setting. So in most cases you should leave this option disabled. 
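To give a flavour of what the REST web service looks like once it is enabled, the following sketch requests a project's issues in JSON format and authenticates with a user's API key. The host name, project identifier, and key are placeholders; the key itself can be copied from the user's account page once the REST API has been enabled.
# List the first five issues of a project in JSON format,
# authenticating with the X-Redmine-API-Key header.
curl -H "X-Redmine-API-Key: 0123456789abcdef" \
  "https://redmine.example.com/issues.json?project_id=my-project&limit=5"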
The Files tab The Files tab contains settings related to file display and attachment as shown in the following screenshot: Here Allowed extensions and Disallowed extensions can be used to restrict file uploads by file extensions – thus you can use the former setting to allow certain extensions only or the latter one to forbid certain extensions only. Such settings as Maximum size of text files displayed inline and Maximum number of diff lines displayed control the amount of the file content that can be displayed. The rest settings are used more often: You may need to change the Maximum attachment size setting to a large value (which is in kB). Thus, project files (releases) are attachments as well, so if you expect your users to upload large files, consider changing this setting to a bigger value. The value of the Attachments and repositories encodings option is used to convert commit messages to UTF-8. Authentication There are two pages in Redmine intended for configuring the authentication. The first one is the Authentication tab on the Settings page, and the second one is the special LDAP Authentication page, which can be found in the Administration menu. Let's discuss these pages in detail. The Authentication tab The next tab in the administration settings is Authentication. The following screenshot shows the various options available under this tab: If the Authentication required setting is enabled, users won't be able to see the content of your Redmine without having logged in first. The Autologin setting can be used to let your users keep themselves logged in for some period of time using their browsers. The Self-registration setting controls, how user accounts are activated (the manual account activation option means that users should be enabled by administrators). The Allow users to delete their own account setting controls, whether users will be able to delete their accounts. The Minimum password length setting specifies the minimum size of the password in characters and the Require password change after setting can be used to force users to change their passwords periodically. The Lost password setting controls, whether users will be able to restore their passwords in cases when they, for example, have forgotten them. And finally the Maximum number of additional email addresses setting specifies the number of additional email addresses a user account may have. After a user logs in Redmine opens a user session. The lifetime of such session is controlled by the Session maximum lifetime setting (value disabled means that the session hangs forever). Such session can also be automatically terminated, if the user was not active for some time, what is controlled by the Session inactivity timeout setting (value disabled means that the session never expires). Now, let's discuss the very special setting, which we skipped. The Allow OpenID login and registration setting If you are running a public website with open registration, you perhaps know (or you will know if you want your Redmine installation to be public and open for user registration) that users do not like to register on each new site. This is understandable, as they do not want to create another password to remember or share their existing password with a new and therefore untrusted website. Besides, it's also a matter of sharing the e-mail address and—sometimes—remembering another login. That's when OpenID comes in handy. 
OpenID is an open-standard authentication protocol in which authentication (password verification) is performed by the OpenID provider. This popular protocol is currently supported by many companies, such as Yahoo!, PayPal, AOL, LiveJournal, IBM, VeriSign, and WordPress. In other words, servers of such companies can act as OpenID providers, and therefore users can log in to Redmine using their accounts that they have on these companies' websites if the Allow OpenID login and registration setting is enabled. Google used to support OpenID too, but they shut it down recently in favor of the OAuth2.0-based OpenID Connect authentication protocol. Despite the use of "OpenID" in its name, OpenID Connect is very different from OpenID. So, if your Redmine installation is (or is going to be) public, consider enabling this setting. But note that to log in using this protocol, your users will need to specify OpenID URL (the URL of the OpenID provider) in addition to Login and Password, as it can be seen on the following Redmine login form: LDAP authentication Just as OpenID is convenient for public sites to be used to authenticate external users, LDAP is convenient for private sites—to authenticate corporate users. Like OpenID, LDAP is a standard that describes how to authenticate against a special LDAP directory server, and is widely used by many applications such as MediaWiki, Apache, JIRA, Samba, SugarCRM, and so on. Also, as LDAP is an open protocol, it is supported by some other directory servers, such as Microsoft Active Directory and Apple Open Directory. For this reason, it is often used by companies as a centralized users' directory and an authentication server. To allow users to authenticate against an LDAP server, you should add it to the list of supported authentication modes on the LDAP authentication page, which is available in the Administration menu. To add a mode, click on the New authentication mode link. This will open the form: If the On-the-fly user creation option is checked, user accounts will be created automatically when users log in to the system for the first time. If this option is not checked, users will have to be added manually beforehand. Also, if you check this option, you need to specify all the attributes in the Attributes box, as they are going to be used to import user details from the LDAP server. Check with your LDAP server administrator to find out what values should be used in this form. In Redmine, LDAP authentication can be performed against many LDAP servers. Every such server is represented as an authentication source in the authentication mode list, which has just been mentioned. The corresponding source can also be seen in the user's profile and can even be changed to the internal Redmine authentication if needed. Summary I guess you have become a bit tired with all those general details, installations, configurations, integrations, and so on. You might expect to see detailed explanations for all the administration settings here, but instead, we will review in detail only a few of them, as I believe that the others do not need to be explained or can easily be tested. So generally, we will focus on hard-to-understand settings and those settings that need to be configured additionally in some special way or have some obscurities. Resources for Article: Further resources on this subject: Project management with Redmine [article] Redmine - Permissions and Security [article] Installing and customizing Redmine [article]


Finding Patterns in the Noise – Clustering and Unsupervised Learning

Packt
15 Apr 2016
17 min read
In this article by Joseph J., author of Mastering Predictive Analytics with Python, we will cover one of the natural questions to ask about a dataset: does it contain groups? For example, if we examine financial markets as a time series of prices over time, are there groups of stocks that behave similarly? Likewise, in a set of customer financial transactions from an e-commerce business, are there user accounts distinguished by patterns of similar purchasing activity? By identifying groups using the methods described in this article, we can understand the data as a set of larger patterns rather than just individual points. These patterns can help in making high-level summaries at the outset of a predictive modeling project, or serve as an ongoing way to report on the shape of the data we are modeling. Likewise, the groupings produced can serve as insights themselves, or they can provide starting points for the models. For example, the group to which a datapoint is assigned can become a feature of that observation, adding information beyond its individual values. Additionally, we can potentially calculate statistics (such as the mean and standard deviation) of other features within these groups, which may be more robust as model features than individual entries. (For more resources related to this topic, see here.)
In contrast to supervised methods, grouping or clustering algorithms are known as unsupervised learning, meaning that we have no response, such as a sale price or click-through rate, that is used to determine the optimal parameters of the algorithm. Rather, we identify similar datapoints and, as a secondary analysis, might ask whether the clusters we identify share a common pattern in their responses (which would suggest that the clustering is useful for finding groups associated with the outcome we are interested in).
The task of finding these groups, or clusters, has a few common ingredients that vary between algorithms. One is a notion of distance or similarity between items in the dataset, which allows us to compare them. A second is the number of groups we wish to identify; this can be specified up front using domain knowledge, or determined by running the algorithm with different numbers of groups and choosing the number that best describes the dataset, as judged by the numerical variance within the groups. Finally, we need a way to measure the quality of the groups we have identified; this can be done either visually or through the statistics that we will cover.
In this article, we will dive into: how to normalize data for use in a clustering algorithm and compute similarity measurements for both categorical and numerical data; how to use k-means to identify an optimal number of clusters by examining the loss function; how to use agglomerative clustering to identify clusters at different scales; how to use affinity propagation to automatically identify the number of clusters in a dataset; and how to use spectral methods to cluster data with nonlinear boundaries.
Similarity and distance
The first step in clustering any new dataset is to decide how to compare the similarity (or dissimilarity) between items. Sometimes the choice is dictated by the kind of similarity we are trying to measure; in other cases it is restricted by the properties of the dataset.
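Before turning to specific distance measures, here is a minimal sketch of one idea from the overview above: judging candidate numbers of groups by the within-cluster variance that k-means reports (exposed by scikit-learn as the inertia_ attribute). The data here is random and purely illustrative.
import numpy as np
from sklearn.cluster import KMeans

# Purely illustrative data: 300 points in 4 dimensions.
X = np.random.RandomState(0).randn(300, 4)

# Fit k-means for several candidate cluster counts and record the
# within-cluster sum of squares; an "elbow" in these values is one
# common heuristic for choosing the number of groups.
for k in range(2, 8):
    model = KMeans(n_clusters=k, random_state=0).fit(X)
    print(k, model.inertia_)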
In the following sections, we illustrate several kinds of distance for numerical, categorical, time series, and set-based data; while this list is not exhaustive, it should cover many of the common use cases you will encounter in business analysis. We will also cover the normalizations that may be needed for different data types prior to running clustering algorithms.
Numerical distances
Let's begin by looking at an example contained in the wine.data file. It contains a set of chemical measurements that describe the properties of different kinds of wines, and the quality class (I-III) to which each wine is assigned (Forina, M. et al, PARVUS - An Extendible Package for Data Exploration, Classification and Correlation, Institute of Pharmaceutical and Food Analysis and Technologies, Via Brigata Salerno, 16147 Genoa, Italy). Open the file in an IPython notebook and look at the first few rows: Notice that in this dataset we have no column descriptions. We need to parse these from the accompanying dataset description file (wine.names). With the following code, we generate a regular expression that will match a header name (we match a pattern where a number followed by a parenthesis has a column name after it, as you can see in the list of column names in the file), and add these to an array of column names along with the first column, which is the class label of the wine (whether it belongs to category I-III). We then assign this list to the dataframe column names: Now that we have appended the column names, we can look at a summary of the dataset: How can we calculate the similarity between wines based on this data? One option would be to consider each of the wines as a point in a thirteen-dimensional space specified by its measurements (that is, each of the properties other than the class). Since the resulting space has thirteen dimensions, we can't directly visualize the datapoints using a scatterplot to see whether they are nearby, but we can calculate distances just the same as in a more familiar 2- or 3-dimensional space using the Euclidean distance formula, which is simply the length of the straight line between two points. This formula can be used whether the points lie in a 2-dimensional plot or in a more complex space such as this example, and is given by:
d(a, b) = sqrt((a1 - b1)^2 + (a2 - b2)^2 + ... + (an - bn)^2)
Here, a and b are rows of the dataset and n is the number of columns. One feature of the Euclidean distance is that columns whose scale is much different from the others can distort it. In our example, the values describing the magnesium content of each wine are ~100 times greater in magnitude than the features describing the alcohol content or ash percentage. If we were to calculate the distance between these datapoints, it would largely be determined by the magnesium concentration (as even small differences on this scale overwhelmingly determine the value of the distance calculation), rather than by any of the other properties. While this might sometimes be desirable, in most applications we do not favour one feature over another and want to give equal weight to all columns. To get a fair distance comparison between these points, we need to normalize the columns so that they fall into the same numerical range (have similar maxima and minima values). We can do so using the scale() function in scikit-learn:
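The book shows this step as a screenshot; the following is a minimal sketch of the same idea, assuming the file follows the UCI layout (the class label first, followed by the thirteen measurement columns). The shortened column names below are an assumption for readability, not the exact names parsed from the description file.
import pandas as pd
from sklearn.preprocessing import scale

# The raw file has no header row; the first column is the class label.
columns = ['class', 'alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash',
           'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols',
           'proanthocyanins', 'color_intensity', 'hue', 'od280_od315', 'proline']
wine = pd.read_csv('wine.data', header=None, names=columns)

# Scale the thirteen measurement columns to mean 0 and variance 1, then wrap
# the resulting NumPy array back in a DataFrame so that describe() can be used.
scaled = pd.DataFrame(scale(wine[columns[1:]]), columns=columns[1:])
print(scaled.describe())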
This normalization centers each column at 0 with variance 1, and in the case of normally distributed data this would produce a standard normal distribution. Also note that the scale() function returns a NumPy array, which is why we must wrap the output in a pandas DataFrame in order to use the pandas function describe(). Now that we've scaled the data, we can calculate the Euclidean distances between the points: We've now converted our dataset of 178 rows and 13 columns into a square matrix giving the distance between each pair of rows. In other words, row i, column j of this matrix represents the Euclidean distance between rows i and j of our dataset. This 'distance matrix' is the input we will use for the clustering algorithms in the following section.
If we just want to get a visual sense of how the points compare to each other, we could use multidimensional scaling (MDS) (Modern Multidimensional Scaling - Theory and Applications, Borg, I., Groenen, P., Springer Series in Statistics (1997); Nonmetric multidimensional scaling: a numerical method, Kruskal, J., Psychometrika, 29 (1964); and Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis, Kruskal, J., Psychometrika, 29 (1964)) to create a visualization. Multidimensional scaling attempts to find the set of lower-dimensional coordinates (here, two dimensions) that best represents the distances in the higher dimensions of a dataset (here, the pairwise Euclidean distances we calculated from the 13 dimensions). It does this by choosing coordinates x1, ..., xn that minimize the strain function:
Strain(x1, ..., xn) = (1 - (Sum_ij(d_ij * <x_i, x_j>))^2 / (Sum_ij(d_ij^2) * Sum_ij(<x_i, x_j>^2)))^(1/2)
where d_ij is the distance we've calculated between points i and j, and <x_i, x_j> is the dot product of their candidate coordinates. In other words, we find coordinates whose pairwise dot products best capture the variation in the pairwise distances. We can then plot the resulting coordinates, using the wine class to label points in the diagram. Note that the coordinates themselves have no interpretation (in fact, they could change each time we run the algorithm). Rather, it is the relative position of the points that we are interested in: Given that there are many ways we could have calculated the distance between datapoints, is the Euclidean distance a good choice here? Visually, based on the multidimensional scaling plot, we can see that there is separation between the classes based on the features we've used to calculate distance, so conceptually it appears to be a reasonable choice in this case. However, the decision also depends on what we are trying to compare; if we are interested in detecting wines with similar attributes in absolute values, then it is a good metric. However, what if we're not interested so much in the absolute composition of the wine, but in whether its variables follow similar trends among wines with different alcohol contents? In this case, we wouldn't be interested in the absolute difference in values, but rather in the correlation between the columns. This sort of comparison is common for time series, which we turn to next.
Correlations and time series
For time series data, we are often concerned with whether the patterns between series exhibit the same variation over time, rather than with their absolute differences in value. For example, if we were to compare stocks, we might want to identify groups of stocks whose prices move up and down in similar patterns over time. The absolute price is of less interest than this pattern of increase and decrease.
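Before diving into the stock example, here is a compact sketch of the pattern described above, which we will reuse below: build a square matrix of pairwise distances and project it to two dimensions with MDS. Note that scikit-learn's MDS minimizes a stress criterion rather than the exact strain formula quoted above, but it serves the same visual purpose; the scaled and wine variables are assumed to come from the earlier sketch.
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

# Pairwise Euclidean distances between all rows, as a square matrix.
distances = squareform(pdist(scaled.values, metric='euclidean'))

# Find 2-D coordinates whose pairwise relationships approximate the
# precomputed distances, then plot them coloured by wine class.
coords = MDS(n_components=2, dissimilarity='precomputed',
             random_state=0).fit_transform(distances)
plt.scatter(coords[:, 0], coords[:, 1], c=wine['class'])
plt.show()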
Given that there are many ways we could have calculated the distance between datapoints, is the Euclidean distance a good choice here? Visually, based on the multidimensional scaling plot, we can see that there is separation between the classes based on the features we've used to calculate distance, so conceptually it appears to be a reasonable choice in this case. However, the decision also depends on what we are trying to compare: if we are interested in detecting wines with similar attributes in absolute terms, then it is a good metric. But what if we're interested not so much in the absolute composition of a wine as in whether its variables follow similar trends among wines with different alcohol contents? In that case, we wouldn't be interested in the absolute differences in values, but rather in the correlation between the columns. This sort of comparison is common for time series, which we turn to next.

Correlations and time series

For time series data, we are often concerned with whether the patterns between series exhibit the same variation over time, rather than with their absolute differences in value. For example, if we were to compare stocks, we might want to identify groups of stocks whose prices move up and down in similar patterns over time. The absolute price is of less interest than this pattern of increase and decrease.

Let's look at an example of the Dow Jones Industrial Average over time (Brown, M. S., Pelosi, M., and Dirska, H. (2013), Dynamic-radius Species-conserving Genetic Algorithm for the Financial Forecasting of Dow Jones Index Stocks, Machine Learning and Data Mining in Pattern Recognition, 7988, 27-41). This data contains the prices of a set of 30 stocks over a period of six months. Because all of the numerical values (the prices) are on the same scale, we won't normalize this data as we did with the wine measurements.

We notice two things about this data. First, the weekly closing price (the variable we will use to calculate correlation) is represented as a string. Second, the date is not in the right format for plotting. We will process both columns to fix this, converting them to a float and a datetime object, respectively. With this transformation done, we can make a pivot table that places the weekly closing prices in columns and the individual stocks in rows. As we can see, we only need the columns from the second onward to calculate correlations between rows. Let's calculate the correlation between these time series of stock prices by selecting the second through last columns of the data frame, computing the pairwise correlation-based distance metric, and visualizing it using MDS, as before.

It is important to note that the Pearson coefficient, which we've calculated here, is a measure of linear correlation between these time series. In other words, it captures the linear increase (or decrease) of the trend in one price relative to another, but won't necessarily capture nonlinear trends. We can see this by looking at the formula for the Pearson correlation, which is given by:

P(a, b) = cov(a, b) / (sd(a) * sd(b)) = Sum((ai - mean(a)) * (bi - mean(b))) / (sqrt(Sum((ai - mean(a))^2)) * sqrt(Sum((bi - mean(b))^2)))

This value varies from 1 (highly correlated) to -1 (inversely correlated), with 0 representing no correlation (such as a cloud of points). You might recognize the numerator of this equation as the covariance, which is a measure of how much the two datasets, a and b, vary together. You can understand this by considering that the numerator is maximized when corresponding points in both datasets are above or below their mean values at the same time. However, whether this accurately captures the similarity in the data depends upon the scale. In data that is distributed at regular intervals between a maximum and minimum, with roughly the same difference between consecutive values (which is essentially how a trend line appears), it captures this pattern well. However, consider a case in which the data is exponentially distributed, with orders of magnitude between the minimum and maximum, and in which the differences between consecutive datapoints also vary widely. Here, the Pearson correlation would be numerically dominated by only the largest terms, which might or might not represent the overall similarity in the data. The same numerical sensitivity applies to the denominator, which is the product of the standard deviations of the two datasets. Thus, the value of the correlation is maximized when the variation in the two datasets is roughly explained by the product of their individual variations; there is no 'left over' variation between the datasets that is not explained by their respective standard deviations.
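A sketch of the preprocessing, pivoting, and correlation-distance steps described above might look as follows. The file name dow_jones_index.data and the column names stock, date, and close are assumptions based on the UCI Dow Jones Index dataset; this version places the tickers in the index (so every column is a price), which differs slightly from the layout described above, and it assumes no weeks are missing:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.manifold import MDS

    stocks = pd.read_csv('dow_jones_index.data')

    # The closing price is stored as a string such as '$15.82': strip the dollar
    # sign and cast to float, and parse the date column into datetime objects.
    stocks['close'] = stocks['close'].str.replace('$', '', regex=False).astype(float)
    stocks['date'] = pd.to_datetime(stocks['date'])

    # Pivot so that each row is one stock (ticker) and each column a week's close.
    prices = stocks.pivot(index='stock', columns='date', values='close')

    # Pearson correlation between every pair of price series, converted into a
    # distance: identical series get distance 0, perfectly anti-correlated get 2.
    pearson_dist = 1 - np.corrcoef(prices.values)

    # Visualize the stocks with MDS on the precomputed correlation distances.
    mds = MDS(n_components=2, dissimilarity='precomputed', random_state=0)
    coords = mds.fit_transform(pearson_dist)
    plt.scatter(coords[:, 0], coords[:, 1])
    for ticker, (x, y) in zip(prices.index, coords):
        plt.annotate(ticker, (x, y))
    plt.show()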
Looking at the first two stocks in this dataset, the assumption of linearity appears to be a valid one for comparing datapoints. In addition to verifying that these stocks have a roughly linear correlation, the command that produces this comparison introduces some new pandas functions you may find useful. The first is iloc, which allows you to select rows from a dataframe by position. The second is transpose, which swaps the rows and columns. Here, we select the first two rows, transpose them, and then select all rows (prices) after the first, since the first holds the ticker symbols.

Despite the trend we see in this example, we could imagine a nonlinear relationship between prices. In such cases, it might be better to measure not the linear correlation of the prices themselves, but whether the high prices for one stock coincide with the high prices for another. In other words, the ranking of market days by price should be the same, even if the prices are nonlinearly related. We can calculate this rank correlation, also known as Spearman's rho, using scipy, with the following formula:

Rho(a, b) = 1 - 6 * Sum(di^2) / (n * (n^2 - 1))

where n is the number of datapoints in each of the two sets a and b, and di is the difference in ranks between the pair of datapoints ai and bi. Because we compare only the ranks of the data, not their actual values, this measure can capture variations up and down between two datasets even when they vary over very different numerical ranges. Let's see whether plotting the results using the Spearman correlation metric produces any differences in the pairwise distances of the stocks.

The Spearman correlation distances, judging by the x and y axes, appear closer to each other, suggesting that from the perspective of rank correlation the time series are more similar. Though they differ in their assumptions about how the two compared datasets are distributed numerically, the Pearson and Spearman correlations share the requirement that the two sets be of the same length. This is usually a reasonable assumption, and it will be true of most of the examples we consider in this book. However, for cases where we wish to compare time series of unequal lengths, we can use Dynamic Time Warping (DTW). Conceptually, the idea of DTW is to warp one time series to align it with the other, by allowing gaps to be opened in either series so that the two end up the same length. What the algorithm needs to resolve is where the most similar regions of the two series are, so that gaps can be placed in the appropriate locations. In its simplest implementation, DTW consists of the following steps, which are illustrated in the sketch below:

1. For a dataset a of length n and a dataset b of length m, construct an n-by-m cost matrix (in practice bordered by an extra top row and leftmost column).
2. Set the top row and the leftmost column of this matrix to infinity, with the exception of the origin cell, which is set to zero.
3. For each point i in series a and each point j in series b, compare their similarity using a cost function. To this cost, add the minimum of the elements at (i-1, j-1), (i-1, j), and (i, j-1), that is, moving diagonally, up, or left. These conceptually represent the costs of aligning the two elements directly versus opening a gap in one of the series.

At the end of step 3, we will have traced the minimum-cost path that aligns the two series, and the DTW distance is given by the bottom-right corner of the matrix, (n, m). A drawback of this algorithm is that step 3 computes a value for every pair of elements of series a and b; for long time series or large datasets, this can be computationally prohibitive.
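To make the recurrence concrete, here is a minimal, self-contained implementation of the steps just listed. It uses the absolute difference as the cost function and is intended only to illustrate the algorithm, not for use on large datasets:

    import numpy as np

    def dtw_distance(a, b):
        """Dynamic Time Warping distance between two 1-D sequences a and b."""
        n, m = len(a), len(b)
        # Cost matrix with an extra top row and leftmost column set to infinity,
        # except the origin cell, which anchors the alignment at zero.
        cost = np.full((n + 1, m + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                # Local cost of aligning element i of a with element j of b.
                local = abs(a[i - 1] - b[j - 1])
                # Add the cheapest of: a direct match (diagonal), or opening a
                # gap in one of the two series (up or left).
                cost[i, j] = local + min(cost[i - 1, j - 1],
                                         cost[i - 1, j],
                                         cost[i, j - 1])
        # The bottom-right corner holds the DTW distance between the two series.
        return cost[n, m]

    # Example: two series of different lengths that follow a similar shape.
    series_a = np.array([1.0, 2.0, 3.0, 4.0, 3.0, 2.0])
    series_b = np.array([1.0, 1.5, 3.0, 4.0, 2.0])
    print(dtw_distance(series_a, series_b))

For two series of lengths n and m, this fills every cell of an (n+1) x (m+1) matrix, which is exactly the quadratic cost that the faster variants discussed next are designed to avoid.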
While a full discussion of algorithmic improvements is beyond the scope of our present examples, we refer interested readers to FastDTW (which we will use in our example) and SparseDTW as improvements that can be evaluated with many fewer calculations (Al-Naymat, G., Chawla, S., and Taheri, J. (2012), SparseDTW: A Novel Approach to Speed up Dynamic Time Warping; Salvador, S. and Chan, P. (2004), FastDTW: Toward Accurate Dynamic Time Warping in Linear Time and Space, KDD Workshop on Mining Temporal and Sequential Data, pages 70-80). We can use the FastDTW algorithm to compare the stocks data as well and plot the resulting coordinates. First, we compare each pair of stocks and record their DTW distance in a matrix. For computational efficiency (the distance between stocks i and j equals the distance between stocks j and i), we calculate only the upper triangle of this matrix and then add its transpose (that is, the lower triangle) to obtain the full distance matrix. Finally, we can use MDS again to plot the results.

Compared to the distribution of coordinates along the x and y axes for the Pearson correlation and the rank correlation, the DTW distances appear to span a wider range, picking up more nuanced differences between the time series of stock prices. Now that we've looked at numerical and time series data, as a last example let's examine calculating similarity in categorical datasets.

Summary

In this section, we learned how to identify groups of similar items in a dataset, an exploratory analysis that we might frequently use as a first step in deciphering new datasets. We explored different ways of calculating the similarity between datapoints and described the kinds of data to which these metrics might best apply. We examined both divisive clustering algorithms, which split the data into smaller components starting from a single group, and agglomerative methods, where every datapoint starts as its own cluster. Using a number of datasets, we showed examples where these algorithms perform better or worse, and some ways to optimize them. We also saw our first (small) data pipeline, a clustering application in PySpark using streaming data.

Resources for Article:

Further resources on this subject:

Python Data Structures [article]
Big Data Analytics [article]
Data Analytics [article]