Data | Tech News, Tutorials & Expert Insights

04 Feb 2015

28 min read

Working with Incanter Datasets

04 Feb 2015

0
0
3693

Packt

04 Mar 2015

20 min read

Writing Consumers

Packt

04 Mar 2015

20 min read

0
0
3687

article-image-how-to-maintain-apache-mesos

Vijin Boricha

13 Feb 2018

6 min read

How to maintain Apache Mesos

Vijin Boricha

13 Feb 2018

6 min read

[box type="note" align="" class="" width=""]This article is an excerpt from a book written by David Blomquist and Tomasz Janiszewski, titled Apache Mesos Cookbook. Throughout the course of the book, you will get to know tips and tricks along with best practices to follow when working with Mesos.[/box] In this article, we will learn about configuring logging options, setting up monitoring ecosystem, and upgrading your Mesos cluster. Logging and debugging Here we will configure logging options that will allow us to debug the state of Mesos. Getting ready We will assume Mesos is available on localhost port 5050. The steps provided here will work for either master or agents. How to do it... When Mesos is installed from pre-built packages, the logs are by default stored in /var/log/mesos/. When installing from a source build, storing logs is disabled by default. To change the log store location, we need to edit /etc/default/mesos and set the LOGS variable to the desired destination. For some reason, mesos-init-wrapper does not transfer the contents of /etc/mesos/log_dir to the --log_dir flag. That's why we need to set the log's destination in the environment variable. Remember that only Mesos logs will be stored there. Logs from third-party applications (for example, ZooKeeper) will still be sent to STDERR. Changing the default logging level can be done in one of two ways: by specifying the -- logging_level flag or by sending a request and changing the logging level at runtime for a specific period of time. For example, to change the logging level to INFO, just put it in the following code: /etc/mesos/logging_level echo INFO > /etc/mesos/logging_level The possible levels are INFO, WARNING, and ERROR. For example, to change the logging level to the most verbose for 15 minutes for debug purposes, we need to send the following request to the logging/toggle endpoint: curl -v -X POST localhost:5050/logging/toggle?level=3&duration=15mins How it works... Mesos uses the Google-glog library for debugging, but third-party dependencies such as ZooKeeper have their own logging solution. All configuration options are backed by glog and apply only to Mesos core code. Monitoring Now, we will set up monitoring for Mesos. Getting ready We must have a running monitoring ecosystem. Metrics storage could be a simple time- series database such as graphite, influxdb, or prometheus. In the following example, we are using graphite and our metrics are published with http://diamond.readthedocs.io/en/latest/. How to do it... Monitoring is enabled by default. Mesos does not provide any way to automatically push metrics to the registry. However, it exposes them as a JSON that can be periodically pulled and saved into the metrics registry: Install Diamond using following command: pip install diamond If additional packages are required to install them, run: sudo apt-get install python-pip python-dev build-essential. pip (Pip Installs Packages) is a Python package manager used to install software written in Python. Configure the metrics handler and interval. Open /etc/diamond/diamond.conf and ensure that there is a section for graphite configuration: [handler_graphite] class = handlers.GraphiteHandler host = <graphite.host> port = <graphite.port> Remember to replace graphite.host and graphite.port with real graphite details. Enable the default Mesos Collector. Create configuration files diamond-setup - C MesosCollector. Check whether the configuration has proper values and edit them if needed. The configuration can be found in /etc/diamond/collectors/MesosCollector.conf. On master, this file should look like this: enabled = True host = localhost port = 5050 While on agent, the port could be different (5051), as follows: enabled = True host = localhost port = 5051 How it works... Mesos exposes metrics via the HTTP API. Diamond is a small process that periodically pulls metrics, parses them, and sends them to the metrics registry, in this case, graphite. The default implementation of Mesos Collector does not store all the available metrics so it's recommended to write a custom handler that will collect all the interesting information. See also... Metrics could be read from the following endpoints: http://mesos.apache.org/documentation/latest/endpoints/metrics/snapshot/ http://mesos.apache.org/documentation/latest/endpoints/slave/monitor/statistics/ http://mesos.apache.org/documentation/latest/endpoints/slave/state/ Upgrading Mesos In this recipe, you will learn how to upgrade your Mesos cluster. How to do it... Mesos release cadence is at least one release per quarter. Minor releases are backward compatible, although there could be some small incompatibilities or the dropping of deprecated methods. The recommended method of upgrading is to apply all intermediate versions. For example, to upgrade from 0.27.2 to 1.0.0, we should apply 0.28.0, 0.28.1, 0.28.2, and finally 1.0.0. If the agent's configuration changes, clearing the metadata directory is required. You can do this with the following code: rm -rv {MESOS_DIR}/metadata Here, {MESOS_DIR} should be replaced with the configured Mesos directory. Rolling upgrades is the preferred method of upgrading clusters, starting with masters and then agents. To minimize the impact on running tasks, if an agent's configuration changes and it becomes inaccessible, then it should be switched to maintenance mode. How it works... Configuration changes may require clearing the metadata because the changes may not be backward compatible. For example, when an agent runs with different isolators, it shouldn't attach to the already running processes without this isolator. The Mesos architecture will guarantee that the executors that were not attached to the Mesos agent will commit suicide after a configurable amount of time (--executor_registration_timeout). Maintenance mode allows you to declare the time window during which the agent will be inaccessible. When this occurs, Mesos will send a reverse offer to all the frameworks to drain that particular agent. The frameworks are responsible for shutting down its task and spawning it on another agent. The Maintenance mode is applied, even if the framework does not implement the HTTP API or is explicitly declined. Using maintenance mode can prevent restarting tasks multiple times. Consider the following example with five agents and one task, X. We schedule the rolling upgrade of all the agents. Task X is deployed on agent 1. When it goes down, it's moved to 2, then to 3, and so on. This approach is extremely inefficient because the task is restarted five times, but it only needs to be restarted twice. Maintenance mode enables the framework to optimally schedule the task to run on agent 5 when 1 goes down, and then return to 1 when 5 goes down: Worst case scenario of rolling upgrade without maintenance mode legend optimal solution of rolling upgrade with maintenance mode. We have learnt about running and maintaining Mesos. To know more about managing containers and understanding the scheduler API you may check out this book, Apache Mesos Cookbook.

0
0
3686

Packt

02 May 2013

10 min read

Ten IPython essentials

Packt

02 May 2013

10 min read

(For more resources related to this topic, see here.) Running the IPython console If IPython has been installed correctly, you should be able to run it from a system shell with the ipython command. You can use this prompt like a regular Python interpreter as shown in the following screenshot: Command-line shell on Windows If you are on Windows and using the old cmd.exe shell, you should be aware that this tool is extremely limited. You could instead use a more powerful interpreter, such as Microsoft PowerShell, which is integrated by default in Windows 7 and 8. The simple fact that most common filesystem-related commands (namely, pwd, cd, ls, cp, ps, and so on) have the same name as in Unix should be a sufficient reason to switch. Of course, IPython offers much more than that. For example, IPython ships with tens of little commands that considerably improve productivity. Some of these commands help you get information about any Python function or object. For instance, have you ever had a doubt about how to use the super function to access parent methods in a derived class? Just type super? (a shortcut for the command %pinfo super) and you will find all the information regarding the super function. Appending ? or ?? to any command or variable gives you all the information you need about it, as shown here: In [1]: super? Typical use to call a cooperative superclass method: class C(B): def meth(self, arg): super(C, self).meth(arg) Using IPython as a system shell You can use the IPython command-line interface as an extended system shell. You can navigate throughout your ﬁlesystem and execute any system command. For instance, the standard Unix commands pwd, ls, and cd are available in IPython and work on Windows too, as shown in the following example: In [1]: pwd Out[1]: u'C:' In [2]: cd windows C:windows These commands are particular magic commands that are central in the IPython shell. There are dozens of magic commands and we will use a lot of them throughout this book. You can get a list of all magic commands with the %lsmagic command. Using the IPython magic commands Magic commands actually come with a % prefix, but the automagic system, enabled by default, allows you to conveniently omit this prefix. Using the prefix is always possible, particularly when the unprefixed command is shadowed by a Python variable with the same name. The %automagic command toggles the automagic system. In this book, we will generally use the % prefix to refer to magic commands, but keep in mind that you can omit it most of the time, if you prefer. Using the history Like the standard Python console, IPython offers a command history. However, unlike in Python's console, the IPython history spans your previous interactive sessions. In addition to this, several key strokes and commands allow you to reduce repetitive typing. In an IPython console prompt, use the up and down arrow keys to go through your whole input history. If you start typing before pressing the arrow keys, only the commands that match what you have typed so far will be shown. In any interactive session, your input and output history is kept in the In and Out variables and is indexed by a prompt number. The _, __, ___ and _i, _ii, _iii variables contain the last three output and input objects, respectively. The _n and _in variables return the nth output and input history. For instance, let's type the following command: In [4]: a = 12 In [5]: a ** 2 Out[5]: 144 In [6]: print("The result is {0:d}.".format(_)) The result is 144. In this example, we display the output, that is, 144 of prompt 5 on line 6. Tab completion Tab completion is incredibly useful and you will ﬁnd yourself using it all the time. Whenever you start typing any command, variable name, or function, press the Tab key to let IPython either automatically complete what you are typing if there is no ambiguity, or show you the list of possible commands or names that match what you have typed so far. It also works for directories and ﬁle paths, just like in the system shell. It is also particularly useful for dynamic object introspection. Type any Python object name followed by a point and then press the Tab key; IPython will show you the list of existing attributes and methods, as shown in the following example: In [1]: import os In [2]: os.path.split<tab> os.path.split os.path.splitdrive os.path.splitext os.path.splitunc In the second line, as shown in the previous code, we press the Tab key after having typed os.path.split. IPython then displays all the possible commands. Tab Completion and Private Variables Tab completion shows you all the attributes and methods of an object, except those that begin with an underscore (_). The reason is that it is a standard convention in Python programming to prefix private variables with an underscore. To force IPython to show all private attributes and methods, type myobject._ before pressing the Tab key. Nothing is really private or hidden in Python. It is part of a general Python philosophy, as expressed by the famous saying, "We are all consenting adults here." Executing a script with the %run command Although essential, the interactive console becomes limited when running sequences of multiple commands. Writing multiple commands in a Python script with the .py file extension (by convention) is quite common. A Python script can be executed from within the IPython console with the %run magic command followed by the script filename. The script is executed in a fresh, new Python namespace unless the -i option has been used, in which case the current interactive Python namespace is used for the execution. In all cases, all variables defined in the script become available in the console at the end of script execution. Let's write the following Python script in a file called script.py: print("Running script.") x = 12 print("'x' is now equal to {0:d}.".format(x)) Now, assuming we are in the directory where this file is located, we can execute it in IPython by entering the following command: In [1]: %run script.py Running script. 'x' is now equal to 12. In [2]: x Out[2]: 12 When running the script, the standard output of the console displays any print statement. At the end of execution, the x variable defined in the script is then included in the interactive namespace, which is quite convenient. Quick benchmarking with the %timeit command You can do quick benchmarks in an interactive session with the %timeit magic command. It lets you estimate how much time the execution of a single command takes. The same command is executed multiple times within a loop, and this loop itself is repeated several times by default. The individual execution time of the command is then automatically estimated with an average. The -n option controls the number of executions in a loop, whereas the -r option controls the number of executed loops. For example, let's type the following command: In[1]: %timeit [x*x for x in range(100000)] 10 loops, best of 3: 26.1 ms per loop Here, it took about 26 milliseconds to compute the squares of all integers up to 100000. Quick debugging with the %debug command IPython ships with a powerful command-line debugger. Whenever an exception is raised in the console, use the %debug magic command to launch the debugger at the exception point. You then have access to all the local variables and to the full stack traceback in postmortem mode. Navigate up and down through the stack with the u and d commands and exit the debugger with the q command. See the list of all the available commands in the debugger by entering the ? command. You can use the %pdb magic command to activate the automatic execution of the IPython debugger as soon as an exception is raised. Interactive computing with Pylab The %pylab magic command enables the scientific computing capabilities of the NumPy and matplotlib packages, namely efficient operations on vectors and matrices and plotting and interactive visualization features. It becomes possible to perform interactive computations in the console and plot graphs dynamically. For example, let's enter the following command: In [1]: %pylab Welcome to pylab, a matplotlib-based Python environment [backend: TkAgg]. For more information, type 'help(pylab)'. In [2]: x = linspace(-10., 10., 1000) In [3]: plot(x, sin(x)) In this example, we first define a vector of 1000 values linearly spaced between -10 and 10. Then we plot the graph (x, sin(x)). A window with a plot appears as shown in the following screenshot, and the console is not blocked while this window is opened. This allows us to interactively modify the plot while it is open. Using the IPython Notebook The Notebook brings the functionality of IPython into the browser for multiline textediting features, interactive session reproducibility, and so on. It is a modern and powerful way of using Python in an interactive and reproducible way To use the Notebook, call the ipython notebook command in a shell (make sure you have installed the required dependencies). This will launch a local web server on the default port 8888. Go to http://127.0.0.1:8888/ in a browser and create a new Notebook. You can write one or several lines of code in the input cells. Here are some of the most useful keyboard shortcuts: Press the Enter key to create a new line in the cell and not execute the cell Press Shift + Enter to execute the cell and go to the next cell Press Alt + Enter to execute the cell and append a new empty cell right after it Press Ctrl + Enter for quick instant experiments when you do not want to save the output Press Ctrl + M and then the H key to display the list of all the keyboard shortcuts Customizing IPython You can save your user preferences in a Python ﬁle; this ﬁle is called an IPython proﬁle. To create a default proﬁle, type ipython profile create in a shell. This will create a folder named profile_default in the ~/.ipython or ~/.config/ ipython directory. The ﬁle ipython_config.py in this folder contains preferences about IPython. You can create different proﬁles with different names using ipython profile create profilename, and then launch IPython with ipython --profile=profilename to use that proﬁle. The ~ directory is your home directory, for example, something like /home/ yourname on Unix, or C:Usersyourname or C:Documents and Settings yourname on Windows. Summary We have gone through 10 of the most interesting features offered by IPython in this article. They essentially concern the Python and shell interactive features, including the integrated debugger and proﬁler, and the interactive computing and visualization features brought by the NumPy and Matplotlib packages. Resources for Article : Further resources on this subject: Advanced Matplotlib: Part 1 [Article] Python Testing: Installing the Robot Framework [Article] Running a simple game using Pygame [Article]

0
0
3681

Packt

26 Aug 2014

7 min read

Stream Grouping

Packt

26 Aug 2014

7 min read

In this article, by Ankit Jain and Anand Nalya, the authors of the book Learning Storm, we will cover different types of stream groupings. (For more resources related to this topic, see here.) When defining a topology, we create a graph of computation with a number of bolt-processing streams. At a more granular level, each bolt executes as multiple tasks in the topology. A stream will be partitioned into a number of partitions and divided among the bolts' tasks. Thus, each task of a particular bolt will only get a subset of the tuples from the subscribed streams. Stream grouping in Storm provides complete control over how this partitioning of tuples happens among many tasks of a bolt subscribed to a stream. Grouping for a bolt can be defined on the instance of the backtype.storm.topology.InputDeclarer class returned when defining bolts using the backtype.storm.topology.TopologyBuilder.setBolt method. Storm supports the following types of stream groupings: Shuffle grouping Fields grouping All grouping Global grouping Direct grouping Local or shuffle grouping Custom grouping Now, we will look at each of these groupings in detail. Shuffle grouping Shuffle grouping distributes tuples in a uniform, random way across the tasks. An equal number of tuples will be processed by each task. This grouping is ideal when you want to distribute your processing load uniformly across the tasks and where there is no requirement of any data-driven partitioning. Fields grouping Fields grouping enables you to partition a stream on the basis of some of the fields in the tuples. For example, if you want that all the tweets from a particular user should go to a single task, then you can partition the tweet stream using fields grouping on the username field in the following manner: builder.setSpout("1", new TweetSpout()); builder.setBolt("2", new TweetCounter()).fieldsGrouping("1", new Fields("username")) Fields grouping is calculated with the following function: hash (fields) % (no. of tasks) Here, hash is a hashing function. It does not guarantee that each task will get tuples to process. For example, if you have applied fields grouping on a field, say X, with only two possible values, A and B, and created two tasks for the bolt, then it might be possible that both hash (A) % 2 and hash (B) % 2 are equal, which will result in all the tuples being routed to a single task and other tasks being completely idle. Another common usage of fields grouping is to join streams. Since partitioning happens solely on the basis of field values and not the stream type, we can join two streams with any common join fields. The name of the fields do not need to be the same. For example, in order to process domains, we can join the Order and ItemScanned streams when an order is completed: builder.setSpout("1", new OrderSpout()); builder.setSpout("2", new ItemScannedSpout()); builder.setBolt("joiner", new OrderJoiner()) .fieldsGrouping("1", new Fields("orderId")) .fieldsGrouping("2", new Fields("orderRefId")); All grouping All grouping is a special grouping that does not partition the tuples but replicates them to all the tasks, that is, each tuple will be sent to each of the bolt's tasks for processing. One common use case of all grouping is for sending signals to bolts. For example, if you are doing some kind of filtering on the streams, then you have to pass the filter parameters to all the bolts. This can be achieved by sending those parameters over a stream that is subscribed by all bolts' tasks with all grouping. Another example is to send a reset message to all the tasks in an aggregation bolt. The following is an example of all grouping: builder.setSpout("1", new TweetSpout()); builder.setSpout("signals", new SignalSpout()); builder.setBolt("2", new TweetCounter()).fieldsGrouping("1", new Fields("username")).allGrouping("signals"); Here, we are subscribing signals for all the TweetCounter bolt's tasks. Now, we can send different signals to the TweetCounter bolt using SignalSpout. Global grouping Global grouping does not partition the stream but sends the complete stream to the bolt's task with the smallest ID. A general use case of this is when there needs to be a reduce phase in your topology where you want to combine results from previous steps in the topology in a single bolt. Global grouping might seem redundant at first, as you can achieve the same results with defining the parallelism for the bolt as one and setting the number of input streams to one. Though, when you have multiple streams of data coming through different paths, you might want only one of the streams to be reduced and others to be processed in parallel. For example, consider the following topology. In this topology, you might want to route all the tuples coming from Bolt C to a single Bolt D task, while you might still want parallelism for tuples coming from Bolt E to Bolt D. Global grouping This can be achieved with the following code snippet: builder.setSpout("a", new SpoutA()); builder.setSpout("b", new SpoutB()); builder.setBolt("c", new BoltC()); builder.setBolt("e", new BoltE()); builder.setBolt("d", new BoltD()) .globalGrouping("c") .shuffleGrouping("e"); Direct grouping In direct grouping, the emitter decides where each tuple will go for processing. For example, say we have a log stream and we want to process each log entry using a specific bolt task on the basis of the type of resource. In this case, we can use direct grouping. Direct grouping can only be used with direct streams. To declare a stream as a direct stream, use the backtype.storm.topology.OutputFieldsDeclarer.declareStream method that takes a Boolean parameter directly in the following way in your spout: @Override public void declareOutputFields(OutputFieldsDeclarer declarer) { declarer.declareStream("directStream", true, new Fields("field1")); } Now, we need the number of tasks for the component so that we can specify the taskId parameter while emitting the tuple. This can be done using the backtype.storm.task.TopologyContext.getComponentTasks method in the prepare method of the bolt. The following snippet stores the number of tasks in a bolt field: public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) { this.numOfTasks = context.getComponentTasks("my-stream"); this.collector = collector; } Once you have a direct stream to emit to, use the backtype.storm.task.OutputCollector.emitDirect method instead of the emit method to emit it. The emitDirect method takes a taskId parameter to specify the task. In the following snippet, we are emitting to one of the tasks randomly: public void execute(Tuple input) { collector.emitDirect(new Random().nextInt(this.numOfTasks), process(input)); } Local or shuffle grouping If the tuple source and target bolt tasks are running in the same worker, using this grouping will act as a shuffle grouping only between the target tasks running on the same worker, thus minimizing any network hops resulting in increased performance. In case there are no target bolt tasks running on the source worker process, this grouping will act similar to the shuffle grouping mentioned earlier. Custom grouping If none of the preceding groupings fit your use case, you can define your own custom grouping by implementing the backtype.storm.grouping.CustomStreamGrouping interface. The following is a sample custom grouping that partitions a stream on the basis of the category in the tuples: public class CategoryGrouping implements CustomStreamGrouping, Serializable { // Mapping of category to integer values for grouping private static final Map<String, Integer> categories = ImmutableMap.of ( "Financial", 0, "Medical", 1, "FMCG", 2, "Electronics", 3 ); // number of tasks, this is initialized in prepare method private int tasks = 0; public void prepare(WorkerTopologyContext context, GlobalStreamId stream, List<Integer> targetTasks) { // initialize the number of tasks tasks = targetTasks.size(); } public List<Integer> chooseTasks(int taskId, List<Object> values) { // return the taskId for a given category String category = (String) values.get(0); return ImmutableList.of(categories.get(category) % tasks); } } Now, we can use this grouping in our topologies with the following code snippet: builder.setSpout("a", new SpoutA()); builder.setBolt("b", (IRichBolt)new BoltB()) .customGrouping("a", new CategoryGrouping()); The following diagram represents the Storm groupings graphically: Summary In this article, we discussed stream grouping in Storm and its types. Resources for Article: Further resources on this subject: Integrating Storm and Hadoop [article] Deploying Storm on Hadoop for Advertising Analysis [article] Photo Stream with iCloud [article]

0
0
3647

Packt

26 Oct 2015

5 min read

Making 3D Visualizations

Packt

26 Oct 2015

5 min read

Python has become the preferred language of data scientists for data analysis, visualization, and machine learning. It features numerical and mathematical toolkits such as: Numpy, Scipy, Sci-kit learn, Matplotlib and Pandas, as well as a R-like environment with IPython, all used for data analysis, visualization and machine learning. In this article by Dimitry Foures and Giuseppe Vettigli, authors of the book Python Data Visualization Cookbook, Second Edition, we will see how visualization in 3D is sometimes effective and sometimes inevitable. In this article, you will learn the how 3D bars are created. (For more resources related to this topic, see here.) Creating 3D bars Although matplotlib is mainly focused on plotting and 2D, there are different extensions that enable us to plot over geographical maps, to integrate more with Excel, and plot in 3D. These extensions are called toolkits in matplotlib world. A toolkit is a collection of specific functions that focuses on one topic, such as plotting in 3D. Popular toolkits are Basemap, GTK Tools, Excel Tools, Natgrid, AxesGrid, and mplot3d. We will explore more of mplot3d in this recipe. The mpl_toolkits.mplot3d toolkit provides some basic 3D plotting. Plots supported are scatter, surf, line, and mesh. Although this is not the best 3D plotting library, it comes with matplotlib, and we are already familiar with this interface. Getting ready Basically, we still need to create a figure and add desired axes to it. Difference is that we specify 3D projection for the figure, and the axes we add is Axes3D. Now, we can almost use the same functions for plotting. Of course, the difference is the arguments passed. For we now have three axes, which we need to provide data for. For example, the mpl_toolkits.mplot3d.Axes3D.plot function specifies the xs, ys, zs, and zdir arguments. All others are transferred directly to matplotlib.axes.Axes.plot. We will explain these specific arguments: xs,ys: These are coordinates for X and Y axis zs: These are value(s) for Z axis. Can be one for all points, or one for each point zdir: These values choose what will be the z-axis dimension (usually this is zs, but can be xs, or ys) There is a rotate_axes method in module mpl_toolkits.mplot3d.art3d that contains 3D artist code and functions to convert 2D artists into 3D versions, which can be added to an Axes3D to reorder coordinates so that the axes are rotated with zdir along. The default value is z. Prepending the axis with a '-' does the inverse transform, so zdir can be x, -x, y, -y, z, or -z. How to do it... This is the code to demonstrate the plotting concept explained in the preceding section: import random import numpy as np import matplotlib as mpl import matplotlib.pyplot as plt import matplotlib.dates as mdates from mpl_toolkits.mplot3d import Axes3D mpl.rcParams['font.size'] = 10 fig = plt.figure() ax = fig.add_subplot(111, projection='3d') for z in [2011, 2012, 2013, 2014]: xs = xrange(1,13) ys = 1000 * np.random.rand(12) color = plt.cm.Set2(random.choice(xrange(plt.cm.Set2.N))) ax.bar(xs, ys, zs=z, zdir='y', color=color, alpha=0.8) ax.xaxis.set_major_locator(mpl.ticker.FixedLocator(xs)) ax.yaxis.set_major_locator(mpl.ticker.FixedLocator(ys)) ax.set_xlabel('Month') ax.set_ylabel('Year') ax.set_zlabel('Sales Net [usd]') plt.show() This code produces the following figure: How it works... We had to do the same prep work as in 2D world. Difference here is that we needed to specify what "kind of backend." Then, we generate random data for supposed 4 years of sale (2011–2014). We needed to specify Z values to be the same for the 3D axis. The color we picked randomly from the color map set, and then we associated each Z order collection of xs, ys pairs we would render the bar series. There's more... Other plotting from 2D matplotlib are available here. For example, scatter() has a similar interface to plot(), but with added size of the point marker. We are also familiar with contour, contourf, and bar. New types that are available only in 3D are wireframe, surface, and tri-surface plots. For example, this code example, plots tri-surface plot of popular Pringle functions or, more mathematically, hyperbolic paraboloid: from mpl_toolkits.mplot3d import Axes3D from matplotlib import cm import matplotlib.pyplot as plt import numpy as np n_angles = 36 n_radii = 8 # An array of radii # Does not include radius r=0, this is to eliminate duplicate points radii = np.linspace(0.125, 1.0, n_radii) # An array of angles angles = np.linspace(0, 2*np.pi, n_angles, endpoint=False) # Repeat all angles for each radius angles = np.repeat(angles[...,np.newaxis], n_radii, axis=1) # Convert polar (radii, angles) coords to cartesian (x, y) coords # (0, 0) is added here. There are no duplicate points in the (x, y) plane x = np.append(0, (radii*np.cos(angles)).flatten()) y = np.append(0, (radii*np.sin(angles)).flatten()) # Pringle surface z = np.sin(-x*y) fig = plt.figure() ax = fig.gca(projection='3d') ax.plot_trisurf(x, y, z, cmap=cm.jet, linewidth=0.2) plt.show() The code will give the following output: Summary Python Data Visualization Cookbook, Second Edition, is for developers that already know about Python programming in general. If you have heard about data visualization but you don't know where to start, then the book will guide you from the start and help you understand data, data formats, data visualization, and how to use Python to visualize data. Many more visualization techniques have been illustrated in a step-by-step recipe-based approach to data visualization in the book. The topics are explained sequentially as cookbook recipes consisting of a code snippet and the resulting visualization. Resources for Article: Further resources on this subject: Basics of Jupyter Notebook and Python [article] Asynchronous Programming with Python [article] Introduction to Data Analysis and Libraries [article]

0
0
3644

Packt

04 Jul 2016

16 min read

Data Science with R

Packt

04 Jul 2016

16 min read

0
0
3621

Packt

25 Oct 2013

28 min read

About Cassandra

Packt

25 Oct 2013

28 min read

(For more resources related to this topic, see here.) So, if Cassandra is so good at everything, why not everyone drop whatever database they are using and jump start with Cassandra? This is a natural question. Some applications require strong ACID compliance, such as a booking system. If you are a person who goes by statistics, you'd ask how Cassandra fares with other existing data stores. TilmannRabl et al in their paper, Solving Big Data Challenges for Enterprise Application Performance Management (http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf), told that, "In terms of scalability, there is a clear winner throughout our experiments. Cassandra achieves the highest throughput for the maximum number of nodes in all experiments with a linear increasing throughput from one to 12 nodes. This comes at the price of a high write and read latency. Cassandra's performance is best for high insertion rates." If you go through the paper, Cassandra wins in almost all the criteria. Equipped with proven concepts of distributed computing, made to reliably serve from commodity servers, and simple and easy maintenance, Cassandra is one of the most scalable, fastest, and very robust NoSQL database. So, the next natural question is what makes Cassandra so blazing fast? Let us dive deeper into the Cassandra architecture. Cassandra architecture Cassandra is a relative latecomer in the distributed data-store war. It takes advantage of two proven and closely similar data-store mechanisms, namely Google BigTable, a distributed storage system for structured data, 2006 (http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/bigtable-osdi06.pdf [2006]), and Amazon Dynamo, Amazon's highly available key-value store, 2007 (http://www.read.seas.harvard.edu/~kohler/class/cs239-w08/decandia07dynamo.pdf [2007]). Figure 2.3: Read throughputs shows linear scaling of Cassandra Like BigTable, it has tabular data presentation. It is not tabular in the strictest sense. It is rather a dictionary-like structure where each entry holds another sorted dictionary/map. This model is more powerful than the usual key-value store and it is named as column family. The properties such as Eventual Consistency and decentralization are taken from Dynamo. For now, assume a column family as a giant spreadsheet, such as MS Excel. But unlike spreadsheets, each row is identified by a row key with a number (token), and unlike spreadsheets, each cell can have its own unique name within the row. The columns in the rows are sorted by this unique column name. Also, since the number of rows is allowed to be very large (1.7*(10)^38), we distribute the rows uniformly across all the available machines by dividing the rows in equal token groups. These rows create a Keyspace. Keyspace is a set of all the tokens (row IDs). Ring representation Cassandra cluster is denoted as a ring. The idea behind this representation is to show token distribution. Let's take an example. Assume that there is a partitioner that generates tokens from zero to 127 and you have four Cassandra machines to create a cluster. To allocate equal load, we need to assign each of the four nodes to bear an equal number of tokens. So, the first machine will be responsible for tokens one to 32, the second will hold 33 to 64, the third, 65 to 96, and the fourth, 97 to 127 and 0. If you mark each node with the maximum token number that it can hold, the cluster looks like a ring. (Figure 2.22) Partitioner is the hash function that determines the range of possible row keys. Cassandra uses a partitioner to calculate the token equivalent to a row key (row ID). Figure 2.4: Token ownership and distribution in a balanced Cassandra ring When you start to configure Cassandra, one thing that you may want to set is the maximum token number that a particular machine could hold. This property can be set in the Cassandra.yaml file as initial_token. One thing that may confuse a beginner is that the value of the initial token is what the last token owns. Be aware that nodes can be rebalanced and these tokens can be changed as the new nodes join or old nodes get discarded. This is the initial token because this is just the initial value, and it may be changed later. How Cassandra works Diving into various components of Cassandra without having a context is really a frustrating experience. It does not makes sense why you are studying SSTable, MemTable, and Log Structured Merge (LSM) tree without being able to see how they fit into functionality and performance guarantees that Cassandra gives. So, first, we will see Cassandra's write and read mechanism. It is possible that some of the terms that we encounter during this discussion may not be immediately understandable. A rough overview of the Cassandra components is as shown in the following figure: Figure 2.5: Main components of the Cassandra service The main class of Storage Layer is StorageProxy. It handles all the requests. Messaging Layer is responsible for internode communications like gossip. Apart from this, process-level structures keep a rough idea about the actual data containers and where they live. There are four data buckets that you need to know. MemTable is a hash table-like structure that stays in memory. It contains actual column data. SSTable is the disk version of MemTables. When MemTables are full, SSTables are persisted to the hard disk. Bloom filters is a probabilistic data structure that lives in memory. It helps Cassandra to quickly detect which SSTable does not have the requested data. CommitLog is the usual commit log that contains all the mutations that are to be applied. It lives on the disk and helps to replay uncommitted changes. With this primer, we can start looking into how write and read works in Cassandra. We will see more explanation later. Write in action To write, clients need to connect to any of the Cassandra nodes and send a write request. This node is called as the coordinator node. When a node in Cassandra cluster receives a write request, it delegates it to a service called StorageProxy. This node may or may not be the right place to write the data to. The task of StorageProxy is to get the nodes (all the replicas) that are responsible to hold the data that is going to be written. It utilizes a replication strategy to do that. Once the replica nodes are identified, it sends the RowMutation message to them, the node waits for replies from these nodes, but it does not wait for all the replies to come. It only waits for as many responses as are enough to satisfy the client's minimum number of successful writes defined by ConsistencyLevel. So, the following figure and steps after that show all that can happen during a write mechanism: Figure 2.6: A simplistic representation of the write mechanism. The figure on the left represents the node-local activities on receipt of the write request If FailureDetector detects that there aren't enough live nodes to satisfy ConsistencyLevel, the request fails. If FailureDetector gives a green signal, but writes time-out after the request is sent due to infrastructure problems or due to extreme load, StorageProxy writes a local hint to replay when the failed nodes come back to life. This is called hinted handoff. One might think that hinted handoff may be responsible for Cassandra's eventual consistency. But it's not entirely true. If the coordinator node gets shut down or dies due to hardware failure and hints on this machine cannot be forwarded, eventual consistency will not occur. The Anti-entropy mechanism is responsible for consistency rather than hinted handoff. Anti-entropy makes sure that all replicas are in sync. If the replica nodes are distributed across datacenters, it will be a bad idea to send individual messages to all the replicas in other datacenters. It rather sends the message to one replica in each datacenter with a header instructing it to forward the request to other replica nodes in that datacenter. Now, the data is received by the node that should actually store that data. The data first gets appended to CommitLog, and pushed to a MemTable for the appropriate column family in the memory. When MemTable gets full, it gets flushed to the disk in a sorted structure named SSTable. With lots of flushes, the disk gets plenty of SSTables. To manage SSTables, a compaction process runs. This process merges data from smaller SSTables to one big sorted file. Read in action Similar to a write case, when StorageProxy of the node that a client is connected to gets the request, it gets a list of nodes containing this key based on Replication Strategy. StorageProxy then sorts the nodes based on their proximity to itself. The proximity is determined by the Snitch function that is set up for this cluster. Basically, there are the following types of Snitch: SimpleSnitch: A closer node is the one that comes first when moving clockwise in the ring. (A ring is when all the machines in the cluster are placed in a circular fashion with each having a token number. When you walk clockwise, the token value increases. At the end, it snaps back to the first node.) AbstractNetworkTopologySnitch: Implementation of the Snitch function works like this: nodes on the same rack are closest. The nodes in the same datacenter but in different rack are closer than those in other datacenters, but farther than the nodes in the same rack. Nodes in different datacenters are the farthest. To a node, the nearest node will be the one on the same rack. If there is no node on the same rack, the nearest node will be the one that lives in the same datacenter, but on a different rack. If there is no node in the datacenter, any nearest neighbor will be the one in the other datacenter. DynamicSnitch: This Snitch determines closeness based on recent performance delivered by a node. So, a quick-responding node is perceived closer than a slower one, irrespective of their location closeness or closeness in the ring. This is done to avoid overloading a slow-performing node. Now that we have the list of nodes that have desired row keys, it's time to pull data from them. The coordinator node (the one that the client is connected to) sends a command to the closest node to perform read (we'll discuss local read in a minute) and return the data. Now, based on ConsistencyLevel, other nodes will send a command to perform a read operation and send just the digest of the result. If we have Read Repair (discussed later) enabled, the remaining replica nodes will be sent a message to compute the digest of the command response. Let's take an example: say you have five nodes containing a row key K (that is, replication factor (RF) equals 5). Your read ConsistencyLevel is three. Then the closest of the five nodes will be asked for the data. And the second and third closest nodes will be asked to return the digest. We still have two left to be queried. If read Repair is not enabled, they will not be touched for this request. Otherwise, these two will be asked to compute digest. The request to the last two nodes is done in the background, after returning the result. This updates all the nodes with the most recent value, making all replicas consistent. So, basically, in all scenarios, you will have a maximum one wrong response. But with correct read and write consistency levels, we can guarantee an up-to-date response all the time. Let's see what goes within a node. Take a simple case of a read request looking for a single column within a single row. First, the attempt is made to read from MemTable, which is rapid-fast since there exists only one copy of data. This is the fastest retrieval. If the data is not found there, Cassandra looks into SSTable. Now, remember from our earlier discussion that we flush MemTables to disk as SSTables and later when compaction mechanism wakes up, it merges those SSTables. So, our data can be in multiple SSTables. Figure 2.7: A simplified representation of the read mechanism. The bottom image shows processing on the read node. Numbers in circles shows the order of the event. BF stands for Bloom Filter Each SSTable is associated with its Bloom Filter built on the row keys in the SSTable. Bloom Filters are kept in memory, and used to detect if an SSTable may contain (false positive) the row data. Now, we have the SSTables that may contain the row key. The SSTables get sorted in reverse chronological order (latest first). Apart from Bloom Filter for row keys, there exists one Bloom Filter for each row in the SSTable. This secondary Bloom Filter is created to detect whether the requested column names exist in the SSTable. Now, Cassandra will take SSTables one by one from younger to older. And use the index file to locate the offset for each column value for that row key and the Bloom filter associated with the row (built on the column name). On Bloom filter being positive for the requested column, it looks into the SSTable file to read the column value. Note that we may have a column value in other yet-to-be-read SSTables, but that does not matter, because we are reading the most recent SSTables first, and any value that was written earlier to it does not matter. So, the value gets returned as soon as the first column in the most recent SSTable is allocated. Components of Cassandra We have gone through how read and write takes place in highly distributed Cassandra clusters. It's time to look into individual components of it a little deeper. Messaging service Messaging service is the mechanism that manages internode socket communication in a ring. Communications, for example, gossip, read, read digest, write, and so on, processed via a messaging service, can be assumed as a gateway messaging server running at each node. To communicate, each node creates two socket connections per node. This implies that if you have 101 nodes, there will be 200 open sockets on each node to handle communication with other nodes. The messages contain a verb handler within them that basically tells the receiving node a couple of things: how to deserialize the payload message and what handler to execute for this particular message. The execution is done by the verb handlers (sort of an event handler). The singleton that orchestrates the messaging service mechanism is org.apache.cassandra.net.MessagingService. Gossip Cassandra uses the gossip protocol for internode communication. As the name suggests, the protocol spreads information in the same way an office rumor does. It can also be compared to a virus spread. There is no central broadcaster, but the information (virus) gets transferred to the whole population. It's a way for nodes to build the global map of the system with a small number of local interactions. Cassandra uses gossip to find out the state and location of other nodes in the ring (cluster). The gossip process runs every second and exchanges information with at the most three other nodes in the cluster. Nodes exchange information about themselves and other nodes that they come to know about via some other gossip session. This causes all the nodes to eventually know about all the other nodes. Like everything else in Cassandra, gossip messages have a version number associated with it. So, whenever two nodes gossip, the older information about a node gets overwritten with a newer one. Cassandra uses an Anti-entropy version of gossip protocol that utilizes Merkle trees (discussed later) to repair unread data. Implementation-wise the gossip task is handled by the org.apache.cassandra.gms.Gossiper class. Gossiper maintains a list of live and dead endpoints (the unreachable endpoints). At every one-second interval, this module starts a gossip round with a randomly chosen node. A full round of gossip consists of three messages. A node X sends a syn message to a node Y to initiate gossip. Y, on receipt of this syn message, sends an ack message back to X. To reply to this ack message, X sends an ack2 message to Y completing a full message round. Figure 2.8: Two nodes gossiping Failure detection Failure detection is one of the fundamental features of any robust and distributed system. A good failure detection mechanism implementation makes a fault-tolerant system such as Cassandra. The failure detector that Cassandra uses is a variation of The ϕ accrual failure detector (2004) by Xavier Défago et al. (The phi accrual detector research paper is available at http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.106.3350.) The idea behind a FailureDetector is to detect a communication failure and take appropriate actions based on the state of the remote node. Unlike traditional failure detectors, phi accrual failure detector does not emit a Boolean alive or dead (true or false, trust or suspect) value. Instead, it gives a continuous value to the application and the application is left to decide the level of severity and act accordingly. This continuous suspect value is called phi (ϕ). Partitioner Cassandra is a distributed database management system. This means it takes a single logical database and distributes it over one or more machines in the database cluster. So, when you insert some data in Cassandra, it assigns each row to a row key; and based on that row key, Cassandra assigns that row to one of the nodes that's responsible for managing it. Let's try to understand this. Cassandra inherits the data model from Google's BigTable (BigTable research paper can be found at http://research.google.com/archive/bigtable.html.). This means we can roughly assume that the data is stored in some sort of a table that has an unlimited number of columns (not unlimited, Cassandra limits the maximum number of columns to be two billion) with rows binded with a unique key, namely, row key. Now, your terabytes of data on one machine will be restrictive from multiple points of views. One is disk space, another being limited parallel processing, and if not duplicated, a source of single point of failure. What Cassandra does is, it defines some rules to slice data across rows and assigns which node in the cluster is responsible for holding which slice. This task is done by a partitioner. There are several types of partitioners to choose from. In short, Cassandra (as of Version 1.2) offers three partitioners as follows: RandomPartitioner: It uses MD5 hashing to distribute data across the cluster. Cassandra 1.1.x and precursors have this as the default partitioner. Murmur3Partitioner: It uses Murmur hash to distribute the data. It performs better than RandomPartitioner. It is the default partitioner from Cassandra Version 1.2 onwards. ByteOrderPartitioner: Keeps keys distributed across the cluster by key bytes. This is an ordered distribution, so the rows are stored in lexical order. This distribution is commonly discouraged because it may cause a hotspot. Replication Cassandra runs on commodity hardware, and works reliably in network partitions. However, this comes with a cost: replication. To avoid data inaccessibility in case a node goes down or becomes unavailable, one must replicate data to more than one node. Replication brings features such as fault tolerance and no single point of failure to the system. Cassandra provides more than one strategy to replicate the data, and one can configure the replication factor while creating keyspace. Log Structured Merge tree Cassandra (also, HBase) is heavily influenced by Log Structured Merge (LSM). It uses an LSM tree-like mechanism to store data on a disk. The writes are sequential (in append fashion) and the data storage is contiguous. This makes writes in Cassandra superfast, because there is no seek involved. Contrast this with an RBDMS system that is based on the B+ Tree (http://en.wikipedia.org/wiki/B%2B_tree) implementation. LSM tree advocates the following mechanism to store data: note down the arriving modification into a log file (CommitLog), push the modification/new data into memory (MemTable) for faster lookup, and when the system has gathered enough updates in memory or after a certain threshold time, flush this data to a disk in a structured store file (SSTable). The logs corresponding to the updates that are flushed can now be discarded. Figure 2.12: Log Structured Merge (LSM) Trees CommitLog One of the promises that Cassandra makes to the end users is durability. In conventional terms (or in ACID terminology), durability guarantees that a successful transaction (write, update) will survive permanently. This means once Cassandra says write successful that means the data is persisted and will survive system failures. It is done the same way as in any DBMS that guarantees durability: by writing the replayable information to a file before responding to a successful write. This log is called the CommitLog in the Cassandra realm. This is what happens. Any write to a node gets tracked by org.apache.cassandra.db.commitlog.CommitLog, which writes the data with certain metadata into the CommitLog file in such a manner that replaying this will recreate the data. The purpose of this exercise is to ensure there is no data loss. If due to some reason the data could not make it into MemTable or SSTable, the system can replay the CommitLog to recreate the data. MemTable MemTable is an in-memory representation of column family. It can be thought of as a cached data. MemTable is sorted by key. Data in MemTable is sorted by row key. Unlike CommitLog, which is append-only, MemTable does not contain duplicates. A new write with a key that already exists in the MemTable overwrites the older record. This being in memory is both fast and efficient. The following is an example: Write 1: {k1: [{c1, v1}, {c2, v2}, {c3, v3}]} In CommitLog (new entry, append): {k1: [{c1, v1},{c2, v2}, {c3, v3}]} In MemTable (new entry, append): {k1: [{c1, v1}, {c2, v2}, {c3, v3}]} Write 2: {k2: [{c4, v4}]} In CommitLog (new entry, append): {k1: [{c1, v1}, {c2, v2}, {c3, v3}]} {k2: [{c4, v4}]} In MemTable (new entry, append): {k1: [{c1, v1}, {c2, v2}, {c3, v3}]} {k2: [{c4, v4}]} Write 3: {k1: [{c1, v5}, {c6, v6}]} In CommitLog (old entry, append): {k1: [{c1, v1}, {c2, v2}, {c3, v3}]} {k2: [{c4, v4}]} {k1: [{c1, v5}, {c6, v6}]} In MemTable (old entry, update): {k1: [{c1, v5}, {c2, v2}, {c3, v3}, {c6, v6}]} {k2: [{c4, v4}]} Cassandra Version 1.1.1 uses SnapTree (https://github.com/nbronson/snaptree) for MemTable representation, which claims it to be "... a drop-in replacement for ConcurrentSkipListMap, with the additional guarantee that clone() is atomic and iteration has snapshot isolation." See also copy-on-write and compare-and-swap (http://en.wikipedia.org/wiki/Copy-on-write, http://en.wikipedia.org/wiki/Compare-and-swap). Any write gets written first to CommitLog and then to MemTable. SSTable SSTable is a disk representation of the data. MemTables gets flushed to disk to immutable SSTables. All the writes are sequential, which makes this process fast. So, the faster the disk speed, the quicker the flush operation. The SSTables eventually get merged in the compaction process and the data gets organized properly into one file. This extra work in compaction pays off during reads. SSTables have three components: Bloom filter, index files, and datafiles. Bloom filter Bloom filter is a litmus test for the availability of certain data in storage (collection). But unlike a litmus test, a Bloom filter may result in false positives: that is, it says that a data exists in the collection associated with the Bloom filter, when it actually does not. A Bloom filter never results in a false negative. That is, it never states that a data is not there while it is. The reason to use Bloom filter, even with its false-positive defect, is because it is superfast and its implementation is really simple. Cassandra uses Bloom filters to determine whether an SSTable has the data for a particular row key. Bloom filters are unused for range scans, but they are good candidates for index scans. This saves a lot of disk I/O that might take in a full SSTable scan, which is a slow process. That's why it is used in Cassandra, to avoid reading many, many SSTables, which can become a bottleneck. Index files Index files are companion files of SSTables. The same as Bloom filter, there exists one index file per SSTable. It contains all the row keys in the SSTable and its offset at which the row starts in the datafile. At startup, Cassandra reads every 128th key (configurable) into the memory (sampled index). When the index is looked for a row key (after Bloom filter hinted that the row key might be in this SSTable), Cassandra performs a binary search on the sampled index in memory. Followed by a positive result from the binary search, Cassandra will have to read a block in the index file from the disk starting from the nearest value lower than the value that we are looking for. Datafiles Datafiles are the actual data. They contain row keys, metadata, and columns (partial or full). Reading data from datafiles is just one disk seek followed by a sequential read, as offset to a row key is already obtained from the associated index file. Compaction As mentioned earlier, a read require may require Cassandra to read across multiple SSTables to get a result. This is wasteful, costs multiple (disk) seeks, may require a conflict resolution, and if there are too many, SSTables were created. To handle this problem, Cassandra has a process in place, namely, compaction. Compaction merges multiple SSTable files into one. Off the shelf, Cassandra offers two types of compaction mechanism: Size Tiered Compaction Strategy and Level Compaction Strategy. The compaction process starts when the number of SSTables on disk reaches a certain threshold (N: configurable). Although the merge process is a little I/O intensive, it benefits in the long term with a lower number of disk seeks during reads. Apart from this, there are a few other benefits of compaction as follows: Removal of expired tombstones (Cassandra v0.8+) Merging row fragments Rebuilds primary and secondary indexes Tombstones Cassandra is a complex system with its data distributed among CommitLogs, MemTables, and SSTables on a node. The same data is then replicated over replica nodes. So, like everything else in Cassandra, deletion is going to be eventful. Deletion, to an extent, follows an update pattern except Cassandra tags the deleted data with a special value, and marks it as a tombstone. This marker helps future queries, compaction, and conflict resolution. Let's step further down and see what happens when a column from a column family is deleted. A client connected to a node (coordinator node, but it may not be the one holding the data that we are going to mutate), issues a delete command for a column C, in a column family CF. If the consistency level is satisfied, the delete command gets processed. When a node, containing the row key, receives a delete request, it updates or inserts the column in MemTable with a special value, namely, tombstone. The tombstone basically has the same column name as the previous one; the value is set to UNIX epoch. The timestamp is set to what the client has passed. When a MemTable is flushed to SSTable, all tombstones go into it as any regular column. On the read side, when the data is read locally on the node and it happens to have multiple versions of it in different SSTables, they are compared and the latest value is taken as the result of reconciliation. If a tombstone turns out to be a result of reconciliation, it is made a part of the result that this node returns. So, at this level, if a query has a deleted column, this exists in the result. But the tombstones will eventually be filtered out of the result before returning it back to the client. So, a client can never see a value that is a tombstone. Hinted handoff When we last talked about durability, we observed Cassandra provides CommitLogs to provide write durability. This is good. But what if the node, where the writes are going to be, is itself dead? No communication will keep anything new to be written to the node. Cassandra, inspired by Dynamo, has a feature called hinted handoff. In short, it's the same as taking a quick note locally that X cannot be contacted. Here is the mutation (operation that requires modification of the data such as insert, delete, and update) M that will be required to be replayed when it comes back. The coordinator node (the node which the client is connected to) on receipt of a mutation/write request, forwards it to appropriate replicas that are alive. If this fulfills the expected consistency level, write is assumed successful. The write requests to a node that does not respond to a write request or is known to be dead (via gossip) and is stored locally in the system.hints table. This hint contains the mutation. When a node comes to know via gossip that a node is recovered, it replays all the hints it has in store for that node. Also, every 10 minutes, it keeps checking any pending hinted handoffs to be written. Why worry about hinted handoff when you have written to satisfy consistency level? Wouldn't it eventually get repaired? Yes, that's right. Also, hinted handoff may not be the most reliable way to repair a missed write. What if the node that has hinted handoff dies? This is a reason why we do not count on hinted handoff as a mechanism to provide consistency (except for the case of the consistency level, ANY) guarantee; it's a single point of failure. The purpose of hinted handoff is, one, to make restored nodes quickly consistent with the other live ones; and two, to provide extreme write availability when consistency is not required. Read repair and Anti-entropy Cassandra promises eventual consistency and read repair is the process which does that part. Read repair, as the name suggests, is the process of fixing inconsistencies among the replicas at the time of read. What does that mean? Let's say we have three replica nodes A, B, and C that contain a data X. During an update, X is updated to X1 in replicas A and B. But it is failed in replica C for some reason. On a read request for data X, the coordinator node asks for a full read from the nearest node (based on the configured Snitch) and digest of data X from other nodes to satisfy consistency level. The coordinator node compares these values (something like digest(full_X) == digest_from_node_C). If it turns out that the digests are the same as the digests of full read, the system is consistent and the value is returned to the client. On the other hand, if there is a mismatch, full data is retrieved and reconciliation is done and the client is sent the reconciled value. After this, in background, all the replicas are updated with the reconciled value to have a consistent view of data on each node. See Figure 2.1 as shown: Figure 2.16: Image showing read repair dynamics. 1. Client queries for data x, from a node C (coordinator). 2. C gets data from replicas R1, R2, and R3; reconciles. 3. Sends reconciled data to client. 4. If there is a mismatch across replicas, repair is invoked. Merkle tree Merkle tree (A digital signature Based On A Conventional Encryption Function by Merkle, R. (1988), available at http://www.cse.msstate.edu/~ramkumar/merkle2.pdf.) is a hash tree where leaves of the tree hashes hold actual data in a column family and non-leaf nodes hold hashes of their children. The unique advantage of Merkle tree is a whole subtree can be validated just by looking at the value of the parent node. So, if nodes on two replica servers have the same hash values, then the underlying data is consistent and there is no need to synchronize. If one node passes the whole Merkle tree of a column family to another node, it can determine all the inconsistencies. Figure 2.17: Merkle tree to determine mismatch in hash values at parent nodes due to the difference in underlying data Summary By now, you are familiar with all the nuts and bolts of Cassandra. It is understandable that it may be a lot to take in for someone new to NoSQL systems. It is okay if you do not have complete clarity at this point. As you start working with Cassandra, tweaking it, experimenting with it, and going through the Cassandra mailing list discussions or talks, you will start to come across stuff that you have read in this article and it will start to make sense, and perhaps you may want to come back and refer to this article to improve clarity. It is not required to understand this article fully to be able to write queries, set up clusters, maintain clusters, or do anything else related to Cassandra. A general sense of this article will take you far enough to work extremely well with Cassandra-based projects. Resources for Article: Further resources on this subject: Apache Cassandra: Libraries and Applications [Article] Apache Felix Gogo [Article] Migration from Apache to Lighttpd [Article]

0
0
3608

article-image-creating-interactive-graphics-and-animation

Packt

02 Jan 2013

15 min read

Creating Interactive Graphics and Animation

Packt

02 Jan 2013

15 min read

(For more resources related to this topic, see here.) Interactive graphics and animations This article showcases MATLAB's capabilities for creating interactive graphics and animations. A static graphic is essentially two dimensional. The ability to rotate the axes and change the view, add annotations in real time, delete data, and zoom in or zoom out adds significantly to the user experience, as the brain is able to process and see more from that interaction. MATLAB supports interactivity with the standard zoom, pan features, a powerful set of camera tools to change the data view, data brushing, and axes linking. The set of functionalities accessible from the figure and camera toolbars are outlined briefly as follows: The steps of interactive exploration can also be recorded and presented as an animation. This is very useful to demonstrate the evolution of the data in time or space or along any dimension where sequence has meaning. Note that some recipes in this article may require you to run the code from the source code files as a whole unit because they were developed as functions. As functions, they are not independently interpretable using the separate code blocks corresponding to each step. Callback functions A mouse drag movement from the top-left corner to bottom-right corner is commonly used for zooming in or selecting a group of objects. You can also program a custom behavior to such an interaction event, by using a callback function. When a specific event occurs (for example, you click on a push button or double-click with your mouse), the corresponding callback function executes. Many event properties of graphics handle objects can be used to define callback functions. In this recipe, you will write callback functions which are essential to implement a slider element to get input from the user on where to create the slice or an isosurface for 3D exploration. You will also see options available to share data between the calling and callback functions. Getting started Load the dataset. Split the data into two main sets—userdataA is a structure with variables related to the demographics and userdataB is a structure with variables related to the Income Groups. Now create a nested structure with these two data structures as shown in the following code snippet: load customCountyData userdataA.demgraphics = demgraphics; userdataA.lege = lege; userdataB.incomeGroups = incomeGroups; userdataB.crimeRateLI = crimeRateLI; userdataB.crimeRateHI = crimeRateHI; userdataB.crimeRateMI = crimeRateMI; userdataB.AverageSATScoresLI = AverageSATScoresLI; userdataB.AverageSATScoresMI = AverageSATScoresMI; userdataB.AverageSATScoresHI = AverageSATScoresHI; userdataB.icleg = icleg; userdataAB.years = years; userdataAB.userdataA = userdataA; userdataAB.userdataB = userdataB; How to do it... Perform the following steps: Run this as a function at the console: c3165_07_01_callback_functions A figure is brought up with a non-standard menu item as highlighted in the following screenshot. Select the By Population item: Here is the resultant figure: Continue to explore the other options to fully exercise the interactivity built into this graphic. How it works... The function c3165_07_01_callback_functions works as follows: A custom menu item Data Groups is created, with additional submenu items—By population, By Income Groups, or Show all. % add main menu item f = uimenu('Label','Data Groups'); % add sub menu items with additional parameters uimenu(f,'Label','By Population','Callback','showData',... 'tag','demographics','userdata',userdataAB); uimenu(f,'Label','By IncomeGroups',... 'Callback','showData','tag','IncomeGroups',... 'userdata',userdataAB); uimenu(f,'Label','ShowAll','Callback','showData',... 'tag','together','userdata',userdataAB); You defined the tag name and the callback function for each submenu item above. Having a tag name makes it easier to use the same callback function with multiple objects because you can query the tag name to find out which object initiated the call to the callback function (if you need that information). In this example, the callback function behavior is dependent upon which submenu item was selected. So the tag property allowed you to use the single function showData as callback for all three submenu items and still implement submenu item specific behavior. Alternately, you could also register three different callback functions and use no tag names. You can specify the value of a callback property in three ways. Here, you gave it a function handle. Alternately, you can supply a string that is a MATLAB command that executes when the callback is invoked. Or, a cell array with the function handle and additional arguments as you will see in the next section. For passing data between the calling and callback function, you also have three options. Here, you set the userdata property to the variable name that has the data needed by the callback function. Note that the userdata is just one variable and you passed a complicated data structure as userdata to effectively pass multiple values. The user data can be extracted from within the callback function of the object or menu item whose callback is executing as follows: userdata = get(gcbo,'userdata'); The second alternative to pass data to callback functions is by means of the application data. This does not require you to build a complicated data structure. Depending on how much data you need to pass, this later option may be the faster mechanism. It also has the advantage that the userdata space cannot inadvertently get overwritten by some other function. Use the setappdata function to pass multiple variables. In this recipe, you maintained the main drawing area axis handles and the custom legend axis handles as application data. setappdata(gcf,'mainAxes',[]); setappdata(gcf,'labelAxes',[]); This was retrieved each time within the executing callback functions, to clear the graphic as new choices are selected by the user from the custom menu. mainAxesHandle = getappdata(gcf,'mainAxes'); labelAxesHandles = getappdata(gcf,'labelAxes'); if ~isempty(mainAxesHandle), cla(mainAxesHandle); [mainAxesHandle, x, y, ci, cd] = ... redrawGrid(userdata.years, mainAxesHandle); else [mainAxesHandle, x, y, ci, cd] = ... redrawGrid(userdata.years); end if ~isempty(labelAxesHandles) for ij = 1:length(labelAxesHandles) cla(labelAxesHandles(ij)); end end The third option to pass data to callback functions is at the time of defining the callback property, where you can supply a cell array with the function handle and additional arguments as you will see in the next section. These are local copies of data passed onto the function and will not affect the global values of the variables. The callback function showData is given below. Functions that you want to use as function handle callbacks must define at least two input arguments in the function definition: the handle of the object generating the callback (the source of the event), the event data structure (can be empty for some callbacks). function showData(src, evt) userdata = get(gcbo,'userdata'); if strcmp(get(gcbo,'tag'),'demographics') % Call grid f drawing code block % Call showDemographics with relevant inputs elseif strcmp(get(gcbo,'tag'),'IncomeGroups') % Call grid drawing code block % Call showIncomeGroups with relevant inputs else % Call grid drawing code block % Call showDemographics with relevant inputs % Call showIncomeGroups with relevant inputs end function labelAxesHandle = ... showDemographics(userdata, mainAxesHandle, x, y, cd) % Function specific code end function labelAxesHandle = ... showIncomeGroups(userdata, mainAxesHandle, x, y, ci) % Function specific code end function [mainAxesHandle x y ci cd] = ... redrawGrid(years, mainAxesHandle) % Grid drawing function specific code end end There's more... This section demonstrates the third option to pass data to callback functions by supplying a cell array with the function handle and additional arguments at the time of defining the callback property. Add a fourth submenu item as follows (uncomment line 45 of the source code): uimenu(f,'Label',... 'Alternative way to pass data to callback',... 'Callback',{@showData1,userdataAB},'tag','blah'); Define the showData1 function as follows (uncomment lines 49 to 51 of the source code): function showData1(src, evt, arg1) disp(arg1.years); end Execute the function and see that the value of the years variable are displayed at the MATLAB console when you select the last submenu Alternative way to pass data to callback option. Takeaways from this recipe: Use callback functions to define custom responses for each user interaction with your graphic Use one of the three options for sharing data between calling and callback functions—pass data as arguments with the callback definition, or via the user data space, or via the application data space, as appropriate See also Look up MATLAB help on the setappdata, getappdata, userdata property, callback property, and uimenu commands. Obtaining user input from the graph User input may be desired for annotating data in terms of adding a label to one or more data points, or allowing user settable boundary definitions on the graphic. This recipe illustrates how to use MATLAB to support these needs. Getting started The recipe shows a two-dimensional dataset of intensity values obtained from two different dye fluorescence readings. There are some clearly identifiable clusters of points in this 2D space. The user is allowed to draw boundaries to group points and identify these clusters. Load the data: load clusterInteractivData The imellipse function from the MATLAB image processing toolboxTM is used in this recipe. Trial downloads are available from their website. How to do it... The function constitutes the following steps: Set up the user data variables to share the data between the callback functions of the push button elements in this graph: userdata.symbChoice = {'+','x','o','s','^'}; userdata.boundDef = []; userdata.X = X; userdata.Y = Y; userdata.Calls = ones(size(X)); set(gcf,'userdata',userdata); Make the initial plot of the data: plot(userdata.X,userdata.Y,'k.','Markersize',18); hold on; Add the push button elements to the graphic: uicontrol('style','pushbutton',... 'string','Add cluster boundaries?', ... 'Callback',@addBound, ... 'Position', [10 21 250 20],'fontsize',12); uicontrol('style','pushbutton', ... 'string','Classify', ... 'Callback',@classifyPts, ... 'Position', [270 21 100 20],'fontsize',12); uicontrol('style','pushbutton', ... 'string','Clear Boundaries', ... 'Callback',@clearBounds, ... 'Position', [380 21 150 20],'fontsize',12); Define callback for each of the pushbutton elements. The addBound function is for defining the cluster boundaries. The steps are as follows: % Retrieve the userdata data userdata = get(gcf,'userdata'); % Allow a maximum of four cluster boundary definitions if length(userdata.boundDef)>4 msgbox('A maximum of four clusters allowed!'); return; end % Allow user to define a bounding curve h=imellipse(gca); % The boundary definition is added to a cell array with % each element of the array storing the boundary def. userdata.boundDef{length(userdata.boundDef)+1} = ... h.getPosition; set(gcf,'userdata',userdata); The classifyPts function draws points enclosed in a given boundary with a unique symbol per boundary definition. The logic used in this classification function is simple and will run into difficulties with complex boundary definitions. However, that is ignored as that is not the focus of this recipe. Here, first find points whose coordinates lie in the range defined by the coordinates of the boundary definition. Then, assign a unique symbol to all points within that boundary: for i = 1:length(userdata.boundDef) pts = ... find( (userdata.X>(userdata.boundDef{i}(:,1)))& ... (userdata.X<(userdata.boundDef{i}(:,1)+ ... userdata.boundDef{i}(:,3))) &... (userdata.Y>(userdata.boundDef{i}(:,2)))& ... (userdata.Y<(userdata.boundDef{i}(:,2)+ ... userdata.boundDef{i}(:,4)))); userdata.Calls(pts) = i; plot(userdata.X(pts),userdata.Y(pts), ... [userdata.colorChoice{i} '.'], ... 'Markersize',18); hold on; end The clearBounds function clears the drawn boundaries and removes the clustering based upon those boundary definitions. function clearBounds(src, evt) cla; userdata = get(gcf,'userdata'); userdata.boundDef = []; set(gcf,'userdata',userdata); plot(userdata.X,userdata.Y,'k.','Markersize',18); hold on; end Run the code and define cluster boundaries using the mouse. Note that until you click the on the Classify button, classification does not occur. Here is a snapshot of how it looks (the arrow and dashed boundary is used to depict the cursor movement from user interaction): Initiate a classification by clicking on Classify. The graph will respond by re-drawing all points inside the constructed boundary with a specific symbol: How it works... This recipe illustrates how user input is obtained from the graphical display in order to impact the results produced. The image processing toolbox has several such functions that allow user to provide input by mouse clicks on the graphical display—such as imellipse for drawing elliptical boundaries, and imrect for drawing rectangular boundaries. You can refer to the product pages for more information. Takeaways from this recipe: Obtain user input directly via the graph in terms of data point level annotations and/or user settable boundary definitions See also Look up MATLAB help on the imlineimpoly, imfreehandimrect, and imelli pseginput commands. Linked axes and data brushing MATLAB allows creation of programmatic links between the plot and the data sources and linking different plots together. This feature is augmented by support for data brushing, which is a way to select data and mark it up to distinguish from others. Linking plots to their data source allows you to manipulate the values in the variables and have the plot automatically get updated to reflect the changes. Linking between axes enables actions such as zoom or pan to simultaneously affect the view in all linked axes. Data brushing allows you to directly manipulate the data on the plot and have the linked views reflect the effect of that manipulation and/or selection. These features can provide a live and synchronized view of different aspects of your data. Getting ready You will use the same cluster data as the previous recipe. Each point is denoted by an x and y value pair. The angle of each point can be computed as the inverse tangent of the ratio of the y value to the x value. The amplitude of each point can be computed as the square root of the sum of squares of the x and y values. The main panel in row 1 show the data in a scatter plot. The two plots in the second row have the angle and amplitude values of each point respectively. The fourth and fifth panels in the third row are histograms of the x and y values respectively. Load the data and calculate the angle and amplitude data as described earlier: load clusterInteractivData data(:,1) = X; data(:,2) = Y; data(:,3) = atan(Y./X); data(:,4) = sqrt(X.^2 + Y.^2); clear X Y How to do it... Perform the following steps: Plot the raw data: axes('position',[.3196 .6191 .3537 .3211], ... 'Fontsize',12); scatter(data(:,1), data(:,2),'ks', ... 'XDataSource','data(:,1)','YDataSource','data(:,2)'); box on; xlabel('Dye 1 Intensity'); ylabel('Dye 1 Intensity');title('Cluster Plot'); Plot the angle data: axes('position',[.0682 .3009 .4051 .2240], ... 'Fontsize',12); scatter(1:length(data),data(:,3),'ks',... 'YDataSource','data(:,3)'); box on; xlabel('Serial Number of Points'); title('Angle made by each point to the x axis'); ylabel('tan^{-1}(Y/X)'); Plot the amplitude data: axes('position',[.5588 .3009 .4051 .2240], ... 'Fontsize',12); scatter(1:length(data),data(:,4),'ks', ... 'YDataSource','data(:,4)'); box on; xlabel('Serial Number of Points'); title('Amplitude of each point'); ylabel('{surd(X^2 + Y^2)}'); Plot the two histograms: axes('position',[.0682 .0407 .4051 .1730], ... 'Fontsize',12); hist(data(:,1)); title('Histogram of Dye 1 Intensities'); axes('position',[.5588 .0407 .4051 .1730], ... 'Fontsize',12); hist(data(:,2)); title('Histogram of Dye 2 Intensities'); The output is as follows: Programmatically, link the data to their source: linkdata; Programmatically, turn brushing on and set the brush color to green: h = brush; set(h,'Color',[0 1 0],'Enable','on'); Use mouse movements to brush a set of points. You could do this on any one of the first three panels and observe the impact on corresponding points in the other graphs by its turning green. (The arrow and dashed boundary is used to depict the cursor movement from user interaction in the following figure): How it works... Because brushing is turned on, when you focus the mouse on any of the graph areas, a cross hair shows up at the cursor. You can drag to select an area of the graph. Points falling within the selected area are brushed to the color green, for the graphs on rows 1 and 2. Note that nothing is highlighted on the histograms at this point. This is because the x and y data source for the histograms is not correctly linked to the data source variables yet. For the other graphs, you programmatically set their x and y data source via the XDataSource and the YDataSource properties. You can also define the source data variables to link to a graphic and turn brushing on by using the icons from the figure toolbar as shown in the following screenshot. The first circle highlights the brush button; the second circle highlights the link data button. You can click on the Edit link pointed by the arrow to exactly define the x and y sources: There's more... To define the source data variables to link to a graphic and turn brushing on by using the icons from the Figure toolbar, do as follows: Clicking on Edit (pointed to in preceding figure) will bring up the following window: Enter data(:,1) in the YDataSource column for row 1 and data(:,2) in the YDataSource column for row 2. Now try brushing again. Observe that bins of the histogram get highlights in a bottom up order as corresponding points get selected (again, the arrow and dashed boundary is used to depict the cursor movement from user interaction): Link axes together to simultaneously investigate multiple aspects of the same data point. For example, in this step you plot the cluster data alongside a random quality value for each point of the data. Link the axes such that zoom and pan functions on either will impact the axes of the other linked axes: axes('position',[.13 .11 .34 .71]); scatter(data(:,1), data(:,2),'ks');box on; axes('position',[.57 .11 .34 .71]); scatter(data(:,1), data(:,2),[],rand(size(data,1),1), ... 'marker','o', 'LineWidth',2);box on; linkaxes; The output is as follows. Experiment with zoom and pan functionalities on this graph. Takeaways from this recipe: Use data brushing and linked axes features to provide a live and synchronized view of different aspects of your data. See also Look up MATLAB help on the linkdata, linkaxes, and brush commands.

0
0
3581

article-image-integrating-solr-ruby-rails-integration

Packt

09 Sep 2010

12 min read

Integrating Solr: Ruby on Rails Integration

Packt

09 Sep 2010

12 min read

(For more resources on Solr, see here.) The classic plugin for Rails is acts_as_solr that allows Rails ActiveRecord objects to be transparently stored in a Solr index. Other popular options include Solr Flare and rsolr. An interesting project is Blacklight, a tool oriented towards libraries putting their catalogs online. While it attempts to meet the needs of a specific market, it also contains many examples of great Ruby techniques to leverage in your own projects. You will need to turn on the Ruby writer type in solrconfig.xml: <queryResponseWriter name="ruby" class="org.apache.solr.request.RubyResponseWriter"/> The Ruby hash structure has some tweaks to fit Ruby, such as translating nulls to nils, using single quotes for escaping content, and the Ruby => operator to separate key-value pairs in maps. Adding a wt=ruby parameter to a standard search request returns results in a Ruby hash structure like this: { 'responseHeader'=>{ 'status'=>0, 'QTime'=>1, 'params'=>{ 'wt'=>'ruby', 'indent'=>'on', 'rows'=>'1', 'start'=>'0', 'q'=>'Pete Moutso'}}, 'response'=>{'numFound'=>523,'start'=>0,'docs'=>[ { 'a_name'=>'Pete Moutso', 'a_type'=>'1', 'id'=>'Artist:371203', 'type'=>'Artist'}]}} acts_as_solr A very common naming pattern for plugins in Rails that manipulate the database backed object model is to name them acts_as_X. For example, the very popular acts_as_list plugin for Rails allows you to add list semantics, like first, last, move_next to an unordered collection of items. In the same manner, acts_as_solr takes ActiveRecord model objects and transparently indexes them in Solr. This allows you to do fuzzy queries that are backed by Solr searches, but still work with your normal ActiveRecord objects. Let's go ahead and build a small Rails application that we'll call MyFaves that both allows you to store your favorite MusicBrainz artists in a relational model and allows you to search for them using Solr. acts_as_solr comes bundled with a full copy of Solr 1.3 as part of the plugin, which you can easily start by running rake solr:start. Typically, you are starting with a relational database already stuffed with content that you want to make searchable. However, in our case we already have a fully populated index available in /examples, and we are actually going to take the basic artist information out of the mbartists index of Solr and populate our local myfaves database with it. We'll then fire up the version of Solr shipped with acts_as_solr, and see how acts_as_solr manages the lifecycle of ActiveRecord objects to keep Solr's indexed content in sync with the content stored in the relational database. Don't worry, we'll take it step by step! The completed application is in /examples/8/myfaves for you to refer to. Setting up MyFaves project We'll start with the standard plumbing to get a Rails application set up with our basic data model: >>rails myfaves>>cd myfaves>>./script/generate scaffold artist name:string group_type:string release_date:datetime image_url:string>>rake db:migrate This generates a basic application backed by an SQLite database. Now we need to install the acts_as_solr plugin. acts_as_solr has gone through a number of revisions, from the original code base done by Erik Hatcher and posted to the solr-user mailing list in August of 2006, which was then extended by Thiago Jackiw and hosted on Rubyforge. Today the best version of acts_as_solr is hosted on GitHub by Mathias Meyer at http://github.com/ mattmatt/acts_as_solr/tree/master. The constant migration from one site to another leading to multiple possible 'best' versions of a plugin is unfortunately a very common problem with Rails plugins and projects, though most are settling on either RubyForge.org or GitHub.com. In order to install the plugin, run: >>script/plugin install git://github.com/mattmatt/acts_as_solr.gitt We'll also be working with roughly 399,000 artists, so obviously we'll need some page pagination to manage that list, otherwise pulling up the artists /index listing page will timeout: >>script/plugin install git://github.com/mislav/will_paginate.git Edit the ./app/controllers/artists_controller.rb file, and replace in the index method the call to @artists = Artist.find(:all) with: @artists = Artist.paginate :page => params[:page], :order => 'created_at DESC' Also add to ./app/views/artists/index.html.erb a call to the view helper to generate the page links: <%= will_paginate @artists %> Start the application using ./script/server, and visit the page http://localhost:3000/artists/. You should see an empty listing page for all of the artists. Now that we know the basics are working, let's go ahead and actually leverage Solr. Populating MyFaves relational database from Solr Step one will be to import data into our relational database from the mbartists Solr index. Add the following code to ./app/models/artist.rb: class Artist < ActiveRecord::Base acts_as_solr :fields => [:name, :group_type, :release_date]end The :fields array of hashes maps the attributes of the Artist ActiveRecord object to the artist fields in Solr's schema.xml. Because acts_as_solr is designed to store data in Solr that is mastered in your data model, it needs a way of distinguishing among various types of data model objects. For example, if we wanted to store information about our User model object in Solr in addition to the Artist object then we need to provide a type_field to separate the Solr documents for the artist with the primary key of 5 from the user with the primary key of 5. Fortunately the mbartists schema has a field named type that stores the value Artist, which maps directly to our ActiveRecord class name of Artist and we are able to use that instead of the default acts_as_solr type field in Solr named type_s. There is a simple script called populate.rb at the root of /examples/8/myfaves that you can run that will copy the artist data from the existing Solr mbartists index into the MyFaves database: >>ruby populate.rb populate.rb is a great example of the types of scripts you may need to develop to transfer data into and out of Solr. Most scripts typically work with some sort of batch size of records that are pulled from one system and then inserted into Solr. The larger the batch size, the more efficient the pulling and processing of data typically is at the cost of more memory being consumed, and the slower the commit and optimize operations are. When you run the populate.rb script, play with the batch size parameter to get a sense of resource consumption in your environment. Try a batch size of 10 versus 10000 to see the changes. The parameters for populate.rb are available at the top of the script: MBARTISTS_SOLR_URL = 'http://localhost:8983/solr/mbartists'BATCH_SIZE = 1500MAX_RECORDS = 100000 # the maximum number of records to load, or nil for all There are roughly 399,000 artists in the mbartists index, so if you are impatient, then you can set MAX_RECORDS to a more reasonable number. The process for connecting to Solr is very simple with a hash of parameters that are passed as part of the GET request. We use the magic query value of *:* to find all of the artists in the index and then iterate through the results using the start parameter: connection = Solr::Connection.new(MBARTISTS_SOLR_URL) solr_data = connection.send(Solr::Request::Standard.new({ :query => '*:*', :rows=> BATCH_SIZE, :start => offset, :field_list =>['*','score'] })) In order to create our new Artist model objects, we just iterate through the results of solr_data. If solr_data is nil, then we exit out of the script knowing that we've run out of results. However, we do have to do some parsing translation in order to preserve our unique identifiers between Solr and the database. In our MusicBrainz Solr schema, the ID field functions as the primary key and looks like Artist:11650 for The Smashing Pumpkins. In the database, in order to sync the two, we need to insert the Artist with the ID of 11650. We wrap the insert statement a.save! in a begin/rescue/end structure so that if we've already inserted an artist with a primary key, then the script continues. This just allows us to run the populate script multiple times: solr_data.hits.each do |doc| id = doc["id"] id = id[7..(id.length)] a = Artist.new(:name => doc["a_name"], :group_type => a["a_type"], :release_date => doc["a_release_date_latest"]) a.id = id begin a.save! rescue ActiveRecord::StatementInvalid => ar_si raise ar_si unless ar_si.to_s.include?("PRIMARY KEY must be unique") #sink duplicates endend Now that we've transferred the data out of our mbartists index and used acts_as_solr according to the various conventions that it expects, we'll change from using the mbartists Solr instance to the version of Solr shipped with acts_as_solr. Solr related configuration information is available in ./myfaves/config/solr.xml. Ensure that the default development URL doesn't conflict with any existing Solr's you may be running: development: url: http://127.0.0.1:8982/solr Start the included Solr by running rake solr:start. When it starts up, it will report the process ID for Solr running in the background. If you need to stop the process, then run the corresponding rake task: rake solr:stop. The empty new Solr indexes are stored in ./myfaves/solr/development. Build Solr indexes from relational database Now we are ready to trigger a full index of the data in the relational database into Solr. acts_as_solr provides a very convenient rake task for this with a variety of parameters that you can learn about by running rake -D solr:reindex. We'll specify to work with a batch size of 1500 artists at a time: >>rake solr:start>>% rake solr:reindex BATCH=1500(in /examples/8/myfaves)Clearing index for Artist...Rebuilding index for Artist...Optimizing... This drastic simplification of configuration in the Artist model object is because we are using a Solr schema that is designed to leverage the Convention over Configuration ideas of Rails. Some of the conventions that are established by acts_as_solr and met by Solr are: Primary key field for model object in Solr is always called pk_i. Type field that stores the disambiguating class name of the model object is called type_s. Heavy use of the dynamic field support in Solr. The data type of ActiveRecord model objects is based on the database column type. Therefore, when acts_as_solr indexes a model object, it sends a document to Solr with the various suffixes to leverage the dynamic column creation. In /examples/8/myfaves/vendor/plugins/acts_as_solr/solr/solr/conf/ schema.xml, the only fields defined outside of the management fields are dynamic fields: <dynamicField name="*_t" type="text" indexed="true" stored="false"/> The default search field is called text. And all of the fields ending in _t are copied into the text search field. Fields to facet on are named _facet and copied into the text search field as well. The document that gets sent to Solr for our Artist records creates the dynamic fields name_t, group_type_s and release_date_d, for a text, string, and date field respectively. You can see the list of dynamic fields generated through the schema browser at http://localhost:8982/solr/admin/schema.jsp. Now we are ready to perform some searches. acts_as_solr adds some new methods such as find_by_solr() that lets us find ActiveRecord model objects by sending a query to Solr. Here we find the group Smash Mouth by searching for matches to the word smashing: % ./script/consoleLoading development environment (Rails 2.3.2)>> artists = Artist.find_by_solr("smashing")=> #<ActsAsSolr::SearchResults:0x224889c @solr_data={:total=>9, :docs=>[#<Artist id: 364, name: "Smash Mouth"...>> artists.docs.first=> #<Artist id: 364, name: "Smash Mouth", group_type: 1, release_date: "2006-09-19 04:00:00", created_at: "2009-04-17 18:02:37", updated_at: "2009-04-17 18:02:37"> Let's also verify that acts_as_solr is managing the full lifecycle of our objects. Assuming Susan Boyle isn't yet entered as an artist, let's go ahead and create her: >> Artist.find_by_solr("Susan Boyle")=> #<ActsAsSolr::SearchResults:0x26ee298 @solr_data={:total=>0, :docs=>[]}>>> susan = Artist.create(:name => "Susan Boyle", :group_type => 1, :release_date => Date.new)=> #<Artist id: 548200, name: "Susan Boyle", group_type: 1, release_date: "-4712-01-01 05:00:00", created_at: "2009-04-21 13:11:09", updated_at: "2009-04-21 13:11:09"> Check the log output from your Solr running on port 8982, and you should also have seen an update query triggered by the insert of the new Susan Boyle record: INFO: [] webapp=/solr path=/update params={} status=0 QTime=24 Now, if we delete Susan's record from our database: >> susan.destroy=> #<Artist id: 548200, name: "Susan Boyle", group_type: 1, release_date: "-4712-01-01 05:00:00", created_at: "2009-04-21 13:11:09", updated_at: "2009-04-21 13:11:09">=> #<Artist id: 548200, name: "Susan Boyle", group_type: 1, release_date: "-4712-01-01 05:00:00", created_at: "2009-04-21 13:11:09", updated_at: "2009-04-21 13:11:09"> Then there should be another corresponding update issued to Solr to remove the document: INFO: [] webapp=/solr path=/update params={} status=0 QTime=57 You can verify this by doing a search for Susan Boyle directly, which should return no rows at http://localhost:8982/solr/select/?q=Susan+Boyle.

0
0
3574

Packt

07 Apr 2015

9 min read

Work Item Querying

Packt

07 Apr 2015

9 min read

0
0
3573

article-image-understanding-amazon-machine-learning-workflow

Natasha Mathur

24 Aug 2018

11 min read

Understanding Amazon Machine Learning Workflow [ Tutorial ]

Natasha Mathur

24 Aug 2018

11 min read

This article presents an overview of the workflow of a simple Amazon Machine Learning (Amazon ML) project. Amazon Machine Learning is an online service by Amazon Web Services (AWS) that does supervised learning for predictive analytics. Launched in April 2015 at the AWS Summit, Amazon ML joins a growing list of cloud-based machine learning services, such as Microsoft Azure, Google prediction, IBM Watson, Prediction IO, BigML, and many others. These online machine learning services form an offer commonly referred to as Machine Learning as a Service or MLaaS following a similar denomination pattern of other cloud-based services such as SaaS, PaaS, and IaaS respectively for Software, Platform, or Infrastructure as a Service. The Amazon ML workflow closely follows a standard Data Science workflow with steps: Extract the data and clean it up. Make it available to the algorithm. Split the data into a training and validation set, typically a 70/30 split with equal distribution of the predictors in each part. Select the best model by training several models on the training dataset and comparing their performances on the validation dataset. Use the best model for predictions on new data. This article is an excerpt taken from the book 'Effective Amazon Machine Learning' written by Alexis Perrier. As shown in the following Amazon ML menu, the service is built around four objects: Datasource ML model Evaluation Prediction The Datasource and Model can also be configured and set up in the same flow by creating a new Datasource and ML model. We will take a closer look at the Datasource and ML model. Amazon ML dataset For the rest of the article, we will use the simple Predicting Weight by Height and Age dataset (from Lewis Taylor (1967)) with 237 samples of children's age, weight, height, and gender, which is available at https://v8doc.sas.com/sashtml/stat/chap55/sect51.htm. This dataset is composed of 237 rows. Each row has the following predictors: sex (F, M), age (in months), height (in inches), and we are trying to predict the weight (in lbs) of these children. There are no missing values and no outliers. The variables are close enough in range and normalization is not required. In short, we do not need to carry out any preprocessing or cleaning on the original dataset. Age, height, and weight are numerical variables (real-valued), and sex is a categorical variable. We will randomly select 20% of the rows as the held-out subset to use for the prediction of previously unseen data and keep the other 80% as training and evaluation data. This data split can be done in Excel or any other spreadsheet editor: By creating a new column with randomly generated numbers Sorting the spreadsheet by that column Selecting 190 rows for training and 47 rows for prediction (roughly a 80/20 split) Let us name the training set LT67_training.csv and the held-out set that we will use for prediction LT67_heldout.csv, where LT67 stands for Lewis and Taylor, the creator of this dataset in 1967. Note that it is important for the distribution in age, sex, height, and weight to be similar in both subsets. We want the data on which we will make predictions to show patterns that are similar to the data on which we will train and optimize our model. Loading the data on Amazon S3 Follow these steps to load the training and held-out datasets on S3: Go to your s3 console at https://console.aws.amazon.com/s3. Create a bucket if you haven't done so already. Buckets are basically folders that are uniquely named across all S3. We created a bucket named aml.packt. Since that name has now been taken, you will have to choose another bucket name if you are following along with this demonstration. Click on the bucket name you created and upload both the LT67_training.csv and LT67_heldout.csv files by selecting Upload from the Actions drop-down menu: Both files are small, only a few KB, and hosting costs should remain negligible for that exercise. Note that for each file, by selecting the Properties tab on the right, you can specify how your files are accessed, what user, role, group or AWS service may download, read, write, and delete the files, and whether or not they should be accessible from the Open Web. When creating the datasource in Amazon ML, you will be prompted to grant Amazon ML access to your input data. You can specify the access rules to these files now in S3 or simply grant access later on. Our data is now in the cloud in an S3 bucket. We need to tell Amazon ML where to find that input data by creating a datasource. We will first create the datasource for the training file ST67_training.csv. Declaring a datasource Go to the Amazon ML dashboard, and click on Create new... | Datasource and ML model. We will use the faster flow available by default: As shown in the following screenshot, you are asked to specify the path to the LT67_training.csv file {S3://bucket}{path}{file}. Note that the S3 location field automatically populates with the bucket names and file names that are available to your user: Specifying a Datasource name is used to organize your Amazon ML assets. By clicking on Verify, Amazon ML will make sure that it has the proper rights to access the file. In case it needs to be granted access to the file, you will be prompted to do so as shown in the following screenshot: Just click on Yes to grant access. At this point, Amazon ML will validate the datasource and analyze its contents. Creating the datasource An Amazon ML datasource is composed of the following: The location of the data file: The data file is not duplicated or cloned in Amazon ML but accessed from S3 The schema that contains information on the type of the variables contained in the CSV file: Categorical Text Numeric (real-valued) Binary It is possible to supply Amazon ML with your own schema or modify the one created by Amazon ML. At this point, Amazon ML has a pretty good idea of the type of data in your training dataset. It has identified the different types of variables and knows how many rows it has: Move on to the next step by clicking on Continue, and see what schema Amazon ML has inferred from the dataset as shown in the next screenshot: Amazon ML needs to know at that point which is the variable you are trying to predict. Be sure to tell Amazon ML the following: The first line in the CSV file contains te column name The target is the weight We see here that Amazon ML has correctly inferred the following: sex is categorical age, height, and weight are numeric (continuous real values) Since we chose a numeric variable as the target Amazon ML, will use Linear Regression as the predictive model. For binary or categorical values, we would have used Logistic Regression. This means that Amazon ML will try to find the best a, b, and c coefficients so that the weight predicted by the following equation is as close as possible to the observed real weight present in the data: predicted weight = a * age + b * height + c * sex Amazon ML will then ask you if your data contains a row identifier. In our present case, it does not. Row identifiers are used when you want to understand the prediction obtained for each row or add an extra column to your dataset later on in your project. Row identifiers are for reference purposes only and are not used by the service to build the model. You will be asked to review the datasource. You can go back to each one of the previous steps and edit the parameters for the schema, the target, and the input data. Now that the data is known to Amazon ML, the next step is to set up the parameters of the algorithm that will train the model. The machine learning model We select the default parameters for the training and evaluation settings. Amazon ML will do the following: Create a step for data transformation based on the statistical properties it has inferred from the dataset Split the dataset (ST67_training.csv) into a training part and a validation part, with a 70/30 split. The split strategy assumes the data has already been shuffled and can be split sequentially. The step will be used to transform the data in a similar way for the training and the validation datasets. The only transformation suggested by Amazon ML is to transform the categorical variable sex into a binary variable, where m = 0 and f = 1 for instance. No other transformation is needed. The default advanced settings for the model are shown in the following screenshot: We see that Amazon ML will pass over the data 10 times, shuffle splitting the data each time. It will use an L2 regularization strategy based on the sum of the square of the coefficients of the regression to prevent overfitting. We will evaluate the predictive power of the model using our LT67_heldout.csv dataset later on. Regularization comes in 3 levels with a mild (10^-6), medium (10^-4), or aggressive (10^-02) setting, each value stronger than the previous one. The default setting is mild, the lowest, with a regularization constant of 0.00001 (10^-6) implying that Amazon ML does not anticipate much overfitting on this dataset. This makes sense when the number of predictors, three in our case, is much smaller than the number of samples (190 for the training set). Clicking on the Create ML model button will launch the model creation. This takes a few minutes to resolve, depending on the size and complexity of your dataset. You can check its status by refreshing the model page. In the meantime, the model status remains pending. At that point, Amazon ML will split our training dataset into two subsets: a training and a validation set. It will use the training portion of the data to train several settings of the algorithm and select the best one based on its performance on the training data. It will then apply the associated model to the validation set and return an evaluation score for that model. By default, Amazon ML will sequentially take the first 70% of the samples for training and the remaining 30% for validation. It's worth noting that Amazon ML will not create two extra files and store them on S3, but instead create two new datasources out of the initial datasource we have previously defined. Each new datasource is obtained from the original one via a Data rearrangement JSON recipe such as the following: { "splitting": { "percentBegin": 0, "percentEnd": 70 } } You can see these two new datasources in the Datasource dashboard. Three datasources are now available where there was initially only one, as shown by the following screenshot: While the model is being trained, Amazon ML runs the Stochastic Gradient algorithm several times on the training data with different parameters: Varying the learning rate in increments of powers of 10: 0.01, 0.1, 1, 10, and 100. Making several passes over the training data while shuffling the samples before each path. At each pass, calculating the prediction error, the Root Mean Squared Error (RMSE), to estimate how much of an improvement over the last pass was obtained. If the decrease in RMSE is not really significant, the algorithm is considered to have converged, and no further pass shall be made. At the end of the passes, the setting that ends up with the lowest RMSE wins, and the associated model (the weights of the regression) is selected as the best version. Once the model has finished training, Amazon ML evaluates its performance on the validation datasource. Once the evaluation itself is also ready, you have access to the model's evaluation. The Amazon ML flow is smooth and facilitates the inherent data science loop: data, model, evaluation, and prediction. We looked at an overview of the workflow of a simple Amazon Machine Learning (Amazon ML) project. We discussed two objects of the Amazon ML menu: Datasource and ML model. If you found this post useful, be sure to check out the book 'Effective Amazon Machine Learning' to learn about evaluation and prediction in Amazon ML along with other AWS ML concepts. Integrate applications with AWS services: Amazon DynamoDB & Amazon Kinesis [Tutorial] AWS makes Amazon Rekognition, its image recognition AI, available for Asia-Pacific developers

0
0
3573

article-image-top-research-papers-nips-2017-part-2

Sugandha Lahoti

07 Dec 2017

8 min read

Top Research papers showcased at NIPS 2017 - Part 2

Sugandha Lahoti

07 Dec 2017

8 min read

Continuing from where we left our previous post, we are back with a quick roundup of top research papers on Machine Translation, Predictive Modelling, Image-to-Image Translation, and Recommendation Systems from NIPS 2017. Machine Translation In layman terms, Machine translation (MT) is the process by which a computer software translates a text from one natural language to another. This year at NIPS, a large number of presentations focused on innovative ways of improving translations. Here are our top picks. Value Networks: Improving beam search for better Translation Microsoft has ventured into translation tasks with the introduction of Value Networks in their paper “Decoding with Value Networks for Neural Machine Translation”. Their prediction network improves beam search which is a shortcoming of Neural Machine Translation (NMT). This new methodology inspired by the success of AlphaGo, takes the source sentence x, the currently available decoding output y1, ··· , yt1 and a candidate word w at step t as inputs, using which it predicts the long-term value (e.g., BLEU score) of the partial target sentence if it is completed by the NMT(Neural Machine Translational) model. Experiments show that this approach significantly improves the translation accuracy of several translation tasks. CoVe: Contextualizing Word Vectors for Machine Translation Salesforce researchers have used a new approach to contextualize word vectors in their paper “Learned in Translation: Contextualized Word Vectors”. A wide variety of common NLP tasks namely sentiment analysis, question classification, entailment, and question answering use only supervised word and character vectors to contextualize Word vectors. The paper uses a deep LSTM encoder from an attentional sequence-to-sequence model trained for machine translation. Their research portrays that adding these context vectors (CoVe) improves performance over using only unsupervised word and character vectors. For fine-grained sentiment analysis and entailment also, CoVe improves the performance of the baseline models to the state-of-the-art. Predictive Modelling A lot of research showcased at NIPS was focussed around improving the predictive capabilities of Neural Networks. Here is a quick look at the top presentations. Deep Ensembles for Predictive Uncertainty Estimation Bayesian Solutions are most frequently used in quantifying predictive uncertainty in Neural networks. However, these solutions can at times be computationally intensive. They also require significant modifications to the training pipeline. DeepMind researchers have proposed an alternative to Bayesian NNs in their paper “Simple and scalable predictive uncertainty estimation using deep ensembles”. Their proposed method is easy to implement, readily parallelizable requires very little hyperparameter tuning, and yields high-quality predictive uncertainty estimates. VAIN: Scaling Multi-agent Predictive Modelling Multi-agent predictive modeling predicts the behavior of large physical or social systems by an interaction between various agents. However, most approaches come at a prohibitive cost. For instance, Interaction Networks (INs) were not able to scale with the number of interactions in the system (typically quadratic or higher order in the number of agents). Facebook researchers have introduced VAIN, which is a simple attentional mechanism for multi-agent predictive modeling that scales linearly with the number of agents. They can achieve similar accuracy but at a much lower cost. You can read more about the mechanism in their paper “VAIN: Attentional Multi-agent Predictive Modeling” PredRNN: RNNs for Predictive Learning with ST-LSTM Another paper titled “PredRNN: Recurrent Neural Networks for Predictive Learning using Spatiotemporal LSTMs” showcased a new predictive recurrent neural network. This architecture is based on the idea that spatiotemporal predictive learning should memorize both spatial appearances and temporal variations in a unified memory pool. The core of this RNN is a new Spatiotemporal LSTM (ST-LSTM) unit that extracts and memorizes spatial and temporal representations simultaneously. Memory states are allowed to zigzag in two directions: across stacked RNN layers vertically and through all RNN states horizontally. PredRNN is a more general framework, that can be easily extended to other predictive learning tasks by integrating with other architectures. It achieved state-of-the-art prediction performance on three video prediction datasets. Recommendation Systems New researches were presented by Google and Microsoft to address the cold-start problem and to build robust and powerful of Recommendation systems. Off-Policy Evaluation For Slate Recommendation Microsoft researchers have studied and evaluated policies that recommend an ordered set of items in their paper “Off-Policy Evaluation For Slate Recommendation”. General recommendation approaches require large amounts of logged data to evaluate whole-page metrics that depend on multiple recommended items, which happens when showing ranked lists. The number of these possible lists is called as slates. Microsoft researchers have developed a technique for evaluating page-level metrics of such policies offline using logged past data, reducing the need for online A/B tests. Their method models the observed quality of the recommended set as an additive decomposition across items. It fits many realistic measures of quality and shows exponential savings in the amount of required data compared with other off-policy evaluation approaches. Meta-Learning on Cold-Start Recommendations Matrix Factorization techniques for product recommendations, although efficient, suffer from serious cold-start problems. The cold start problem concerns with the recommendations for users with no or few past history i.e new users. Providing recommendations to such users becomes a difficult problem for recommendation models because their learning and predictive ability are limited. Google researchers have come up with a meta-learning strategy to address item cold-start when new items arrive continuously. Their paper “A Meta-Learning Perspective on Cold-Start Recommendations for Items” has two deep neural network architectures that implement this meta-learning strategy. The first architecture learns a linear classifier whose weights are determined by the item history while the second architecture learns a neural network whose biases are instead adjusted. On evaluating this technique on the real-world problem of Tweet recommendation, the proposed techniques significantly beat the MF baseline. Image-to-Image Translation NIPS 2017 exhibited a new image-to-image translation system, a model to hide images within images, and use of feature transforms to improve universal style. Unsupervised Image-to-Image Translation Researchers at Nvidia have proposed an unsupervised image-to-image translation framework based on Coupled GANs. Unsupervised image-to-image translation learns a joint distribution of images in different domains by using images from the marginal distributions in individual domains. However, there exists an infinite set of joint distributions that can arrive from the given marginal distributions. So, one could infer nothing about the joint distribution from the marginal distributions, without additional assumptions. Their paper “Unsupervised Image-to-Image Translation Networks ” uses a shared-latent space assumption to address this issue. Their method presents high-quality image translation results on various challenging unsupervised image translation tasks, such as street scene image translation, animal image translation, and face image translation. Deep Steganography Steganography is commonly used to unobtrusively hide a small message within the noisy regions of a larger image. Google researchers in their paper “Hiding Images in Plain Sight: Deep Steganography” have demonstrated the successful application of deep learning to hiding images. They have placed a full-size color image within another image of the same size. They have also trained Deep neural networks to create the hiding and revealing processes and are designed to specifically work as a pair. Their approach compresses and distributes the secret image's representation across all of the available bits, instead of encoding the secret message within the least significant bits of the carrier image. This system is trained on images drawn randomly from the ImageNet database and works well on natural images. Improving Universal style transfer on images NIPS 2017 witnessed another paper aimed at improving the Universal Style Transfer. Universal style transfer is used for transferring arbitrary visual styles to content images. The paper “Universal Style Transfer via Feature Transforms” by Nvidia researchers highlight feature transforms, as a simple yet effective method to tackle the limitations of existing feed-forward methods for Universal Style Transfer, without training on any pre-defined styles. Existing feed-forward based methods are mainly limited by the inability of generalizing to unseen styles or compromised visual quality. The research paper embeds a pair of feature transforms, whitening and coloring, to an image reconstruction network. The whitening and coloring transform reflect a direct matching of feature covariance of the content image to a given style image. The algorithm can generate high-quality stylized images with comparisons to a number of recent methods. Key Takeaways from NIPS 2017 The Research papers covered in this and the previous post highlight that most organizations are at the forefront of machine learning and are actively exploring virtually all aspects of the field. Deep learning practices were also in trend. The conference was focussed on the current state and recent advances in Deep Learning. A lot of talks and presentations were about industry-ready neural networks suggesting a fast transition from research to industry. Researchers are also focusing on areas of language understanding, speech recognition, translation, visual processing, and prediction. Most of these techniques rely on using GANs as the backend. For live content coverage, you can visit NIPS’ Facebook page.

0
0
3563

article-image-indexing-and-performance-tuning

Packt

10 Oct 2014

40 min read

Indexing and Performance Tuning

Packt

10 Oct 2014

40 min read

0
0
3541

article-image-hadoop-and-hdinsight-heartbeat

Packt

30 Sep 2013

6 min read

Hadoop and HDInsight in a Heartbeat

Packt

30 Sep 2013

6 min read

(For more resources related to this topic, see here.) Apache Hadoop is the leading Big Data platform that allows to process large datasets efficiently and at low cost. Other Big Data 0platforms are MongoDB, Cassandra, and CouchDB. This section describes Apache Hadoop core concepts and its ecosystem. Core components The following image shows core Hadoop components: At the core, Hadoop has two key components: Hadoop Distributed File System (HDFS) Hadoop MapReduce (distributed computing for batch jobs) For example, say we need to store a large file of 1 TB in size and we only have some commodity servers each with limited storage. Hadoop Distributed File System can help here. We first install Hadoop, then we import the file, which gets split into several blocks that get distributed across all the nodes. Each block is replicated to ensure that there is redundancy. Now we are able to store and retrieve the 1 TB file. Now that we are able to save the large file, the next obvious need would be to process this large file and get something useful out of it, like a summary report. To process such a large file would be difficult and/or slow if handled sequentially. Hadoop MapReduce was designed to address this exact problem statement and process data in parallel fashion across several machines in a fault-tolerant mode. MapReduce programing models use simple key-value pairs for computation. One distinct feature of Hadoop in comparison to other cluster or grid solutions is that Hadoop relies on the "share nothing" architecture. This means when the MapReduce program runs, it will use the data local to the node, thereby reducing network I/O and improving performance. Another way to look at this is when running MapReduce, we bring the code to the location where the data resides. So the code moves and not the data. HDFS and MapReduce together make a powerful combination, and is the reason why there is so much interest and momentum with the Hadoop project. Hadoop cluster layout Each Hadoop cluster has three special master nodes (also known as servers): NameNode: This is the master for the distributed filesystem and maintains a metadata. This metadata has the listing of all the files and the location of each block of a file, which are stored across the various slaves (worker bees). Without a NameNode HDFS is not accessible. Secondary NameNode: This is an assistant to the NameNode. It communicates only with the NameNode to take snapshots of the HDFS metadata at intervals configured at cluster level. JobTracker: This is the master node for Hadoop MapReduce. It determines the execution plan of the MapReduce program, assigns it to various nodes, monitors all tasks, and ensures that the job is completed by automatically relaunching any task that fails. All other nodes of the Hadoop cluster are slaves and perform the following two functions: DataNode: Each node will host several chunks of files known as blocks. It communicates with the NameNode. TaskTracker: Each node will also serve as a slave to the JobTracker by performing a portion of the map or reduce task, as decided by the JobTracker. The following image shows a typical Apache Hadoop cluster: The Hadoop ecosystem As Hadoop's popularity has increased, several related projects have been created that simplify accessibility and manageability to Hadoop. I have organized them as per the stack, from top to bottom. The following image shows the Hadoop ecosystem: Data access The following software are typically used access mechanisms for Hadoop: Hive: It is a data warehouse infrastructure that provides SQL-like access on HDFS. This is suitable for the ad hoc queries that abstract MapReduce. Pig: It is a scripting language such as Python that abstracts MapReduce and is useful for data scientists. Mahout: It is used to build machine learning and recommendation engines. MS Excel 2013: With HDInsight, you can connect Excel to HDFS via Hive queries to analyze your data. Data processing The following are the key programming tools available for processing data in Hadoop: MapReduce: This is the Hadoop core component that allows distributed computation across all the TaskTrackers Oozie: It enables creation of workflow jobs to orchestrate Hive, Pig, and MapReduce tasks The Hadoop data store The following are the common data stores in Hadoop: HBase: It is the distributed and scalable NOSQL (Not only SQL) database that provides a low-latency option that can handle unstructured data HDFS: It is a Hadoop core component, which is the foundational distributed filesystem Management and integration The following are the management and integration software: Zookeeper: It is a high-performance coordination service for distributed applications to ensure high availability Hcatalog: It provides abstraction and interoperability across various data processing software such as Pig, MapReduce, and Hive Flume: Flume is distributed and reliable software for collecting data from various sources for Hadoop Sqoop: It is designed for transferring data between HDFS and any RDBMS Hadoop distributions Apache Hadoop is an open-source software and is repackaged and distributed by vendors offering enterprise support. The following is the listing of popular distributions: Amazon Elastic MapReduce (cloud, http://aws.amazon.com/elasticmapreduce/) Cloudera (http://www.cloudera.com/content/cloudera/en/home.html) EMC PivitolHD (http://gopivotal.com/) Hortonworks HDP (http://hortonworks.com/) MapR (http://mapr.com/) Microsoft HDInsight (cloud, http://www.windowsazure.com/) HDInsight distribution differentiator HDInsight is an enterprise-ready distribution of Hadoop that runs on Windows servers and on Azure HDInsight cloud service. It is 100 percent compatible with Apache Hadoop. HDInsight was developed in partnership with Hortonworks and Microsoft. Enterprises can now harness the power of Hadoop on Windows servers and Windows Azure cloud service. The following are the key differentiators for HDInsight distribution: Enterprise-ready Hadoop: HDInsight is backed by Microsoft support, and runs on standard Windows servers. IT teams can leverage Hadoop with Platform as a Service ( PaaS ) reducing the operations overhead. Analytics using Excel: With Excel integration, your business users can leverage data in Hadoop and analyze using PowerPivot. Integration with Active Directory: HDInsight makes Hadoop reliable and secure with its integration with Windows Active directory services. Integration with .NET and JavaScript: .NET developers can leverage the integration, and write map and reduce code using their familiar tools. Connectors to RDBMS: HDInsight has ODBC drivers to integrate with SQL Server and other relational databases. Scale using cloud offering: Azure HDInsight service enables customers to scale quickly as per the project needs and have seamless interface between HDFS and Azure storage vault. JavaScript console: It consists of easy-to-use JavaScript console for configuring, running, and post processing of Hadoop MapReduce jobs. Summary In this article, we reviewed the Apache Hadoop components and the ecosystem of projects that provide a cost-effective way to deal with Big Data problems. We then looked at how Microsoft HDInsight makes the Apache Hadoop solution better by simplified management, integration, development, and reporting. Resources for Article : Further resources on this subject: Making Big Data Work for Hadoop and Solr [Article] Understanding MapReduce [Article] Advanced Hadoop MapReduce Administration [Article]

0
0
3534

How-To Tutorials - Data

Working with Incanter Datasets

Writing Consumers

How to maintain Apache Mesos

Ten IPython essentials

Stream Grouping

Making 3D Visualizations

Data Science with R

About Cassandra

Creating Interactive Graphics and Animation

Integrating Solr: Ruby on Rails Integration

Trending Topics

Work Item Querying

Understanding Amazon Machine Learning Workflow [ Tutorial ]

Top Research papers showcased at NIPS 2017 - Part 2

Indexing and Performance Tuning

Hadoop and HDInsight in a Heartbeat

Create a Free Account To Continue Reading

Sign in to activate your 7-day free access