Data | Tech News, Tutorials & Expert Insights

article-image-how-to-use-standard-macro-in-workflows

21 Feb 2018

6 min read

How to use Standard Macro in Workflows

21 Feb 2018

[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Renato Baruti titled Learning Alteryx. In this book, you will learn how to perform self-analytics and create interactive dashboards using various tools in Alteryx.[/box] Today we will learn Standard Macro that will provide you with a foundation for building enhanced workflows. The csv file required for this tutorial is available to download here. Standard Macro Before getting into Standard Macro, let's define what a macro is. A macro is a collection of workflow tools that are grouped together into one tool. Using a range of different interface tools, a macro can be developed and used within a workflow. Any workflow can be turned into a macro and a repeatable element of a workflow can commonly be converted into a macro. There are a couple of ways you can turn your workflow into a Standard Macro. The first is to go to the canvas configuration pane and navigate to the Workflow tab. This is where you select what type of workflow you want. If you select Macro you should then have Standard Macro automatically selected. Now, when you save this workflow it will save as a macro. You’ll then be able to add it to another workflow and run the process created within the macro itself. The second method is just to add a Macro Input tool from the Interface tool section onto the canvas; the workflow will then automatically change to a Standard Macro. The following screenshot shows the selection of a Standard Macro, under the Workflow tab: Let's go through an example of creating and deploying a standard macro. Standard Macro Example #1: Create a macro that allows the user to input a number used as a multiplier. Use the multiplier for the DataValueAlt field. The following steps demonstrate this process: Step 1: Select the Macro Input tool from the Interface tool palette and add the tool onto the canvas. The workflow will automatically change to a Standard Macro. Step 2: Select Text Input and Edit Data option within the Macro Input tool configuration. Step 3: Create a field called Number and enter the values: 155, 243, 128, 352, and 357 in each row, as shown in the following image: Step 4: Rename the Input Name Input and set the Anchor Abbreviation as I as shown in the following image: Step 5: Select the Formula tool from the Preparation tool palette. Connect the Formula tool to the Macro Input tool. Step 6: Select the + Add Column option in the Select Column drop down within the Formula tool configuration. Name the field Result. Step 7: Add the following expression to the expression window: [Number]*0.50 Step 8: Select the Macro Output tool from the Interface tool palette and add the tool onto the canvas. Connect the Macro Output tool to the Formula tool. Step 9: Rename the Output Name Output and set the Anchor Abbreviation as O: The Standard Macro has now been created. It can be saved to use as multiplier, to calculate the five numbers added within the Macro Input tool to multiply 0.50. This is great; however, let's take it a step further to make it dynamic and flexible by allowing the user to enter a multiplier. For instance, currently the multiplier is set to 0.50, but what if a user wants to change that to 0.25 or 0.10 to determine the 25% or 10% value of a field. Let's continue building out the Standard Macro to make this possible. Step 1: Select the Text Box tool from the Interface tool palette and drag it onto the canvas. Connect the Text Box tool to the Formula tool on the lightning bolt (the macro indicator). The Action tool will automatically be added to the canvas, as this automatically updates the configuration of a workflow with values provided by interface questions when run as an app or macro. Step 2: Configure the Action tool that will automatically update the expression replaced by a specific field. Select Formula | FormulaFields | FormulaField | @expression - value="[Number]*0.50". Select the Replace a specific string: option and enter 0.50. This is where the automation happens, updating the 0.50 to any number the user enters. You will see how this happens in the following steps: Step 3: In the Enter the text or question to be displayed text box, within the Text Box tool configuration, enter: Please enter a number: Step 4: Save the workflow as Standard Macro.yxmc. The .yxmc file type indicates it's a macro related workflow, as shown in the following image: Step 5: Open a new workflow. Step 6: Select the Input Data tool from the In/Out tool palette and connect to the U.S. Chronic Disease Indicators.csv file. Step 7: Select the Select tool from the Preparation tool palette and drag it onto the canvas. Connect the Select tool to the Input Data tool. Step 8: Change the Data Type for the DataValueAlt field to Double. Step 9: Right-click on the canvas and select Insert | Macro | Standard Macro. Step 10: Connect the Standard Macro to the Select tool. Step 11: There will be Questions to select within the Standard Macro tool configuration. Select DataValueAlt (Double) as the Choose Field option and enter 0.25 in the Please enter a number text box: Step 12: Add a Browse tool to the Standard Macro tool. Step 13: Run the workflow: The goal for creating this Standard Macro was to allow the user to select what they would like the multiplier to be rather than a static number. Let's recap what has been created and deployed using a Standard Macro. First, Standard Macro.yxmc was developed using Interface tools. The Macro Input (I) was used to enter sample text data for the Number field. This Number field is what is used to multiply to what the given multiplier is - in this case, 0.50. This is the static number multiplier. The Formula tool was used to create the expression to conclude that the Number field will be multiplied by 0.50. The Macro Output (O) was used to output the macro so that it can be used in another workflow. The Text Box tool is where the question Please enter a number will be displayed, along with the Action tool that is used to update the specific value replaced. The current multiplier, 0.50, is replaced by 0.25, as identified in step 20, through a dynamic input by which the user can enter the multiplier. Notice that, in the Browse tool output, the Result field has been added, multiplying the values for the DataValueAlt field to the multiplier 0.25. Change the value in the macro to 0.10 and run the workflow. The Result field has been updated to now multiple the values for the DataValueAlt field to the multiplier 0.10. This is a great use case of a Standard Macro and demonstrates how versatile the Interface tools are. We learned about macros and their dynamic use within workflows. We saw how Standard Macro was developed to allow the end user to specify what they want the multiplier to be. This is a great way to implement the interactivity within a workflow. To know more about high-quality interactive dashboards and efficient self-service data analytics, do checkout this book Learning Alteryx.

0
0
6426

article-image-scripting-capabilities-elasticsearch

Packt

08 Jan 2016

19 min read

The scripting Capabilities of Elasticsearch

Packt

08 Jan 2016

19 min read

In this article by Rafał Kuć and Marek Rogozinski author of the book Elasticsearch Server - Third Edition, Elasticsearch has a few functionalities in which scripts can be used. Even though scripts seem to be a rather advanced topic, we will look at the possibilities offered by Elasticsearch. That's because scripts are priceless in certain situations. Elasticsearch can use several languages for scripting. When not explicitly declared, it assumes that Groovy (http://www.groovy-lang.org/) is used. Other languages available out of the box are the Lucene expression language and Mustache (https://mustache.github.io/). Of course, we can use plugins that will make Elasticsearch understand additional scripting languages such as JavaScript, Mvel, or Python. One thing worth mentioning is this: independently from the scripting language that we will choose, Elasticsearch exposes objects that we can use in our scripts. Let's start by briefly looking at what type of information we are allowed to use in our scripts. (For more resources related to this topic, see here.) Objects available during script execution During different operations, Elasticsearch allows us to use different objects in our scripts. To develop a script that fits our use case, we should be familiar with those objects. For example, during a search operation, the following objects are available: _doc (also available as doc): An instance of the org.elasticsearch.search.lookup.LeafDocLookup object. It gives us access to the current document found with the calculated score and field values. _source: An instance of the org.elasticsearch.search.lookup.SourceLookup object. It provides access to the source of the current document and the values defined in the source. _fields: An instance of the org.elasticsearch.search.lookup.LeafFieldsLookup object. It can be used to access the values of the document fields. On the other hand, during a document update operation, the variables mentioned above are not accessible. Elasticsearch exposes only the ctx object with the _source property, which provides access to the document currently processed in the update request. As we have previously seen, several methods are mentioned in the context of document fields and their values. Let's now look at the examples of how to get the value for a particular field using the previously mentioned object available during search operations. In the brackets, you can see what Elasticsearch will return for one of our example documents from the library index (we will use the document with identifier 4): _doc.title.value (and) _source.title (crime and punishment) _fields.title.value (null) A bit confusing, isn't it? During indexing, the original document is, by default, stored in the _source field. Of course, by default, all fields are present in that _source field. In addition to this, the document is parsed, and every field may be stored in an index if it is marked as stored (that is, if the store property is set to true; otherwise, by default, the fields are not stored). Finally, the field value may be configured as indexed. This means that the field value is analyzed and placed in the index. To sum up, one field may land in an Elasticsearch index in the following ways: As part of the _source document As a stored and unparsed original value As an indexed value that is processed by an analyzer In scripts, we have access to all of these field representations. The only exception is the update operation, which—as we've mentioned before—gives us access to only the _source document as part of the ctx variable. You may wonder which version you should use. Well, if we want access to the processed form, the answer would be simple—use the _doc object. What about _source and _fields? In most cases, _source is a good choice. It is usually fast and needs fewer disk operations than reading the original field values from the index. This is especially true when you need to read values of multiple fields in your scripts—fetching a single _source field is faster than fetching multiple independent fields from the index. Script types Elasticsearch allows us to use scripts in three different ways: Inline scripts: The source of the script is directly defined in the query In-file scripts: The source is defined in the external file placed in the Elasticsearch config/scripts directory As a document in the dedicated index: The source of the script is defined as a document in a special index available by using the /_scripts API endpoint Choosing the way of defining scripts depends on several factors. If you have scripts that you will use in many different queries, the file or the dedicated index seems to be the best solution. "Scripts in the file" is probably less convenient, but it is preferred from the security point of view—they can't be overwritten and injected into your query, which might have caused a security breach. In-file scripts This is the only way that is turned on by default in Elasticsearch. The idea is that every script used by the queries is defined in its own file placed in the config/scripts directory. We will now look at this method of using scripts. Let's create an example file called tag_sort.groovy and place it in the config/scripts directory of our Elasticsearch instance (or instances if we are running a cluster). The content of the mentioned file should look like this: _doc.tags.values.size() > 0 ? _doc.tags.values[0] : 'u19999' After a few seconds, Elasticsearch should automatically load a new file. You should see something like this in the Elasticsearch logs: [2015-08-30 13:14:33,005][INFO ][script ] [Alex Wilder] compiling script file [/Users/negativ/Developer/ES/es-current/config/scripts/tag_sort.groovy] If you have a multinode cluster, you have to make sure that the script is available on every node. Now we are ready to use this script in our queries. A modified query that uses our script stored in the file looks as follows: curl -XGET 'localhost:9200/library/_search?pretty' -d '{ "query" : { "match_all" : { } }, "sort" : { "_script" : { "script" : { "file" : "tag_sort" }, "type" : "string", "order" : "asc" } } }' First, we will see the next possible way of defining a script inline. Inline scripts Inline scripts are a more convenient way of using scripts, especially for constantly changing queries or ad-hoc queries. The main drawback of such an approach is security. If we do this, we allow users to run any kind of query, including any kind of script that can be used by attackers. Such an attack can execute arbitrary code on the server running Elasticsearch with rights equal to the ones given to the user who is running Elasticsearch. In the worst-case scenario, an attacker could use security holes to gain superuser rights. This is why inline scripts are disabled by default. After careful consideration, you can enable them by adding this to the elasticsearch.yml file: script.inline: on After allowing the inline script to be executed, we can run a query that looks as follows: curl -XGET 'localhost:9200/library/_search?pretty' -d '{ "query" : { "match_all" : { } }, "sort" : { "_script" : { "script" : { "inline" : "_doc.tags.values.size() > 0 ? _doc.tags.values[0] : "u19999"" }, "type" : "string", "order" : "asc" } } }' Indexed scripts The last option for defining scripts is to store them in the dedicated Elasticsearch index. From the same security reasons, dynamic execution of indexed scripts is by default disabled. To enable indexed scripts, we have to add a configuration similar option to the one that we've added to be able to use inline scripts. We need to add the following line to the elasticsearch.yml file: script.indexed: on After adding the above property to all the nodes and restarting the cluster, we will be ready to start using indexed scripts. Elasticsearch provides additional dedicated endpoints for this purpose. Let's store our script: curl -XPOST 'localhost:9200/_scripts/groovy/tag_sort' -d '{ "script" : "_doc.tags.values.size() > 0 ? _doc.tags.values[0] : "u19999"" }' The script is ready, but let's discuss what we just did. We sent an HTTP POST request to the special _scripts REST endpoint. We also specified the language of the script (groovy in our case) and the name of the script (tag_sort). The body of the request is the script itself. We can now move on to the query, which looks as follows: curl -XGET 'localhost:9200/library/_search?pretty' -d '{ "query" : { "match_all" : { } }, "sort" : { "_script" : { "script" : { "id" : "tag_sort" }, "type" : "string", "order" : "asc" } } }' As we can see, this query is practically identical to the query used with the script defined in a file. The only difference is the id parameter instead of file. Querying with scripts If we look at any request made to Elasticsearch that uses scripts, we will notice some similar properties, which are as follows: script: The property that wraps the script definition. inline: The property holding the code of the script itself. id – This is the property that defines the identifier of the indexed script. file: The filename (without extension) with the script definition when the in file script is used. lang: This is the property defining the script language. If it is omitted, Elasticsearch assumes groovy. params: This is an object containing parameters and their values. Every defined parameter can be used inside the script by specifying that parameter name. Parameters allow us to write cleaner code that will be executed in a more efficient manner. Scripts that use parameters are executed faster than code with embedded constants because of caching. Scripting with parameters As our scripts become more and more complicated, the need for creating multiple, almost identical scripts can appear. Those scripts usually differ in the values used, with the logic behind them being exactly the same. In our simple example, we have used a hardcoded value to mark documents with an empty tags list. Let's change this to allow the definition of a hardcoded value. Let's use in the file script definition and create the tag_sort_with_param.groovy file with the following contents: _doc.tags.values.size() > 0 ? _doc.tags.values[0] : tvalue The only change we've made is the introduction of a parameter named tvalue, which can be set in the query in the following way: curl -XGET 'localhost:9200/library/_search?pretty' -d '{ "query" : { "match_all" : { } }, "sort" : { "_script" : { "script" : { "file" : "tag_sort_with_param", "params" : { "tvalue" : "000" } }, "type" : "string", "order" : "asc" } } }' The params section defines all the script parameters. In our simple example, we've only used a single parameter, but of course, we can have multiple parameters in a single query. Script languages The default language for scripting is Groovy. However, you are not limited to only a single scripting language when using Elasticsearch. In fact, if you would like to, you can even use Java to write your scripts. In addition to that, the community behind Elasticsearch provides support of more languages as plugins. So, if you are willing to install plugins, you can extend the list of scripting languages that Elasticsearch supports even further. You may wonder why you should even consider using a scripting language other than the default Groovy. The first reason is your own preferences. If you are a Python enthusiast, you are probably now thinking about how to use Python for your Elasticsearch scripts. The other reason could be security. When we talked about inline scripts, we told you that inline scripts are turned off by default. This is not exactly true for all the scripting languages available out of the box. Inline scripts are disabled by default when using Grooby, but you can use Lucene expressions and Mustache without any issues. This is because those languages are sandboxed, which means that security-sensitive functions are turned off. And of course, the last factor when choosing the language is performance. Theoretically, native scripts (in Java) should have better performance than others, but you should remember that the difference can be insignificant. You should always consider the cost of development and measure the performance. Using something other than embedded languages Using Groovy for scripting is a simple and sufficient solution for most use cases. However, you may have a different preference and you would like to use something different, such as JavaScript, Python, or Mvel. For now, we'll just run the following command from the Elasticsearch directory: bin/plugin install elasticsearch/elasticsearch-lang-javascript/2.7.0 The preceding command will install a plugin that will allow the use of JavaScript as the scripting language. The only change we should make in the request is putting in additional information about the language we are using for scripting. And of course, we have to modify the script itself to correctly use the new language. Look at the following example: curl -XGET 'localhost:9200/library/_search?pretty' -d '{ "query" : { "match_all" : { } }, "sort" : { "_script" : { "script" : { "inline" : "_doc.tags.values.length > 0 ? _doc.tags.values[0] :"u19999";", "lang" : "javascript" }, "type" : "string", "order" : "asc" } } }' As you can see, we've used JavaScript for scripting instead of the default Groovy. The lang parameter informs Elasticsearch about the language being used. Using native code If the scripts are too slow or if you don't like scripting languages, Elasticsearch allows you to write Java classes and use them instead of scripts. There are two possible ways of adding native scripts: adding classes that define scripts to the Elasticsearch classpath, or adding a script as a functionality provided by plugin. We will describe the second solution as it is more elegant. The factory implementation We need to implement at least two classes to create a new native script. The first one is a factory for our script. For now, let's focus on it. The following sample code illustrates the factory for our script: package pl.solr.elasticsearch.examples.scripts; import java.util.Map; import org.elasticsearch.common.Nullable; import org.elasticsearch.script.ExecutableScript; import org.elasticsearch.script.NativeScriptFactory; public class HashCodeSortNativeScriptFactory implements NativeScriptFactory { @Override public ExecutableScript newScript(@Nullable Map<String, Object> params) { return new HashCodeSortScript(params); } @Override public boolean needsScores() { return false; } } This class should implement the org.elasticsearch.script.NativeScriptFactory class. The interface forces us to implement two methods. The newScript() method takes the parameters defined in the API call and returns an instance of our script. Finally, needsScores() informs Elasticsearch if we want to use scoring and that it should be calculated. Implementing the native script Now let's look at the implementation of our script. The idea is simple—our script will be used for sorting. The documents will be ordered by the hashCode() value of the chosen field. Documents without a value in the defined field will be first on the results list. We know that the logic doesn't make much sense, but it is good for presentation as it is simple. The source code for our native script looks as follows: package pl.solr.elasticsearch.examples.scripts; import java.util.Map; import org.elasticsearch.script.AbstractSearchScript; public class HashCodeSortScript extends AbstractSearchScript { private String field = "name"; public HashCodeSortScript(Map<String, Object> params) { if (params != null && params.containsKey("field")) { this.field = params.get("field").toString(); } } @Override public Object run() { Object value = source().get(field); if (value != null) { return value.hashCode(); } return 0; } } First of all, our class inherits from the org.elasticsearch.script.AbstractSearchScript class and implements the run() method. This is where we get the appropriate values from the current document, process it according to our strange logic, and return the result. You may notice the source() call. Yes, it is exactly the same _source parameter that we met in the non-native scripts. The doc() and fields() methods are also available, and they follow the same logic that we described earlier. The thing worth looking at is how we've used the parameters. We assume that a user can put the field parameter, telling us which document field will be used for manipulation. We also provide a default value for this parameter. The plugin definition We said that we will install our script as a part of a plugin. This is why we need additional files. The first file is the plugin initialization class, where we can tell Elasticsearch about our new script: package pl.solr.elasticsearch.examples.scripts; import org.elasticsearch.plugins.Plugin; import org.elasticsearch.script.ScriptModule; public class ScriptPlugin extends Plugin { @Override public String description() { return "The example of native sort script"; } @Override public String name() { return "naive-sort-plugin"; } public void onModule(final ScriptModule module) { module.registerScript("native_sort", HashCodeSortNativeScriptFactory.class); } } The implementation is easy. The description() and name() methods are only for information purposes, so let's focus on the onModule() method. In our case, we need access to script module—the Elasticsearch service connected with scripts and scripting languages. This is why we define onModule() with one ScriptModule argument. Thanks to Elasticsearch magic, we can use this module and register our script so that it can be found by the engine. We have used the registerScript() method, which takes the script name and the previously defined factory class. The second file needed is a plugin descriptor file: plugin-descriptor.properties. It defines the constants used by the Elasticsearch plugin subsystem. Without thinking more, let's look at the contents of this file: jvm=true classname=pl.solr.elasticsearch.examples.scripts.ScriptPlugin elasticsearch.version=2.0.0-beta2-SNAPSHOT version=0.0.1-SNAPSHOT name=native_script description=Example Native Scripts java.version=1.7 The appropriate lines have the following meaning: jvm: This tells Elasticsearch that our file contains Java code classname: This describes the main class with the plugin definition elasticsearch.version and java.version: They tell about the Elasticsearch and Java versions needed for our plugin name and description: These are an informative name and a short description of our plugin And that's it! We have all the files needed to fire our script. Note that now it is quite convenient to add new scripts and pack them as a single plugin. Installing a plugin Now it's time to install our native script embedded in the plugin. After packing the compiled classes as a JAR archive, we should put it into the Elasticsearch plugins/native-script directory. The native-script part is a root directory for our plugin and you may name it as you wish. In this directory, you also need the prepared plugin-descriptor.properties file. This makes our plugin visible to Elasticsearch. Running the script After restarting Elasticsearch (or the entire cluster if you are running more than a single node), we can start sending the queries that use our native script. For example, we will send a query that uses our previously indexed data from the library index. This example query looks as follows: curl -XGET 'localhost:9200/library/_search?pretty' -d '{ "query" : { "match_all" : { } }, "sort" : { "_script" : { "script" : { "script" : "native_sort", "lang" : "native", "params" : { "field" : "otitle" } }, "type" : "string", "order" : "asc" } } }' Note the params part of the query. In this call, we want to sort on the otitle field. We provide the script name as native_sort and the script language as native. This is required. If everything goes well, we should see our results sorted by our custom sort logic. If we look at the response from Elasticsearch, we will see that documents without the otitle field are at the first few positions of the results list and their sort value is 0. Summary In this article, we focused on querying, but not about the matching part of it—mostly about scoring. You learned how Apache Lucene TF/IDF scoring works. We saw the scripting capabilities of Elasticsearch and handled multilingual data. We also used boosting to influence how scores of returned documents were calculated and we used synonyms. Finally, we used explain information to see how document scores were calculated by query. Resources for Article: Further resources on this subject: An Introduction to Kibana [article] Indexing the Data [article] Low-Level Index Control [article]

0
0
6356

article-image-basic-concepts-and-architecture-cassandra

Packt

21 Nov 2013

7 min read

Basic Concepts and Architecture of Cassandra

Packt

21 Nov 2013

7 min read

(For more resources related to this topic, see here.) CAP theorem If you want to understand Cassandra, you first need to understand the CAP theorem. The CAP theorem (published by Eric Brewer at the University of California, Berkeley) basically states that it is impossible for a distributed system to provide you with all of the following three guarantees: Consistency: Updates to the state of the system are seen by all the clients simultaneously Availability: Guarantee of the system to be available for every valid request Partition tolerance: The system continues to operate despite arbitrary message loss or network partition Cassandra provides users with stronger availability and partition tolerance with tunable consistency tradeoff; the client, while writing to and/or reading from Cassandra, can pass a consistency level that drives the consistency requirements for the requested operations. BigTable / Log-structured data model In a BigTable data model, the primary key and column names are mapped with their respective bytes of value to form a multidimensional map. Each table has multiple dimensions. Timestamp is one such dimension that allows the table to version the data and is also used for internal garbage collection (of deleted data). The next figure shows the data structure in a visual context; the row key serves as the identifier of the column that follows it, and the column name and value are stored in contiguous blocks: It is important to note that every row has the column names stored along with the values, allowing the schema to be dynamic. Column families Columns are grouped into sets called column families, which can be addressed through a row key (primary key). All the data stored in a column family is of the same type. A column family must be created before any data can be stored; any column key within the family can be used. It is our intent that the number of distinct column families in a keyspace should be small, and that the families should rarely change during an operation. In contrast, a column family may have an unbounded number of columns. Both disk and memory accounting are performed at the column family level. Keyspace A keyspace is a group of column families; replication strategies and ACLs are performed at the keyspace level. If you are familiar with traditional RDBMS, you can consider the keyspace as an alternative name for the schema and the column family as an alternative name for tables. Sorted String Table (SSTable) An SSTable provides a persistent file format for Cassandra; it is an ordered immutable storage structure from rows of columns (name/value pairs). Operations are provided to look up the value associated with a specific key and to iterate over all the column names and value pairs within a specified key range. Internally, each SSTable contains a sequence of row keys and a set of column key/value pairs. There is an index and the start location of the row key in the index file, which is stored separately. The index summary is loaded into the memory when the SSTable is opened in order to optimize the amount of memory needed for the index. A lookup for actual rows can be performed with a single disk seek and by scanning sequentially for the data. Memtable A memtable is a memory location where data is written to during update or delete operations. A memtable is a temporary location and will be flushed to the disk once it is full to form an SSTable. Basically, an update or a write operation to Cassandra is a sequential write to the commit log in the disk and a memory update; hence, writes are as fast as writing to memory. Once the memtables are full, they are flushed to the disk, forming new SSTables: Reads in Cassandra will merge the data from different SSTables and the data in memtables. Reads should always be requested with a row key (primary key) with the exception of a key range scan. When multiple updates are applied to the same column, Cassandra uses client-provided timestamps to resolve conflicts. Delete operations to a column work a little differently; because SSTables are immutable, Cassandra writes the tombstone to avoid random writes. A tombstone is a special value written to Cassandra instead of removing the data immediately. The tombstone can then be sent to nodes that did not get the initial remove request, and can be removed during GC. Compaction To bound the number of SSTable files that must be consulted on reads and to reclaim the space taken by unused data, Cassandra performs compactions. In a nutshell, compaction compacts n (the configurable number of SSTables) into one big SSTable. They start out being the same size as the memtables. Therefore, the sizes of the SSTables are exponentially bigger when they grow older. Partitioning and replication Dynamo style As mentioned previously, the partitioner and replication scheme is motivated by the Dynamo paper; let's talk about it in detail. Gossip protocol Cassandra is a peer-to-peer system with no single point of failure; the cluster topology information is communicated via the Gossip protocol. The Gossip protocol is similar to real-world gossip, where a node (say B) tells a few of its peers in the cluster what it knows about the state of a node (say A). Those nodes tell a few other nodes about A, and over a period of time, all the nodes know about A. Distributed hash table The key feature of Cassandra is the ability to scale incrementally. This includes the ability to dynamically partition the data over a set of nodes in the cluster. Cassandra partitions data across the cluster using consistent hashing and randomly distributes the rows over the network using the hash of the row key. When a node joins the ring, it is assigned a token that advocates where the node has to be placed in the ring: Now consider a case where the replication factor is 3; clients randomly write or read from a coordinator (every node in the system can act as a coordinator and a data node) in the cluster. The node calculates a hash of the row key and provides the coordinator enough information to write to the right node in the ring. The coordinator also looks at the replication factor and writes to the neighboring nodes in the ring order. Eventual consistency Given a sufficient period of time over which no changes are sent, all updates can be expected to propagate through the system and the replicas created will be consistent. Cassandra supports both the eventual consistency model and strong consistency model, which can be controlled from the client while performing an operation. Cassandra supports various consistency levels while writing or reading data. The consistency level drives the number data replicas the coordinator has to contact to get the data before acknowledging the clients. If W + R > Replication Factor, where W is the number of nodes to block on write and R the number to block on reads, the clients will see a strong consistency behavior: ONE: R/W at least one node TWO: R/W at least two nodes QUORUM: R/W from at least floor (N/2) + 1, where N is the replication factor When nodes are down for maintenance, Cassandra will store hints for updates performed on that node, which can be delivered back when the node is available in the future. To make data consistent, Cassandra relies on hinted handoffs, read repairs, and anti-entropy repairs. Summary In this article, we have discussed basic concepts and basic building blocks, including the motivation in building a new datastore solution. Resources for Article: Further resources on this subject: Apache Cassandra: Libraries and Applications [Article] About Cassandra [Article] Quick start – Creating your first Java application [Article]

0
0
6301

article-image-building-search-geo-locator-elasticsearch-and-spark

Packt

31 Jan 2017

12 min read

Building A Search Geo Locator with Elasticsearch and Spark

Packt

31 Jan 2017

12 min read

0
0
6282

Packt

15 Oct 2009

4 min read

Deployment of Reports with BIRT

Packt

15 Oct 2009

4 min read

Everything in this article uses utilities from the BIRT Runtime installation package, available from the BIRT homepage at http://www.eclipse.org/birt. BIRT Viewer The BIRT Viewer is a J2EE application that is designed to demonstrate how to implement the Report Engine API to execute reports in an online web application. For most basic uses—such as for small to medium size Intranet applications—this is an appropriate approach. The point to keep in mind about the BIRT Web Viewer is that it is an example application. It can be used as a baseline for more sophisticated web applications that will implement the BIRT Report Engine API. Installation of the BIRT Viewer is documented at a number of places. The Eclipse BIRT website has some great tutorials at: http://www.eclipse.org/birt/phoenix/deploy/viewerSetup.php http://wiki.eclipse.org/BIRT/FAQ/Deployment This is also documented on my website in a series of articles introducing people to BIRT: http://digiassn.blogspot.com/2005/10/birt-report-server-pt-2.html I won't go into the details about installing Apache Tomcat as this is covered in depth in other locations, but I will cover how to install the Viewer in a Tomcat environment. For the most part these instructions can be used in other J2EE containers, such as WebSphere. In some cases a WAR package is used instead. I prefer Tomcat because it is a widely used open-source J2EE environment. Under the BIRT Runtime package is a folder containing an example Web Viewer application. The Web Viewer is a useful application as you require basic report viewing capabilities, such as parameter passing, pagination, and export capabilities to formats such as Word, Excel, RTF, and CSV. For this example, I have Apache Tomcat 5.5 installed into a folder at C:apache-tomcat-5.5.25. To install the Web Viewer, I simply need to copy the WebViewerExample folder from the BIRT Runtime to the web application folder at C:apache-tomcat-5.5.25webapps. Accessing the BIRT Web Viewer is as simple as calling the WebViewerExample Context. When copying the WebViewerExample folder, you can rename this folder to anything you want. Obviously WebViewerExample is not a good name for an online web application. So in the following screenshot, I renamed the WebViewerExample folder to birtViewer, and am accessing the BIRT Web Viewer test report. Installing Reports into the Web Viewer Once the BIRT Viewer is set up, Deploying reports is as simple as copying the report design files, Libraries, or report documents into the application's Context, and calling it with the appropriate URL parameters. For example, we will install the reports from the Classic Cars – With Library folder into the BIRT Web Viewer at birtViewer. In order for these reports to work, all dependent Libraries need to be installed with the reports. In the case of the example application, we currently have the report folder set to the Root of the web application folder. Accessing Reports in the Web Viewer Accessing reports is as simple as passing the correct parameters to the Web Viewer. In the BIRT Web Viewer, there are seven servlets that you can call to run reports, which are as follows: frameset run preview download parameter document output Out of these, you will only need frameset and run as the other servlets are for Engine-related purposes, such as the preview for the Eclipse designer, the parameter Dialog, and the download of report documents. Out of the these two servlets, frameset is the one that is typically used for user interaction with reports, as it provides the pagination options, parameter Dialogs, table of contents viewing, and export and print Dialogs. The run servlet only provides report output. There are a few URL parameters for the BIRT Web Viewer, such as: __format : which is the output format, either HTML or PDF. __isnull: which sets a Report Parameter to null, parameter name as a value. __locale: which is the reports locale. __report: which is the report design file to run. __document: which is the report document file to open. Any remaining URL parameter will be treated as a Report Parameter. In the following image, I am running the Employee_Sales_Percentage.rptdesign file with the startDate and endDate parameters set.

0
0
6279

article-image-iteration-and-searching-keys

Packt

13 Nov 2013

7 min read

Iteration and Searching Keys

Packt

13 Nov 2013

7 min read

(For more resources related to this topic, see here.) Introducing Sample04 to show you loops and searches Sample04 uses the same LevelDbHelpers.h as before. Please download the entire sample and look at main04.cpp to see the code in context. Running Sample04 starts by printing the output from the entire database, as shown in the following screenshot: Console output of listing keys Creating test records with a loop The test data being used here was created with a simple loop and forms a linked list as well. It is explained in more detail in the Simple Relational Style section. The loop creating the test data uses the new C++11 range-based for style of the loop: vector<string> words {"Packt", "Packer", "Unpack", "packing","Packt2", "Alpacca"}; stringprevKey; WriteOptionssyncW; syncW.sync = true; WriteBatchwb; for (auto key : words) { wb.Put(key, prevKey + "tsome more content"); prevKey = key; } assert(db->Write(syncW, &wb).ok() ); Note how we're using a string to hang onto the prevKey. There may be a temptation to use a Slice here to refer to the previous value of key, but remember the warning about a Slice only having a data pointer. This would be a classic bug introduced with a Slice pointing to a value that can be changed underneath it! We're adding all the keys using a WriteBatch not just for consistency, but also so that the storage engine knows it's getting a bunch of updates in one go and can optimize the file writing. I will be using the term Record regularly from now on. It's easier to say than Key-value Pair and is also indicative of the richer, multi-value data we're storing. Stepping through all the records with iterators The model for multiple record reading in LevelDB is a simple iteration. Find a starting point and then step forwards or backwards. This is done with an Iterator object that manages the order and starting point of your stepping through keys and values. You call methods on Iterator to choose where to start, to step and to get back the key and value. Each Iterator gets a consistent snapshot of the database, ignoring updates during iteration. Create a new Iterator to see changes. If you have used declarative database APIs such as SQL-based databases, you would be used to performing a query and then operating on the results. Many of these APIs and older, record-oriented databases have a concept of a cursor which maintains the current position in the results which you can only move forward. Some of them allow you to move the cursor to the previous records. Iterating through individual records may seem clunky and old-fashioned if you are used to getting collections from servers. However, remember LevelDB is a local database. Each step doesn't represent a network operation! The iterable cursor approach is all that LevelDB offers, called an Iterator. If you want some way of mapping a collected set of results directly to a listbox or other containers, you will have to implement it on top of the Iterator, as we will see later. Iterating forwards, we just get an Iterator from our database and jump to the first record with SeekToFirst(): Iterator* idb = db->NewIterator(ropt); for (idb->SeekToFirst(); idb->Valid(); idb->Next()) cout<<idb->key() <<endl; Going backwards is very similar, but inherently less efficient as a storage trade-off: for (idb->SeekToLast(); idb->Valid(); idb->Prev()) cout<<idb->key() <<endl; If you wanted to see the value as well as the keys, just use the value() method on the iterator (the test data in Sample04 would make it look a bit confusing so it isn't being done here): cout<<idb->key() << " " <<idb->value() <<endl; Unlike some other programming iterators, there's no concept of a special forward or backward iterator and no obligation to keep going in the same direction. Consider searching an HR database for the ten highest-paid managers. With a key of Job+Salary, you would iterate through a range until you know you have hit the end of the managers, then iterate backwards to get the last ten. An iterator is created by NewIterator(), so you have to remember to delete it or it will leak memory. Iteration is over a consistent snapshot of the data, and any data changes through Put, Get, or Delete operations won't show until another NewIterator() is created. Searching for ranges of keys The second half of the console output is from our examples of iterating through partial keys, which are case-sensitive by default, with the default BytewiseComparator: Console output of searches As we've seen many times, the Get function looks for an exact match for a key. However, if you have an Iterator, you can use Seek and it will jump to the first key that either matches exactly or is immediately after the partial key you specify. If we are just looking for keys with a common prefix, the optimal comparison is using the starts_with method of the Slice class: Void listKeysStarting(Iterator* idb, const Slice& prefix) { cout<< "List all keys starting with " <<prefix.ToString() <<endl; for (idb->Seek(prefix); idb-<Valid() &&idb->key().starts_with(prefix); idb-<Next()) cout<<idb->key() <<endl; } Going backwards is a little bit more complicated. We use a key that is guaranteed to fail. You could think of it as being between the last key starting with our prefix and the next key out of the desired range. When we Seek to that key, we need to step once to the previous key. If that's valid and matching, it's the last key in our range: Void listBackwardsKeysStarting(Iterator* idb, const Slice& prefix) { cout<<"List all keys starting with " <<prefix.ToString() << " backwards " <<endl; const string keyAfter = prefix.ToString() + "xFF"; idb->Seek(keyAfter); if (idb->Valid()) idb->Prev(); // step to last key with actual prefix else // no key just after our range, but idb->SeekToLast(); // maybe the last key matches? for(;idb->Valid() &&idb->key().starts_with(prefix); idb->Prev()) cout<<idb->key() <<endl; } What if you want to get keys within a range? For the first time, I disagree with the documentation included with LevelDB. Their iteration example shows a similar loop to that shown in the following code, but checks the key values with idb->key().ToString() < limit. That is a more expensive way to iterate keys as it's generating a temporary string object for every key being checked, which is expensive if there were thousands in the range: Void listKeys Between(Iterator* idb, const Slice&startKey, const Slice&endKey) { cout<< "List all keys >= " <<startKey.ToString() << " and < " <<endKey.ToString() <<endl; for (idb->Seek(startKey); idb->Valid() &&idb->key().compare(endKey) < 0; idb->Next()) cout<<idb->key() <<endl; } We can use another built-in method of Slice; the compare() method, which returns a result <0, 0, or >0 to indicate if Slice is less than, equal to, or greater than the other Slice it is being compared to. This is the same semantics as the standard C memcpy. The code shown in the previous snippet will find keys that are the same, or after the startKey and are before the endKey. If you want the range to include the endKey, change the comparison to compare(endKey) <= 0. Summary In this article, we learned the concept of an iterator in LevelDB as a way to step through records sorted by their keys. The database became far more useful with searches to get the starting point for the iterator, and samples showing how to efficiently check keys as you step through a range. Resources for Article : Further resources on this subject: New iPad Features in iOS 6 [Article] Securing data at the cell level (Intermediate) [Article] Python Data Persistence using MySQL Part III: Building Python Data Structures Upon the Underlying Database Data [Article]

0
0
6278

article-image-creating-network-graphs-gephi

Packt

08 Oct 2013

10 min read

Creating Network Graphs with Gephi

Packt

08 Oct 2013

10 min read

(For more resources related to this topic, see here.) Basic network graph terminology Network graphs are essentially based on the construct of nodes and edges. Nodes represent points or entities within the data, while edges refer to the connections or lines between nodes. Individual nodes might be students in a school, or schools within an educational system, or perhaps agencies within a government structure. Individual nodes may be represented through equal sizes, but can also be depicted as smaller or larger based on the magnitude of a selected measure. For example, a node with many connections may be portrayed as far larger and thus more influential than a sparsely connected node. This approach will provide viewers with a visual cue that shows them where the highest (and lowest) levels of activity occur within a graph. Nodes will generally be positioned based on the similarity of their individual connections, leading to clusters of individual nodes within a larger network. In most network algorithms, nodes with higher levels of connections will also tend to be positioned near the center of the graph, while those with few connections will move toward the perimeter of the display. Edges are the connections between nodes, and may be displayed as undirected or directed paths. Undirected relationships indicate a connection that flows in both directions, while a directed relationship moves in a single direction that always originates in one node and moves toward another node. Undirected connections will tend to predominate in most cases, such as in social networks where participant activity flows in both directions. On occasion, we will see directed connections, as in the case of some transportation or technology networks where there are connections that flow in a single direction. Edges may also be weighted, to show the strength of the connection between nodes. In the case where University A has performed research with both University B and University C, the strength (width) of the edge will show viewers where the stronger relationship exists. If A and B have combined for three projects, while A and C have collaborated on 9 projects, we should weight the A to C connection three times that of the A to B connection. Another commonly used term is neighbors, which is nothing more than a node that is directly connected to a second node. Neighbors can be stated to be one degree apart. Degrees is the term used to refer to the number of connections flowing into (or away from) a node (also known as Degree Centrality), as well as to the number of connections required to connect to another node via the shortest possible path. In complex graphs, you may find nodes that are four, five, or even more degrees away from a distant node, and in some cases two nodes may not be connected at all. Now that you have a very basic understanding of network graph theory, let's learn about some of the common network graph algorithms. Common network graph algorithms Before we introduce you to some specific graph algorithms, we'll briefly discuss some of the theory behind network graphs and introduce you to a few of the terms you will frequently encounter. Network graphs are drawn through positioning nodes and their respective connections relative to one another. In the case of a graph with 8 or 10 nodes, this is a rather simple exercise, and could probably be drawn rather accurately without the help of complex methodologies. However, in the typical case where we have hundreds of nodes with thousands of edges, the task becomes far more complex. Some of the more prominent graph classes in Gephi include the following: Force-directed algorithms refer to a category of graphs that position elements based on the principles of attraction, repulsion, and gravity Circular algorithms position graph elements around the perimeter of one or more circles, and may allow the user to dictate the order of the elements based on data properties Dual circle layouts position a subset of nodes in the interior of the graph with the remaining nodes around the diameter, similar to a single circular graph Concentric layouts arrange the graph nodes using an approximately circular graph design, with less connected nodes at the perimeter of the graph and highly connected nodes typically near the center Radial axis layouts provide the user with the ability to determine some portion of the graph layout by defining graph attributes The type of graph you select may well be dictated by the sort of results you seek. If you wish to feature certain groups within your dataset, one of the layouts that allows you to segment the graph by groups will provide a potentially quick solution to your needs. In this instance, one of the circular or radial axis graphs may prove ideal. On the other hand, if you are attempting to discover relationships in a complex new dataset, one of the several available Force-directed layouts is likely a better choice. These algorithms will rely on the values in your dataset to determine the positioning within the graph. When choosing one of these approaches, please note that there will often be an extensive runtime to calculate the graph layout, especially as the data becomes more complex. Even on a powerful computer, examples may run for minutes or hours in an attempt to fully optimize the graph. Fortunately, you will have the ability in Gephi to stop these algorithms at any given point, and you will still have a very reasonable, albeit imperfect graph. In the next section, we'll look at a few of the standard layouts that are part of the Gephi base package. Standard network graph layouts Now that you are somewhat familiar with the types of layout algorithms, we'll take a look at what Gephi offers within the Layout tab. We'll begin with some common Force-directed approaches, and then examine some of the other choices. One of the best known force algorithms is Force Atlas, which in Gephi provides users with multiple options for drawing the graph. Foremost among these settings are Repulsion, Attraction, and Gravity settings. Repulsion strength adjustments will make individual nodes either more or less sensitive to other nodes they differ from. A higher repulsion level, for example, will push these nodes further apart. Conversely, setting the Attraction strength higher will force related nodes closer together. Finally, the Gravity setting will draw nodes closer to the center of the graph if it is set to a high level, and disperse them toward the edges if a very low value is set. Force Atlas 2 is another layout option that employs slightly different parameters than the original Force Atlas method. You may wish to compare these methods and determine which one gives you better results. Fruchterman Reingold is one more Force method; albeit one that provides you with just three parameters – Area, Gravity, and Speed. While the general approach is not unlike the Force Atlas algorithms, your results will appear different in a visual sense. Finally, Gephi provides three Yifan Hu methods. Each of these models—Yifan Hu, Yifan Hu Proportional, and Yifan Hu Multilevel, are likely to run much more rapidly than the methods discussed earlier, while providing generally similar results. Gephi also provides a variety of methods that do not employ the force approach. Some of the models, as we noted earlier in this article, provide you with more control over the final result. This may be the result of selecting how to order the nodes, or of which attributes to use in grouping nodes together, either through color or location. In the section above, I referenced several layout options, but in the interest of space we'll take a closer look at two of them—the Circular and Radial Axis layouts. Circular layouts are well suited to relatively small networks, given the limited flexibility of their fixed layout. We can adjust this to some degree by specifying the diameter of the graph, but anything more than a few dozen well-connected nodes often becomes difficult to manage. However, with smaller networks, these layouts can be intriguing, providing us with the ability to see patterns within and between specific groups more easily than we might see them in some other layouts. While this article will not cover any filtering options, those too can be used to help us better utilize the circular layouts, by providing us with the ability to highlight specific groups and their resulting connections. Think of the circle resembling a giant spider web filled with connections, and the filters as tools that help us see specific threads within the web. Our final notes are on Radial Axis layouts, which can provide us with fascinating looks at our data, especially if there are natural groups within the network. Think of a school with several classrooms full of students, for example. Each of these classrooms can be easily identified and grouped, perhaps by color. In a complex force directed graph we may be able to spot each of these groups, but it may become difficult due to the interactions with other classes. In a Radial Axis layout we can dictate the group definitions, forcing each group to be bound together, apart from any other groups. There are pros and cons to this approach, of course, as there are with any of the other methods. If we wish to understand how a specific group interacts with another group, this method can prove beneficial, as it isolates these groups visually, making it easier to see connections between them. On the negative side, it is often quite difficult to see connections between members within the group, due to the nature of the layout. As with any layouts, it is critical to look at the results and see how they apply to our original need. Always test your data using multiple layout algorithms, so that you wind up with the best possible approach. Summary Gephi is an ideal tool for users new to network graph analysis and visualization, as it provides a rich set of tools to create and customize network graphs. The user interface makes it easy to understand basic concepts such as nodes and edges, as well as descriptive terminology such as neighbors, degrees, repulsion, and attraction. New users can move as slowly or as rapidly as they wish, given Gephi's gentle learning curve. Gephi can also help you see and understand patterns within your data through a variety of sophisticated graph methods that will appeal to both the novice as well as seasoned users. The variety of sophisticated layout algorithms will provide you the opportunity to experiment with multiple layouts as you search for the best approach to display your data. In short, Gephi provides everything needed to produce first-rate network visualizations. Resources for Article: Further resources on this subject: OpenSceneGraph: Advanced Scene Graph Components [Article] Cacti: Using Graphs to Monitor Networks and Devices [Article] OpenSceneGraph: Managing Scene GraphOpenSceneGraph: Managing Scene Graph [Article]

0
0
6270

article-image-disaster-recovery-mysql-python

Packt

25 Sep 2010

10 min read

Disaster Recovery in MySQL for Python

Packt

25 Sep 2010

10 min read

MySQL for Python Integrate the flexibility of Python and the power of MySQL to boost the productivity of your Python applications Implement the outstanding features of Python's MySQL library to their full potential See how to make MySQL take the processing burden from your programs Learn how to employ Python with MySQL to power your websites and desktop applications Apply your knowledge of MySQL and Python to real-world problems instead of hypothetical scenarios A manual packed with step-by-step exercises to integrate your Python applications with the MySQL database server Read more about this book (For more resources on Phython, see here.) The purpose of the archiving methods covered in this article is to allow you, as the developer, to back up databases that you use for your work without having to rely on the database administrator. As noted later in the article, there are more sophisticated methods for backups than we cover here, but they involve system-administrative tasks that are beyond the remit of any development post and are thus beyond the scope of this article. Every database needs a backup plan When archiving a database, one of the critical questions that must be answered is how to take a snapshot backup of the database without having users change the data in the process. If data changes in the midst of the backup, it results in an inconsistent backup and compromises the integrity of the archive. There are two strategic determinants for backing up a database system: Offline backups Live backups Which you use depends on the dynamics of the system in question and the import of the data being stored. In this article, we will look at each in turn and the way to implement them. Offline backups Offline backups are done by shutting down the server so the records can be archived without the fear of them being changed by the user. It also helps to ensure the server shut down gracefully and that errors were avoided. The problem with using this method on most production systems is that it necessitates a temporary loss of access to the service. For most service providers, such a consequence is anathema to the business model. The value of this method is that one can be certain that the database has not changed at all while the backup is run. Further, in many cases, the backup is performed faster because the processor is not simultaneously serving data. For this reason, offline backups are usually performed in controlled environments or in situations where disruption is not critical to the user. These include internal databases, where administrators can inform all users about the disruption ahead of time, and small business websites that do not receive a lot of traffic. Offline backups also have the benefit that the backup is usually held in a single file. This can then be used to copy a database across hosts with relative ease. Shutting down a server obviously requires system administrator-like authority. So creating an offline backup relies on the system administrator shutting down the server. If your responsibilities include database administration, you will also have sufficient permission to shut down the server. Live backups Live backups occur while the server continues to accept queries from users, while it's still online. It functions by locking down the tables so no new data may be written to them. Users usually do not lose access to the data and the integrity of the archive, for a particular point in time is assured. Live backups are used by large, data-intensive sites such as Nokia's Ovi services and Google's web services. However, because they do not always require administrator access of the server itself, these tend to suit the backup needs of a development project. Choosing a backup method After having determined whether a database can be stopped for the backup, a developer can choose from three methods of archiving: Copying the data files (including administrative files such as logs and tablespaces) Exporting delimited text files Backing up with command-line programs Which you choose depends on what permissions you have on the server and how you are accessing the data. MySQL also allows for two other forms of backup: using the binary log and by setting up replication (using the master and slave servers). To be sure, these are the best ways to back up a MySQL database. But, both of these are administrative tasks and require system-administrator authority; they are not typically available to a developer. However, you can read more about them in the MySQL documentation. Use of the binary log for incremental backups is documented at: http://dev.mysql.com/doc/refman/5.5/en/point-in-time-recovery.html Setting up replication is further dealt with at: http://dev.mysql.com/doc/refman/5.5/en/replication-solutions-backups.html Copying the table files The most direct way to back up database files is to copy from where MySQL stores the database itself. This will naturally vary based on platform. If you are unsure about which directory holds the MySQL database files, you can query MySQL itself to check: mysql> SHOW VARIABLES LIKE 'datadir'; Alternatively, the following shell command sequence will give you the same information: $ mysqladmin variables | grep datadir| datadir | /var/lib/mysql/ | Note that the location of administrative files, such as binary logs and InnoDB tablespaces are customizable and may not be in the data directory. If you do not have direct access to the MySQL server, you can also write a simple Python program to get the information: #!/usr/bin/env pythonimport MySQLdbmydb = MySQLdb.connect('<hostname>', '<user>', '<password>')cursor = mydb.cursor()runit = cursor.execute("SHOW VARIABLES LIKE 'datadir'")results = cursor.fetchall()print "%s: %s" %(cursor.fetchone()) Slight alteration of this program will also allow you to query several servers automatically. Simply change the login details and adapt the output to clarify which data is associated with which results. Locking and flushing If you are backing up an offline MyISAM system, you can copy any of the files once the server has been stopped. Before backing up a live system, however, you must lock the tables and flush the log files in order to get a consistent backup at a specific point. These tasks are handled by the LOCK TABLES and FLUSH commands respectively. When you use MySQL and its ancillary programs (such as mysqldump) to perform a backup, these tasks are performed automatically. When copying files directly, you must ensure both are done. How you apply them depends on whether you are backing up an entire database or a single table. LOCK TABLES The LOCK TABLES command secures a specified table in a designated way. Tables can be referenced with aliases using AS and can be locked for reading or writing. For our purposes, we need only a read lock to create a backup. The syntax looks like this: LOCK TABLES <tablename> READ; This command requires two privileges: LOCK TABLES and SELECT. It must be noted that LOCK TABLES does not lock all tables in a database but only one. This is useful for performing smaller backups that will not interrupt services or put too severe a strain on the server. However, unless you automate the process, manually locking and unlocking tables as you back up data can be ridiculously inefficient. FLUSH The FLUSH command is used to reset MySQL's caches. By re-initiating the cache at the point of backup, we get a clear point of demarcation for the database backup both in the database itself and in the logs. The basic syntax is straightforward, as follows: FLUSH <the object to be reset>; Use of FLUSH presupposes the RELOAD privilege for all relevant databases. What we reload depends on the process we are performing. For the purpose of backing up, we will always be flushing tables: FLUSH TABLES; How we "flush" the tables will depend on whether we have already used the LOCK TABLES command to lock the table. If we have already locked a given table, we can call FLUSH for that specific table: FLUSH TABLES <tablename>; However, if we want to copy an entire database, we can bypass the LOCK TABLES command by incorporating the same call into FLUSH: FLUSH TABLES WITH READ LOCK; This use of FLUSH applies across the database, and all tables will be subject to the read lock. If the account accessing the database does not have sufficient privileges for all databases, an error will be thrown. Unlocking the tables Once you have copied the files for a backup, you need to remove the read lock you imposed earlier. This is done by releasing all locks for the current session: UNLOCK TABLES; Restoring the data Restoring copies of the actual storage files is as simple as copying them back into place. This is best done when MySQL has stopped, lest you risk corruption. Similarly, if you have a separate MySQL server and want to transfer a database, you simply need to copy the directory structure from the one server to another. On restarting, MySQL will see the new database and treat it as if it had been created natively. When restoring the original data files, it is critical to ensure the permissions on the files and directories are appropriate and match those of the other MySQL databases. Delimited backups within MySQL MySQL allows for exporting of data from the MySQL command line. To do so, we simply direct the output from a SELECT statement to an output file. Using SELECT INTO OUTFILE to export data Using sakila, we can save the data from film to a file called film.data as follows: SELECT * INTO OUTFILE 'film.data' FROM film; This results in the data being written in a tab-delimited format. The file will be written to the directory in which MySQL stores the sakila data. Therefore, the account under which the SELECT statement is executed must have the FILE privilege for writing the file as well as login access on the server to view it or retrieve it. The OUTFILE option on SELECT can be used to write to any place on the server that MySQL has write permission to use. One simply needs to prepend that directory location to the file name. For example, to write the same file to the /tmp directory on a Unix system, use: SELECT * INTO OUTFILE '/tmp/film.data' FROM film; Windows simply requires adjustment of the directory structure accordingly. Using LOAD DATA INFILE to import data If you have an output file or similar tab-delimited file and want to load it into MySQL, use the LOAD DATA INFILE command. The basic syntax is: LOAD DATA INFILE '<filename>' INTO TABLE <tablename>; For example, to import the film.data file from the /tmp directory into another table called film2, we would issue this command: LOAD DATA INFILE '/tmp/film.data' INTO TABLE film2; Note that LOAD DATA INFILE presupposes the creation of the table into which the data is being loaded. In the preceding example, if film2 had not been created, we would receive an error. If you are trying to mirror a table, remember to use the SHOW CREATE TABLE query to save yourself time in formulating the CREATE statement. This discussion only touches on how to use LOAD DATA INFILE for inputting data created with the OUTFILE option of SELECT. But, the command handles text files with just about any set of delimiters. To read more on how to use it for other file formats, see the MySQL documentation at: http://dev.mysql.com/doc/refman/5.5/en/load-data.html

0
0
6243

article-image-understanding-model-based-clustering

Packt

14 Sep 2015

10 min read

Understanding Model-based Clustering

Packt

14 Sep 2015

10 min read

0
0
6232

article-image-regular-expressions-python-26-text-processing

Packt

13 Jan 2011

17 min read

Regular Expressions in Python 2.6 Text Processing

Packt

13 Jan 2011

17 min read

0
0
6212

article-image-make-things-pretty-ggplot2

Packt

24 Feb 2016

30 min read

Make Things Pretty with ggplot2

Packt

24 Feb 2016

30 min read

0
0
6191

Packt

29 Oct 2015

32 min read

Protecting Your Bitcoins

Packt

29 Oct 2015

32 min read

0
0
6188

article-image-linking-section-access-multiple-dimensions

Packt

25 Jun 2013

3 min read

Linking Section Access to multiple dimensions

Packt

25 Jun 2013

3 min read

(For more resources related to this topic, see here.) Getting ready Load the following script: Product:LOAD * INLINE [ ProductID, ProductGroup, ProductName 1, GroupA, Great As 2, GroupC, Super Cs 3, GroupC, Mega Cs 4, GroupB, Good Bs 5, GroupB, Busy Bs];Customer:LOAD * INLINE [ CustomerID, CustomerName, Country 1, Gatsby Gang, USA 2, Charly Choc, USA 3, Donnie Drake, USA 4, London Lamps, UK 5, Shylock Homes, UK];Sales:LOAD * INLINE [ CustomerID, ProductID, Sales 1, 2, 3536 1, 3, 4333 1, 5, 2123 2, 2, 45562, 4, 1223 2, 5, 6789 3, 2, 1323 3, 3, 3245 3, 4, 6789 4, 2, 2311 4, 3, 1333 5, 1, 7654 5, 2, 3455 5, 3, 6547 5, 4, 2854 5, 5, 9877];CountryLink:Load Distinct Country, Upper(Country) As COUNTRY_LINKResident Customer;Load Distinct Country, 'ALL' As COUNTRY_LINKResident Customer;ProductLink:Load Distinct ProductGroup, Upper(ProductGroup) As PRODUCT_LINKResident Product;Load Distinct ProductGroup, 'ALL' As PRODUCT_LINKResident Product;//Section Access;Access:LOAD * INLINE [ ACCESS, USERID, PRODUCT_LINK, COUNTRY_LINKADMIN, ADMIN, *, * USER, GM, ALL, ALL USER, CM1, ALL, USA USER, CM2, ALL, UK USER, PM1, GROUPA, ALL USER, PM2, GROUPB, ALL USER, PM3, GROUPC, ALL USER, SM1, GROUPB, UK USER, SM2, GROUPA, USA];Section Application; Note that there is a loop error generated on reload because there is a loop in the data structure. How to do it… Follow these steps to link Section Access to multiple dimensions: Add list boxes to the layout for ProductGroup and Country. Add a statistics box for Sales. Remove // to uncomment the Section Access statement. From the Settings menu, open Document Properties and select the Opening tab. Turn on the Initial Data Reduction Based on Section Access option. Reload and save the document. Close QlikView. Re-open QlikView and open the document. Log in as the Country Manager, CM1, user. Note that USA is the only country. Also, the product group, GroupA, is missing—there are no sales of this product group in USA. Close QlikView and then re-open again. This time, log in as the Sales Manager, SM2. You will not be allowed access to the document. Log into the document as the ADMIN user. Edit the script. Add a second entry for the SM2 user in the Access table as follows: USER, SM2, GROUPA, USA USER, SM2, GROUPB, UK Reload, save, and close the document and QlikView. Re-open and log in as SM2. Note the selections. How it works… Section Access is really quite simple. The user is connected to the data and the data is reduced accordingly. QlikView allows Section Access tables to be connected to multiple dimensions in the main data structure without causing issues with loops. Each associated field acts in the same way as a selection in the layout. The initial setting for the SM2 user contained values that were mutually exclusive. Because of the default Strict Exclusion setting, the SM2 user cannot log in. We changed the script and included multiple rows for the SM2 user. Intuitively, we might expect that, as the first row did not connect to the data, only the second row would connect to the data. However, each field value is treated as an individual selection and all of the values are included. There's more… If we wanted to include solely the composite association of Country and ProductGroup, we would need to derive a composite key in the data set and connect the user to that. In this example, we used the USERID field to test using QlikView logins. However, we would normally use NTNAME to link the user to either a Windows login or a custom login. Resources for Article : Further resources on this subject: Pentaho Reporting: Building Interactive Reports in Swing [Article] Visual ETL Development With IBM DataStage [Article] A Python Multimedia Application: Thumbnail Maker [Article]

0
0
6166

Packt

21 Jun 2011

7 min read

Segmenting images in OpenCV

Packt

21 Jun 2011

7 min read

OpenCV 2 Computer Vision Application Programming Cookbook Over 50 recipes to master this library of programming functions for real-time computer vision Read more about this book OpenCV (Open Source Computer Vision) is an open source library containing more than 500 optimized algorithms for image and video analysis. Since its introduction in 1999, it has been largely adopted as the primary development tool by the community of researchers and developers in computer vision. OpenCV was originally developed at Intel by a team led by Gary Bradski as an initiative to advance research in vision and promote the development of rich, vision-based CPU-intensive applications. In the previous article by Robert Laganière, author of OpenCV 2 Computer Vision Application Programming Cookbook, we took a look at image processing using morphological filters. In this article we will see how to segment images using watersheds and GrabCut algorithm. (For more resources related to this subject, see here.) Segmenting images using watersheds The watershed transformation is a popular image processing algorithm that is used to quickly segment an image into homogenous regions. It relies on the idea that when the image is seen as a topological relief, homogeneous regions correspond to relatively flat basins delimitated by steep edges. As a result of its simplicity, the original version of this algorithm tends to over-segment the image which produces multiple small regions. This is why OpenCV proposes a variant of this algorithm that uses a set of predefined markers which guide the definition of the image segments. How to do it... The watershed segmentation is obtained through the use of the cv::watershed function. The input to this function is a 32-bit signed integer marker image in which each non-zero pixel represents a label. The idea is to mark some pixels of the image that are known to certainly belong to a given region. From this initial labeling, the watershed algorithm will determine the regions to which the other pixels belong. In this recipe, we will first create the marker image as a gray-level image, and then convert it into an image of integers. We conveniently encapsulated this step into a WatershedSegmenter class: class WatershedSegmenter { private: cv::Mat markers;public: void setMarkers(const cv::Mat& markerImage) { // Convert to image of ints markerImage.convertTo(markers,CV_32S); } cv::Mat process(const cv::Mat &image) { // Apply watershed cv::watershed(image,markers); return markers; } The way these markers are obtained depends on the application. For example, some preprocessing steps might have resulted in the identification of some pixels belonging to an object of interest. The watershed would then be used to delimitate the complete object from that initial detection. In this recipe, we will simply use the binary image used in the previous article (OpenCV: Image Processing using Morphological Filters) in order to identify the animals of the corresponding original image. Therefore, from our binary image, we need to identify pixels that certainly belong to the foreground (the animals) and pixels that certainly belong to the background (mainly the grass). Here, we will mark foreground pixels with label 255 and background pixels with label 128 (this choice is totally arbitrary, any label number other than 255 would work). The other pixels, that is the ones for which the labeling is unknown, are assigned value 0. As it is now, the binary image includes too many white pixels belonging to various parts of the image. We will then severely erode this image in order to retain only pixels belonging to the important objects: // Eliminate noise and smaller objectscv::Mat fg;cv::erode(binary,fg,cv::Mat(),cv::Point(-1,-1),6); The result is the following image: Note that a few pixels belonging to the background forest are still present. Let's simply keep them. Therefore, they will be considered to correspond to an object of interest. Similarly, we also select a few pixels of the background by a large dilation of the original binary image: // Identify image pixels without objectscv::Mat bg;cv::dilate(binary,bg,cv::Mat(),cv::Point(-1,-1),6);cv::threshold(bg,bg,1,128,cv::THRESH_BINARY_INV); The resulting black pixels correspond to background pixels. This is why the thresholding operation immediately after the dilation assigns to these pixels the value 128. The following image is then obtained: These images are combined to form the marker image: // Create markers imagecv::Mat markers(binary.size(),CV_8U,cv::Scalar(0));markers= fg+bg; Note how we used the overloaded operator+ here in order to combine the images. This is the image that will be used as input to the watershed algorithm: The segmentation is then obtained as follows: // Create watershed segmentation objectWatershedSegmenter segmenter;// Set markers and processsegmenter.setMarkers(markers);segmenter.process(image); The marker image is then updated such that each zero pixel is assigned one of the input labels, while the pixels belonging to the found boundaries have value -1. The resulting image of labels is then: The boundary image is: How it works... As we did in the preceding recipe, we will use the topological map analogy in the description of the watershed algorithm. In order to create a watershed segmentation, the idea is to progressively flood the image starting at level 0. As the level of "water" progressively increases (to levels 1, 2, 3, and so on), catchment basins are formed. The size of these basins also gradually increase and, consequently, the water of two different basins will eventually merge. When this happens, a watershed is created in order to keep the two basins separated. Once the level of water has reached its maximal level, the sets of these created basins and watersheds form the watershed segmentation. As one can expect, the flooding process initially creates many small individual basins. When all of these are merged, many watershed lines are created which results in an over-segmented image. To overcome this problem, a modification to this algorithm has been proposed in which the flooding process starts from a predefined set of marked pixels. The basins created from these markers are labeled in accordance with the values assigned to the initial marks. When two basins having the same label merge, no watersheds are created, thus preventing the oversegmentation. This is what happens when the cv::watershed function is called. The input marker image is updated to produce the final watershed segmentation. Users can input a marker image with any number of labels with pixels of unknown labeling left to value 0. The marker image has been chosen to be an image of a 32-bit signed integer in order to be able to define more than 255 labels. It also allows the special value -1, to be assigned to pixels associated with a watershed. This is what is returned by the cv::watershed function. To facilitate the displaying of the result, we have introduced two special methods. The first one returns an image of the labels (with watersheds at value 0). This is easily done through thresholding: // Return result in the form of an imagecv::Mat getSegmentation() { cv::Mat tmp; // all segment with label higher than 255 // will be assigned value 255 markers.convertTo(tmp,CV_8U); return tmp;} Similarly, the second method returns an image in which the watershed lines are assigned value 0, and the rest of the image is at 255. This time, the cv::convertTo method is used to achieve this result: // Return watershed in the form of an imagecv::Mat getWatersheds() { cv::Mat tmp; // Each pixel p is transformed into // 255p+255 before conversion markers.convertTo(tmp,CV_8U,255,255); return tmp;} The linear transformation that is applied before the conversion allows -1 pixels to be converted into 0 (since -1*255+255=0). Pixels with a value greater than 255 are assigned the value 255. This is due to the saturation operation that is applied when signed integers are converted into unsigned chars. See also The article The viscous watershed transform by C. Vachier, F. Meyer, Journal of Mathematical Imaging and Vision, volume 22, issue 2-3, May 2005, for more information on the watershed transform. The next recipe which presents another image segmentation algorithm that can also segment an image into background and foreground objects.

0
0
6135

article-image-oracle-goldengate-11g-performance-tuning

Packt

01 Mar 2011

12 min read

Oracle GoldenGate 11g: Performance Tuning

Packt

01 Mar 2011

12 min read

0
0
6095

How-To Tutorials - Data

How to use Standard Macro in Workflows

The scripting Capabilities of Elasticsearch

Basic Concepts and Architecture of Cassandra

Building A Search Geo Locator with Elasticsearch and Spark

Deployment of Reports with BIRT

Iteration and Searching Keys

Creating Network Graphs with Gephi

Disaster Recovery in MySQL for Python

Understanding Model-based Clustering

Regular Expressions in Python 2.6 Text Processing

Trending Topics

Make Things Pretty with ggplot2

Protecting Your Bitcoins

Linking Section Access to multiple dimensions

Segmenting images in OpenCV

Oracle GoldenGate 11g: Performance Tuning

Create a Free Account To Continue Reading

Sign in to activate your 7-day free access