How-To Tutorials - Data

1229 Articles
Sunith Shetty
21 Feb 2018
6 min read

How to use Standard Macro in Workflows

Note: This article is an excerpt from the book Learning Alteryx by Renato Baruti. In this book, you will learn how to perform self-service analytics and create interactive dashboards using various tools in Alteryx.

Today we will learn about the Standard Macro, which provides a foundation for building enhanced workflows. The CSV file required for this tutorial is available to download here.

Standard Macro

Before getting into the Standard Macro, let's define what a macro is. A macro is a collection of workflow tools that are grouped together into one tool. Using a range of different Interface tools, a macro can be developed and used within a workflow. Any workflow can be turned into a macro, and a repeatable element of a workflow can commonly be converted into a macro.

There are a couple of ways you can turn your workflow into a Standard Macro. The first is to go to the canvas configuration pane and navigate to the Workflow tab. This is where you select what type of workflow you want. If you select Macro, Standard Macro should be selected automatically. Now, when you save this workflow, it will be saved as a macro. You'll then be able to add it to another workflow and run the process created within the macro itself. The second method is to add a Macro Input tool from the Interface tool palette onto the canvas; the workflow will then automatically change to a Standard Macro.

[Screenshot: selection of a Standard Macro under the Workflow tab]

Let's go through an example of creating and deploying a Standard Macro.

Standard Macro Example #1: Create a macro that allows the user to input a number used as a multiplier. Use the multiplier for the DataValueAlt field. The following steps demonstrate this process:

Step 1: Select the Macro Input tool from the Interface tool palette and add the tool onto the canvas. The workflow will automatically change to a Standard Macro.
Step 2: Select the Text Input and Edit Data options within the Macro Input tool configuration.
Step 3: Create a field called Number and enter the values 155, 243, 128, 352, and 357, one in each row.
Step 4: Set the Input Name to Input and the Anchor Abbreviation to I.
Step 5: Select the Formula tool from the Preparation tool palette. Connect the Formula tool to the Macro Input tool.
Step 6: Select the + Add Column option in the Select Column drop-down within the Formula tool configuration. Name the field Result.
Step 7: Add the following expression to the expression window: [Number]*0.50
Step 8: Select the Macro Output tool from the Interface tool palette and add the tool onto the canvas. Connect the Macro Output tool to the Formula tool.
Step 9: Set the Output Name to Output and the Anchor Abbreviation to O.

The Standard Macro has now been created. It can be saved and used as a multiplier that multiplies each of the five numbers entered in the Macro Input tool by 0.50. This is great; however, let's take it a step further and make it dynamic and flexible by allowing the user to enter a multiplier. For instance, the multiplier is currently set to 0.50, but what if a user wants to change it to 0.25 or 0.10 to determine the 25% or 10% value of a field? Let's continue building out the Standard Macro to make this possible.

Step 1: Select the Text Box tool from the Interface tool palette and drag it onto the canvas. Connect the Text Box tool to the Formula tool on the lightning bolt (the macro indicator). The Action tool will automatically be added to the canvas; it updates the configuration of a workflow with the values provided by interface questions when the workflow is run as an app or macro.
Step 2: Configure the Action tool so that it updates a specific part of the expression. Select Formula | FormulaFields | FormulaField | @expression - value="[Number]*0.50".
Select the Replace a specific string: option and enter 0.50. This is where the automation happens: the 0.50 is replaced by whatever number the user enters. You will see how this happens in the following steps.
Step 3: In the Enter the text or question to be displayed text box, within the Text Box tool configuration, enter: Please enter a number:
Step 4: Save the workflow as Standard Macro.yxmc. The .yxmc file type indicates that it is a macro workflow.
Step 5: Open a new workflow.
Step 6: Select the Input Data tool from the In/Out tool palette and connect to the U.S. Chronic Disease Indicators.csv file.
Step 7: Select the Select tool from the Preparation tool palette and drag it onto the canvas. Connect the Select tool to the Input Data tool.
Step 8: Change the Data Type for the DataValueAlt field to Double.
Step 9: Right-click on the canvas and select Insert | Macro | Standard Macro.
Step 10: Connect the Standard Macro to the Select tool.
Step 11: There will be Questions to answer within the Standard Macro tool configuration. Select DataValueAlt (Double) as the Choose Field option and enter 0.25 in the Please enter a number text box.
Step 12: Connect a Browse tool to the Standard Macro tool.
Step 13: Run the workflow.

The goal of creating this Standard Macro was to allow the user to choose the multiplier rather than rely on a static number. Let's recap what has been created and deployed using a Standard Macro. First, Standard Macro.yxmc was developed using Interface tools. The Macro Input (I) was used to enter sample text data for the Number field. The Number field is the value that gets multiplied by the given multiplier, in this case 0.50, which is the static multiplier. The Formula tool holds the expression that multiplies the Number field by 0.50. The Macro Output (O) was used to output the macro so that it can be used in another workflow.
The Text Box tool is where the question Please enter a number is displayed, and the Action tool is used to update the specific value being replaced. The current multiplier, 0.50, is replaced by 0.25, as entered in Step 11 of the second set of steps, through a dynamic input in which the user can enter the multiplier. Notice that, in the Browse tool output, the Result field has been added, multiplying the values of the DataValueAlt field by the multiplier 0.25. Change the value in the macro to 0.10 and run the workflow. The Result field is updated to multiply the values of the DataValueAlt field by the multiplier 0.10. This is a great use case for a Standard Macro and demonstrates how versatile the Interface tools are.

We learned about macros and their dynamic use within workflows. We saw how a Standard Macro was developed to allow the end user to specify the multiplier. This is a great way to add interactivity to a workflow. To learn more about high-quality interactive dashboards and efficient self-service data analytics, check out the book Learning Alteryx.
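The substitution performed by the Action tool can be illustrated outside Alteryx. The following is only a minimal Python sketch of the idea, not Alteryx code: the function names apply_action, evaluate, and run_macro are hypothetical, and the tiny evaluator assumes the expression always has the form [Field]*number.

```python
def apply_action(expression: str, old: str, new: str) -> str:
    """Mimic the 'Replace a specific string' option: swap the static multiplier."""
    if old not in expression:
        raise ValueError(f"{old!r} not found in expression {expression!r}")
    return expression.replace(old, new)


def evaluate(expression: str, row: dict) -> float:
    """Evaluate a very simple '[Field]*factor' expression against one row."""
    field_part, factor_part = expression.split("*")
    field = field_part.strip().strip("[]")
    return row[field] * float(factor_part)


def run_macro(values, user_multiplier: str, template: str = "[Number]*0.50"):
    """Apply the user's multiplier to every value and add a Result field."""
    expression = apply_action(template, "0.50", user_multiplier)
    return [
        {"Number": v, "Result": evaluate(expression, {"Number": v})}
        for v in values
    ]


if __name__ == "__main__":
    # The five sample values entered in the Macro Input tool.
    for row in run_macro([155, 243, 128, 352, 357], "0.25"):
        print(row)
```

The template expression stays untouched on disk, just as the saved macro keeps 0.50; only the copy used for a given run carries the multiplier the user typed.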

Packt
08 Jan 2016
19 min read

The scripting Capabilities of Elasticsearch

In this article by Rafał Kuć and Marek Rogozinski, authors of the book Elasticsearch Server - Third Edition, we look at the scripting capabilities of Elasticsearch. Elasticsearch has a few functionalities in which scripts can be used. Even though scripts seem to be a rather advanced topic, we will look at the possibilities offered by Elasticsearch, because scripts are priceless in certain situations.

Elasticsearch can use several languages for scripting. When not explicitly declared, it assumes that Groovy (http://www.groovy-lang.org/) is used. Other languages available out of the box are the Lucene expression language and Mustache (https://mustache.github.io/). Of course, we can use plugins that will make Elasticsearch understand additional scripting languages such as JavaScript, Mvel, or Python. One thing worth mentioning is this: regardless of the scripting language we choose, Elasticsearch exposes objects that we can use in our scripts. Let's start by briefly looking at what type of information we are allowed to use in our scripts.

Objects available during script execution

During different operations, Elasticsearch allows us to use different objects in our scripts. To develop a script that fits our use case, we should be familiar with those objects. For example, during a search operation, the following objects are available:

  • _doc (also available as doc): An instance of the org.elasticsearch.search.lookup.LeafDocLookup object. It gives us access to the current document found with the calculated score and field values.
  • _source: An instance of the org.elasticsearch.search.lookup.SourceLookup object. It provides access to the source of the current document and the values defined in the source.
  • _fields: An instance of the org.elasticsearch.search.lookup.LeafFieldsLookup object. It can be used to access the values of the document fields.

On the other hand, during a document update operation, the variables mentioned above are not accessible.
Elasticsearch exposes only the ctx object with the _source property, which provides access to the document currently processed in the update request.

As we have previously seen, several methods are mentioned in the context of document fields and their values. Let's now look at examples of how to get the value of a particular field using the previously mentioned objects available during search operations. In parentheses, you can see what Elasticsearch will return for one of our example documents from the library index (we will use the document with identifier 4):

  • _doc.title.value (and)
  • _source.title (crime and punishment)
  • _fields.title.value (null)

A bit confusing, isn't it? During indexing, the original document is, by default, stored in the _source field. Of course, by default, all fields are present in that _source field. In addition to this, the document is parsed, and every field may be stored in an index if it is marked as stored (that is, if the store property is set to true; otherwise, by default, the fields are not stored). Finally, the field value may be configured as indexed. This means that the field value is analyzed and placed in the index. To sum up, one field may land in an Elasticsearch index in the following ways:

  • As part of the _source document
  • As a stored and unparsed original value
  • As an indexed value that is processed by an analyzer

In scripts, we have access to all of these field representations. The only exception is the update operation, which, as we've mentioned before, gives us access to only the _source document as part of the ctx variable. You may wonder which version you should use. Well, if we want access to the processed form, the answer is simple: use the _doc object. What about _source and _fields? In most cases, _source is a good choice. It is usually fast and needs fewer disk operations than reading the original field values from the index.
This is especially true when you need to read the values of multiple fields in your scripts: fetching a single _source field is faster than fetching multiple independent fields from the index.

Script types

Elasticsearch allows us to use scripts in three different ways:

  • Inline scripts: The source of the script is directly defined in the query
  • In-file scripts: The source is defined in an external file placed in the Elasticsearch config/scripts directory
  • As a document in the dedicated index: The source of the script is defined as a document in a special index available by using the /_scripts API endpoint

The choice between these ways of defining scripts depends on several factors. If you have scripts that you will use in many different queries, the file or the dedicated index seems to be the best solution. Scripts in files are probably less convenient, but they are preferred from the security point of view: they can't be overwritten or injected through your query, which could otherwise cause a security breach.

In-file scripts

This is the only way that is turned on by default in Elasticsearch. The idea is that every script used by the queries is defined in its own file placed in the config/scripts directory. We will now look at this method of using scripts. Let's create an example file called tag_sort.groovy and place it in the config/scripts directory of our Elasticsearch instance (or instances if we are running a cluster). The content of the mentioned file should look like this:

    _doc.tags.values.size() > 0 ? _doc.tags.values[0] : 'u19999'

After a few seconds, Elasticsearch should automatically load the new file. You should see something like this in the Elasticsearch logs:

    [2015-08-30 13:14:33,005][INFO ][script                   ] [Alex Wilder] compiling script file [/Users/negativ/Developer/ES/es-current/config/scripts/tag_sort.groovy]

If you have a multinode cluster, you have to make sure that the script is available on every node. Now we are ready to use this script in our queries.
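To make the sort key concrete, here is a minimal Python sketch of what the Groovy expression in tag_sort.groovy computes for each document. It is an illustration only: the sample documents and the function name tag_sort_key are hypothetical, and the sketch reads tags from a plain list, whereas Elasticsearch would read the indexed (analyzed) values.

```python
def tag_sort_key(doc: dict, default: str = "u19999") -> str:
    """Return the first tag, or the fallback marker when the tags list is empty."""
    tags = doc.get("tags") or []
    return tags[0] if len(tags) > 0 else default


# Hypothetical documents, not the book's sample data.
docs = [
    {"title": "A", "tags": ["novel"]},
    {"title": "B", "tags": []},
    {"title": "C", "tags": ["classics", "crime"]},
]

# Ascending string sort on the computed key, as "order": "asc" would do.
ordered = sorted(docs, key=tag_sort_key)

if __name__ == "__main__":
    print([d["title"] for d in ordered])
```

With these sample tags, the document without tags sorts after the others, because its fallback key compares greater than both "classics" and "novel".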
A modified query that uses our script stored in the file looks as follows:

    curl -XGET 'localhost:9200/library/_search?pretty' -d '{
      "query" : {
        "match_all" : { }
      },
      "sort" : {
        "_script" : {
          "script" : {
            "file" : "tag_sort"
          },
          "type" : "string",
          "order" : "asc"
        }
      }
    }'

Next, we will look at another way of defining a script: inline.

Inline scripts

Inline scripts are a more convenient way of using scripts, especially for constantly changing queries or ad-hoc queries. The main drawback of such an approach is security. If we do this, we allow users to run any kind of query, including any kind of script that can be used by attackers. Such an attack can execute arbitrary code on the server running Elasticsearch with rights equal to the ones given to the user who is running Elasticsearch. In the worst-case scenario, an attacker could use security holes to gain superuser rights. This is why inline scripts are disabled by default. After careful consideration, you can enable them by adding this to the elasticsearch.yml file:

    script.inline: on

After allowing inline scripts to be executed, we can run a query that looks as follows (note that the double quotes inside the script must be escaped so that the JSON stays valid):

    curl -XGET 'localhost:9200/library/_search?pretty' -d '{
      "query" : {
        "match_all" : { }
      },
      "sort" : {
        "_script" : {
          "script" : {
            "inline" : "_doc.tags.values.size() > 0 ? _doc.tags.values[0] : \"u19999\""
          },
          "type" : "string",
          "order" : "asc"
        }
      }
    }'

Indexed scripts

The last option for defining scripts is to store them in the dedicated Elasticsearch index. For the same security reasons, dynamic execution of indexed scripts is disabled by default. To enable indexed scripts, we have to add a configuration option similar to the one that we added to be able to use inline scripts.
We need to add the following line to the elasticsearch.yml file:

    script.indexed: on

After adding the above property to all the nodes and restarting the cluster, we will be ready to start using indexed scripts. Elasticsearch provides additional dedicated endpoints for this purpose. Let's store our script:

    curl -XPOST 'localhost:9200/_scripts/groovy/tag_sort' -d '{
      "script" : "_doc.tags.values.size() > 0 ? _doc.tags.values[0] : \"u19999\""
    }'

The script is ready, but let's discuss what we just did. We sent an HTTP POST request to the special _scripts REST endpoint. We also specified the language of the script (groovy in our case) and the name of the script (tag_sort). The body of the request is the script itself. We can now move on to the query, which looks as follows:

    curl -XGET 'localhost:9200/library/_search?pretty' -d '{
      "query" : {
        "match_all" : { }
      },
      "sort" : {
        "_script" : {
          "script" : {
            "id" : "tag_sort"
          },
          "type" : "string",
          "order" : "asc"
        }
      }
    }'

As we can see, this query is practically identical to the query used with the script defined in a file. The only difference is the id parameter instead of file.

Querying with scripts

If we look at any request made to Elasticsearch that uses scripts, we will notice some similar properties, which are as follows:

  • script: The property that wraps the script definition.
  • inline: The property holding the code of the script itself.
  • id: The property that defines the identifier of the indexed script.
  • file: The filename (without extension) with the script definition when an in-file script is used.
  • lang: The property defining the script language. If it is omitted, Elasticsearch assumes groovy.
  • params: An object containing parameters and their values. Every defined parameter can be used inside the script by specifying that parameter name. Parameters allow us to write cleaner code that will be executed in a more efficient manner.
Scripts that use parameters are executed faster than code with embedded constants because of caching.

Scripting with parameters

As our scripts become more and more complicated, the need for creating multiple, almost identical scripts can appear. Those scripts usually differ in the values used, with the logic behind them being exactly the same. In our simple example, we have used a hardcoded value to mark documents with an empty tags list. Let's change this so that the value is passed in as a parameter instead of being hardcoded. Let's use an in-file script definition and create the tag_sort_with_param.groovy file with the following contents:

    _doc.tags.values.size() > 0 ? _doc.tags.values[0] : tvalue

The only change we've made is the introduction of a parameter named tvalue, which can be set in the query in the following way:

    curl -XGET 'localhost:9200/library/_search?pretty' -d '{
      "query" : {
        "match_all" : { }
      },
      "sort" : {
        "_script" : {
          "script" : {
            "file" : "tag_sort_with_param",
            "params" : {
              "tvalue" : "000"
            }
          },
          "type" : "string",
          "order" : "asc"
        }
      }
    }'

The params section defines all the script parameters. In our simple example, we've only used a single parameter, but of course, we can have multiple parameters in a single query.

Script languages

The default language for scripting is Groovy. However, you are not limited to only a single scripting language when using Elasticsearch. In fact, if you would like to, you can even use Java to write your scripts. In addition to that, the community behind Elasticsearch provides support for more languages as plugins. So, if you are willing to install plugins, you can extend the list of scripting languages that Elasticsearch supports even further. You may wonder why you should even consider using a scripting language other than the default Groovy. The first reason is your own preferences.
If you are a Python enthusiast, you are probably now thinking about how to use Python for your Elasticsearch scripts. The other reason could be security. When we talked about inline scripts, we told you that inline scripts are turned off by default. This is not exactly true for all the scripting languages available out of the box. Inline scripts are disabled by default when using Groovy, but you can use Lucene expressions and Mustache without any issues. This is because those languages are sandboxed, which means that security-sensitive functions are turned off. And of course, the last factor when choosing the language is performance. Theoretically, native scripts (in Java) should have better performance than others, but you should remember that the difference can be insignificant. You should always consider the cost of development and measure the performance.

Using something other than embedded languages

Using Groovy for scripting is a simple and sufficient solution for most use cases. However, you may have a different preference and would like to use something different, such as JavaScript, Python, or Mvel. For this example, we will use JavaScript, so we'll run the following command from the Elasticsearch directory:

    bin/plugin install elasticsearch/elasticsearch-lang-javascript/2.7.0

The preceding command will install a plugin that allows the use of JavaScript as the scripting language. The only change we should make in the request is to add information about the language we are using for scripting. And of course, we have to modify the script itself to correctly use the new language. Look at the following example:

    curl -XGET 'localhost:9200/library/_search?pretty' -d '{
      "query" : {
        "match_all" : { }
      },
      "sort" : {
        "_script" : {
          "script" : {
            "inline" : "_doc.tags.values.length > 0 ? _doc.tags.values[0] : \"u19999\";",
            "lang" : "javascript"
          },
          "type" : "string",
          "order" : "asc"
        }
      }
    }'

As you can see, we've used JavaScript for scripting instead of the default Groovy. The lang parameter informs Elasticsearch about the language being used.

Using native code

If the scripts are too slow or if you don't like scripting languages, Elasticsearch allows you to write Java classes and use them instead of scripts. There are two possible ways of adding native scripts: adding classes that define scripts to the Elasticsearch classpath, or adding a script as functionality provided by a plugin. We will describe the second solution as it is more elegant.

The factory implementation

We need to implement at least two classes to create a new native script. The first one is a factory for our script. For now, let's focus on it. The following sample code illustrates the factory for our script:

    package pl.solr.elasticsearch.examples.scripts;

    import java.util.Map;
    import org.elasticsearch.common.Nullable;
    import org.elasticsearch.script.ExecutableScript;
    import org.elasticsearch.script.NativeScriptFactory;

    public class HashCodeSortNativeScriptFactory implements NativeScriptFactory {
      @Override
      public ExecutableScript newScript(@Nullable Map<String, Object> params) {
        return new HashCodeSortScript(params);
      }

      @Override
      public boolean needsScores() {
        return false;
      }
    }

This class should implement the org.elasticsearch.script.NativeScriptFactory interface. The interface forces us to implement two methods. The newScript() method takes the parameters defined in the API call and returns an instance of our script. The needsScores() method tells Elasticsearch whether our script uses the document score, so that it knows whether scores have to be calculated.

Implementing the native script

Now let's look at the implementation of our script. The idea is simple: our script will be used for sorting. The documents will be ordered by the hashCode() value of the chosen field.
Documents without a value in the defined field will be first on the results list. We know that the logic doesn't make much sense, but it is good for presentation as it is simple. The source code for our native script looks as follows:

    package pl.solr.elasticsearch.examples.scripts;

    import java.util.Map;
    import org.elasticsearch.script.AbstractSearchScript;

    public class HashCodeSortScript extends AbstractSearchScript {
      // Field to sort on; "name" is used unless the caller overrides it
      private String field = "name";

      public HashCodeSortScript(Map<String, Object> params) {
        if (params != null && params.containsKey("field")) {
          this.field = params.get("field").toString();
        }
      }

      @Override
      public Object run() {
        // Read the field from the _source of the current document
        Object value = source().get(field);
        if (value != null) {
          return value.hashCode();
        }
        return 0;
      }
    }

First of all, our class inherits from the org.elasticsearch.script.AbstractSearchScript class and implements the run() method. This is where we get the appropriate values from the current document, process them according to our strange logic, and return the result. You may notice the source() call. Yes, it is exactly the same _source parameter that we met in the non-native scripts. The doc() and fields() methods are also available, and they follow the same logic that we described earlier. The thing worth looking at is how we've used the parameters. We assume that a user can pass the field parameter, telling us which document field will be used for manipulation. We also provide a default value for this parameter.

The plugin definition

We said that we will install our script as a part of a plugin. This is why we need additional files.
The first file is the plugin initialization class, where we can tell Elasticsearch about our new script:

    package pl.solr.elasticsearch.examples.scripts;

    import org.elasticsearch.plugins.Plugin;
    import org.elasticsearch.script.ScriptModule;

    public class ScriptPlugin extends Plugin {
      @Override
      public String description() {
        return "The example of native sort script";
      }

      @Override
      public String name() {
        return "naive-sort-plugin";
      }

      public void onModule(final ScriptModule module) {
        module.registerScript("native_sort",
          HashCodeSortNativeScriptFactory.class);
      }
    }

The implementation is easy. The description() and name() methods are only for information purposes, so let's focus on the onModule() method. In our case, we need access to the script module, the Elasticsearch service connected with scripts and scripting languages. This is why we define onModule() with one ScriptModule argument. Thanks to Elasticsearch magic, we can use this module and register our script so that it can be found by the engine. We have used the registerScript() method, which takes the script name and the previously defined factory class.

The second file needed is a plugin descriptor file: plugin-descriptor.properties. It defines the constants used by the Elasticsearch plugin subsystem. Without further ado, let's look at the contents of this file:

    jvm=true
    classname=pl.solr.elasticsearch.examples.scripts.ScriptPlugin
    elasticsearch.version=2.0.0-beta2-SNAPSHOT
    version=0.0.1-SNAPSHOT
    name=native_script
    description=Example Native Scripts
    java.version=1.7

The appropriate lines have the following meaning:

  • jvm: This tells Elasticsearch that our file contains Java code
  • classname: This describes the main class with the plugin definition
  • elasticsearch.version and java.version: They tell about the Elasticsearch and Java versions needed for our plugin
  • name and description: These are an informative name and a short description of our plugin

And that's it!
We have all the files needed to fire our script. Note that now it is quite convenient to add new scripts and pack them as a single plugin.

Installing a plugin

Now it's time to install our native script embedded in the plugin. After packing the compiled classes as a JAR archive, we should put it into the Elasticsearch plugins/native-script directory. The native-script part is a root directory for our plugin and you may name it as you wish. In this directory, you also need the prepared plugin-descriptor.properties file. This makes our plugin visible to Elasticsearch.

Running the script

After restarting Elasticsearch (or the entire cluster if you are running more than a single node), we can start sending the queries that use our native script. For example, we will send a query that uses our previously indexed data from the library index. This example query looks as follows:

    curl -XGET 'localhost:9200/library/_search?pretty' -d '{
      "query" : {
        "match_all" : { }
      },
      "sort" : {
        "_script" : {
          "script" : {
            "script" : "native_sort",
            "lang" : "native",
            "params" : {
              "field" : "otitle"
            }
          },
          "type" : "string",
          "order" : "asc"
        }
      }
    }'

Note the params part of the query. In this call, we want to sort on the otitle field. We provide the script name as native_sort and the script language as native. This is required. If everything goes well, we should see our results sorted by our custom sort logic. If we look at the response from Elasticsearch, we will see that documents without the otitle field are at the first few positions of the results list and their sort value is 0.

Summary

In this article, we focused on querying: not the matching part of it, but mostly scoring. You learned how Apache Lucene TF/IDF scoring works. We saw the scripting capabilities of Elasticsearch and handled multilingual data.
We also used boosting to influence how scores of returned documents were calculated and we used synonyms. Finally, we used explain information to see how document scores were calculated by query.
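The curl examples in this article place a script inside a JSON string inside a single-quoted shell argument, so quote escaping is easy to get wrong. One way to avoid hand-escaping is to build the request body as a data structure and let a JSON library serialize it. The following is a minimal sketch using only Python's standard json module; the helper name script_sort_body is hypothetical, and the sketch only builds the body shown in the article without sending it anywhere.

```python
import json


def script_sort_body(script: dict, order: str = "asc", sort_type: str = "string") -> dict:
    """Build a match_all search body sorted by the given script definition."""
    return {
        "query": {"match_all": {}},
        "sort": {
            "_script": {"script": script, "type": sort_type, "order": order}
        },
    }


# Inline variant: the inner double quotes are escaped by json.dumps, not by hand.
inline_body = script_sort_body(
    {"inline": '_doc.tags.values.size() > 0 ? _doc.tags.values[0] : "u19999"'}
)

# In-file variant with a parameter, matching tag_sort_with_param.groovy.
param_body = script_sort_body(
    {"file": "tag_sort_with_param", "params": {"tvalue": "000"}}
)

payload = json.dumps(inline_body, indent=2)

if __name__ == "__main__":
    print(payload)
```

The same helper covers the file, id, and inline variants, because they differ only in the dictionary passed as the script definition.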

Packt
21 Nov 2013
7 min read

Basic Concepts and Architecture of Cassandra

CAP theorem

If you want to understand Cassandra, you first need to understand the CAP theorem. The CAP theorem (published by Eric Brewer at the University of California, Berkeley) basically states that it is impossible for a distributed system to provide you with all of the following three guarantees:

  • Consistency: Updates to the state of the system are seen by all the clients simultaneously
  • Availability: Guarantee of the system to be available for every valid request
  • Partition tolerance: The system continues to operate despite arbitrary message loss or network partition

Cassandra provides users with strong availability and partition tolerance, with consistency as a tunable tradeoff; the client, while writing to and/or reading from Cassandra, can pass a consistency level that drives the consistency requirements for the requested operations.

BigTable / Log-structured data model

In a BigTable data model, the primary key and column names are mapped to their respective bytes of value to form a multidimensional map. Each table has multiple dimensions. Timestamp is one such dimension; it allows the table to version the data and is also used for internal garbage collection (of deleted data). In this structure, the row key serves as the identifier of the columns that follow it, and the column name and value are stored in contiguous blocks.

[Figure: the row key followed by contiguous column name/value blocks]

It is important to note that every row has the column names stored along with the values, allowing the schema to be dynamic.

Column families

Columns are grouped into sets called column families, which can be addressed through a row key (primary key). All the data stored in a column family is of the same type. A column family must be created before any data can be stored; any column key within the family can be used.
The intent is that the number of distinct column families in a keyspace should be small, and that the families should rarely change during operation. In contrast, a column family may have an unbounded number of columns. Both disk and memory accounting are performed at the column family level.

Keyspace

A keyspace is a group of column families; replication strategies and ACLs are defined at the keyspace level. If you are familiar with traditional RDBMS, you can consider the keyspace as an alternative name for the schema and the column family as an alternative name for tables.

Sorted String Table (SSTable)

An SSTable provides a persistent file format for Cassandra; it is an ordered, immutable storage structure of rows of columns (name/value pairs). Operations are provided to look up the value associated with a specific key and to iterate over all the column names and value pairs within a specified key range. Internally, each SSTable contains a sequence of row keys and a set of column key/value pairs. An index that records the start location of each row key is stored separately in an index file. The index summary is loaded into memory when the SSTable is opened in order to optimize the amount of memory needed for the index. A lookup for actual rows can be performed with a single disk seek followed by a sequential scan for the data.

Memtable

A memtable is a memory location where data is written to during update or delete operations. A memtable is a temporary location and will be flushed to the disk once it is full to form an SSTable. Basically, an update or a write operation to Cassandra is a sequential write to the commit log on disk and a memory update; hence, writes are as fast as writing to memory. Once the memtables are full, they are flushed to the disk, forming new SSTables.

[Figure: memtables being flushed to disk as new SSTables]

Reads in Cassandra will merge the data from different SSTables and the data in memtables.
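The write and read paths described above can be sketched in a few lines. The following Python toy model is only an illustration under simplifying assumptions, not Cassandra's implementation: the class name TinyStore and the memtable_limit parameter are hypothetical, "disk" is just an in-memory list, and each key holds a single timestamped value rather than a row of columns.

```python
class TinyStore:
    """Toy model of a commit log, a memtable, and immutable SSTables."""

    def __init__(self, memtable_limit: int = 2):
        self.memtable_limit = memtable_limit
        self.commit_log = []  # append-only, stands in for the sequential disk write
        self.memtable = {}    # key -> (timestamp, value), held in memory
        self.sstables = []    # each flush adds one immutable, key-sorted tuple

    def write(self, key, value, timestamp):
        # A write is a sequential append to the log plus a memory update.
        self.commit_log.append((key, value, timestamp))
        current = self.memtable.get(key)
        if current is None or timestamp >= current[0]:
            self.memtable[key] = (timestamp, value)
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # A full memtable becomes a new sorted, immutable SSTable.
        if self.memtable:
            self.sstables.append(tuple(sorted(self.memtable.items())))
            self.memtable = {}

    def read(self, key):
        # A read merges the memtable with every SSTable; the newest timestamp wins.
        candidates = []
        if key in self.memtable:
            candidates.append(self.memtable[key])
        for table in self.sstables:
            for table_key, cell in table:
                if table_key == key:
                    candidates.append(cell)
        if not candidates:
            return None
        return max(candidates, key=lambda cell: cell[0])[1]
```

Because a flushed SSTable is never modified, an overwrite of a key lands in the memtable (and later in a newer SSTable), which is why reads have to consult several places and pick the most recent value.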
Reads should always be requested with a row key (primary key) with the exception of a key range scan.

When multiple updates are applied to the same column, Cassandra uses client-provided timestamps to resolve conflicts. Delete operations on a column work a little differently; because SSTables are immutable, Cassandra writes a tombstone to avoid random writes. A tombstone is a special value written to Cassandra instead of removing the data immediately. The tombstone can then be sent to nodes that did not get the initial remove request, and can be removed during garbage collection (GC).

Compaction

To bound the number of SSTable files that must be consulted on reads and to reclaim the space taken by unused data, Cassandra performs compactions. In a nutshell, compaction merges n SSTables (n is configurable) into one big SSTable. SSTables start out being the same size as the memtables, so they become exponentially bigger as they grow older.

Partitioning and replication Dynamo style

As mentioned previously, the partitioner and replication scheme is motivated by the Dynamo paper; let's talk about it in detail.

Gossip protocol

Cassandra is a peer-to-peer system with no single point of failure; the cluster topology information is communicated via the Gossip protocol. The Gossip protocol is similar to real-world gossip, where a node (say B) tells a few of its peers in the cluster what it knows about the state of a node (say A). Those nodes tell a few other nodes about A, and over a period of time, all the nodes know about A.

Distributed hash table

The key feature of Cassandra is the ability to scale incrementally. This includes the ability to dynamically partition the data over a set of nodes in the cluster. Cassandra partitions data across the cluster using consistent hashing and randomly distributes the rows over the network using the hash of the row key.
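To make consistent hashing a little more concrete, here is a minimal Python sketch. It assumes one position (token) per node on a ring and uses MD5 purely as a convenient hash; the `Ring` class, the `token_for` helper, and the node and key names are hypothetical and are not Cassandra's partitioner.

```python
import bisect
import hashlib


def token_for(key):
    """Map a string onto the ring; MD5 is an arbitrary choice for this sketch."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self, nodes):
        # Each node owns one token; the ring is the sorted list of tokens.
        self.entries = sorted((token_for(node), node) for node in nodes)
        self.tokens = [token for token, _ in self.entries]

    def node_for(self, row_key):
        # Pick the first node whose token is >= the key's hash, wrapping around.
        index = bisect.bisect_left(self.tokens, token_for(row_key))
        return self.entries[index % len(self.entries)][1]


ring = Ring(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))

# Adding a node only moves the keys that now hash to the new node.
bigger = Ring(["node-a", "node-b", "node-c", "node-d"])
keys = ["user:%d" % i for i in range(1000)]
moved = [k for k in keys if ring.node_for(k) != bigger.node_for(k)]
print(all(bigger.node_for(k) == "node-d" for k in moved))  # True
```

The second half shows why this scheme suits incremental scaling: when a node is added, every key that changes owner moves to the new node, and all other keys stay where they were.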
When a node joins the ring, it is assigned a token that determines where the node is placed in the ring.

Now consider a case where the replication factor is 3; clients randomly write to or read from a coordinator (every node in the system can act as a coordinator and a data node) in the cluster. The hash of the row key gives the coordinator enough information to write to the right node in the ring. The coordinator also looks at the replication factor and writes to the neighboring nodes in the ring order.

Eventual consistency

Given a sufficient period of time over which no changes are sent, all updates can be expected to propagate through the system and the replicas created will be consistent. Cassandra supports both the eventual consistency model and strong consistency model, which can be controlled from the client while performing an operation. Cassandra supports various consistency levels while writing or reading data. The consistency level drives the number of data replicas the coordinator has to contact before acknowledging the client. If W + R > Replication Factor, where W is the number of nodes to block on write and R the number to block on reads, the clients will see a strong consistency behavior:

ONE: R/W at least one node
TWO: R/W at least two nodes
QUORUM: R/W from at least floor(N/2) + 1 nodes, where N is the replication factor

When nodes are down for maintenance, Cassandra will store hints for updates performed on that node, which can be delivered when the node becomes available again. To make data consistent, Cassandra relies on hinted handoffs, read repairs, and anti-entropy repairs.

Summary

In this article, we have discussed basic concepts and basic building blocks, including the motivation in building a new datastore solution.

Packt
31 Jan 2017
12 min read

Building A Search Geo Locator with Elasticsearch and Spark

In this article, Alberto Paro, the author of the book Elasticsearch 5.x Cookbook - Third Edition, discusses how to use and manage Elasticsearch, covering topics such as installation/setup, mapping management, indices management, queries, aggregations/analytics, scripting, building custom plugins, and integration with Python, Java, Scala and some big data tools such as Apache Spark and Apache Pig.

Background

Elasticsearch is a common answer to every kind of data search need, and with its aggregation framework it can also provide analytics in real time. Elasticsearch was one of the first software products able to bring search to the big data world. Its cloud-native design, JSON as the standard format for both data and search, and its HTTP-based approach are the solid foundations of this product. Elasticsearch solves a growing list of search, log analysis, and analytics challenges across virtually every industry. It's used by big companies such as LinkedIn, Wikipedia, Cisco, eBay, Facebook, and many others (source: https://www.elastic.co/use-cases).

In this article, we will show how to easily build a simple search geolocator with Elasticsearch using Apache Spark for ingestion.

Objective

In this article, we will develop a search geolocator application using the world GeoNames database. To make this happen, the following steps will be covered:

Data collection
Optimized index creation
Ingestion via Apache Spark
Searching for a location name
Searching for a city given a location position
Executing some analytics on the dataset

All the article code is available on GitHub at https://github.com/aparo/elasticsearch-geonames-locator. All the commands below need to be executed in the code directory on Linux/MacOS X. The requirements are a local Elasticsearch server instance, a working local Spark installation, and SBT installed (http://www.scala-sbt.org/).

Data collection

To populate our application we need a database of geo locations.
One of the best-known and most widely used datasets is the GeoNames geographical database, which is available for download free of charge under a Creative Commons Attribution license. It contains over 10 million geographical names and consists of over 9 million unique features, of which 2.8 million are populated places and 5.5 million are alternate names. It can be easily downloaded from http://download.geonames.org/export/dump. The dump directory provides CSV files divided by country, but in our case we'll take the dump with all the countries, the allCountries.zip file.

To download the data we can use wget:

    wget http://download.geonames.org/export/dump/allCountries.zip

Then we need to unzip it and put it in the downloads folder:

    unzip allCountries.zip
    mv allCountries.txt downloads

The Geoname dump has the following fields:

    No.  Attribute name   Explanation
    1    geonameid        Unique ID for this geoname
    2    name             The name of the geoname
    3    asciiname        ASCII representation of the name
    4    alternatenames   Other forms of this name, generally in several languages
    5    latitude         Latitude in decimal degrees of the Geoname
    6    longitude        Longitude in decimal degrees of the Geoname
    7    fclass           Feature class, see http://www.geonames.org/export/codes.html
    8    fcode            Feature code, see http://www.geonames.org/export/codes.html
    9    country          ISO-3166 2-letter country code
    10   cc2              Alternate country codes, comma separated, ISO-3166 2-letter country code
    11   admin1           FIPS code (subject to change to ISO code)
    12   admin2           Code for the second administrative division, a county in the US
    13   admin3           Code for third level administrative division
    14   admin4           Code for fourth level administrative division
    15   population       The population of Geoname
    16   elevation        The elevation in meters of Geoname
    17   gtopo30          Digital elevation model
    18   timezone         The timezone of Geoname
    19   moddate          The date of last change of this Geoname

Table 1: Dataset characteristics

Optimized Index creation

Elasticsearch provides automatic schema inference for your data, but the inferred schema is not the best
possible. Often you need to tune it by:

Removing fields that are not required
Managing geo fields
Optimizing string fields that are indexed twice, in their tokenized and keyword versions

Given the Geoname dataset, we will add a new field, location, which is a GeoPoint that we will use in geo searches.

Another important optimization for indexing is to define the correct number of shards. In this case we have only 11M records, so using only 2 shards is enough.

The settings for creating our optimized index with mapping and shards are the following:

    {
      "mappings": {
        "geoname": {
          "properties": {
            "admin1": { "type": "keyword", "ignore_above": 256 },
            "admin2": { "type": "keyword", "ignore_above": 256 },
            "admin3": { "type": "keyword", "ignore_above": 256 },
            "admin4": { "type": "keyword", "ignore_above": 256 },
            "alternatenames": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } },
            "asciiname": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } },
            "cc2": { "type": "keyword", "ignore_above": 256 },
            "country": { "type": "keyword", "ignore_above": 256 },
            "elevation": { "type": "long" },
            "fclass": { "type": "keyword", "ignore_above": 256 },
            "fcode": { "type": "keyword", "ignore_above": 256 },
            "geonameid": { "type": "long" },
            "gtopo30": { "type": "long" },
            "latitude": { "type": "float" },
            "location": { "type": "geo_point" },
            "longitude": { "type": "float" },
            "moddate": { "type": "date" },
            "name": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } },
            "population": { "type": "long" },
            "timezone": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } }
          }
        }
      },
      "settings": {
        "index": { "number_of_shards": "2", "number_of_replicas": "1" }
      }
    }

We can store the above JSON in a file called settings.json and create the index via the curl command:

    curl -XPUT http://localhost:9200/geonames -d @settings.json

Now our index is created and ready to receive our
documents.

Ingestion via Apache Spark

Apache Spark is very handy for processing CSV files and manipulating the data before saving it to storage, whether on disk or in a NoSQL store. Elasticsearch provides easy integration with Apache Spark, allowing you to write a Spark RDD to Elasticsearch with a single command.

We will build a Spark job called GeonameIngester that will execute the following steps:

Initialize the Spark job
Parse the CSV
Define our required structures and conversions
Populate our classes
Write the RDD to Elasticsearch
Execute the Spark job

Initialize the Spark Job

We need to import the required classes:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._
    import org.elasticsearch.spark.rdd.EsSpark
    import scala.util.Try

We define the GeonameIngester object and the SparkSession:

    object GeonameIngester {
      def main(args: Array[String]) {
        val sparkSession = SparkSession.builder
          .master("local")
          .appName("GeonameIngester")
          .getOrCreate()

To easily serialize complex datatypes, we switch to the Kryo encoder:

    import scala.reflect.ClassTag
    implicit def kryoEncoder[A](implicit ct: ClassTag[A]) =
      org.apache.spark.sql.Encoders.kryo[A](ct)
    import sparkSession.implicits._

Parse the CSV

For parsing the CSV, we need to define the Geoname schema to be used for reading:

    val geonameSchema = StructType(Array(
      StructField("geonameid", IntegerType, false),
      StructField("name", StringType, false),
      StructField("asciiname", StringType, true),
      StructField("alternatenames", StringType, true),
      StructField("latitude", FloatType, true),
      StructField("longitude", FloatType, true),
      StructField("fclass", StringType, true),
      StructField("fcode", StringType, true),
      StructField("country", StringType, true),
      StructField("cc2", StringType, true),
      StructField("admin1", StringType, true),
      StructField("admin2", StringType, true),
      StructField("admin3", StringType, true),
      StructField("admin4", StringType, true),
      StructField("population", DoubleType, true), // Asia population overflows Integer
      StructField("elevation", IntegerType, true),
      StructField("gtopo30", IntegerType, true),
      StructField("timezone", StringType, true),
      StructField("moddate", DateType, true)))

Now we can read all the geonames from the CSV (the file is tab-delimited) via:

    val GEONAME_PATH = "downloads/allCountries.txt"
    val geonames = sparkSession.sqlContext.read
      .option("header", false)
      .option("quote", "")
      .option("delimiter", "\t")
      .option("maxColumns", 22)
      .schema(geonameSchema)
      .csv(GEONAME_PATH)
      .cache()

Defining our required structures and conversions

The plain CSV data is not suitable for our advanced requirements, so we define new classes to store our Geoname data. We define a GeoPoint case class to store the geo point location of our geoname.

    case class GeoPoint(lat: Double, lon: Double)

We also define our Geoname class with optional and list types:

    case class Geoname(geonameid: Int, name: String, asciiname: String, alternatenames: List[String],
      latitude: Float, longitude: Float, location: GeoPoint, fclass: String, fcode: String,
      country: String, cc2: String, admin1: Option[String], admin2: Option[String],
      admin3: Option[String], admin4: Option[String], population: Double, elevation: Int,
      gtopo30: Int, timezone: String, moddate: String)

To reduce the boilerplate of the conversion, we define an implicit method that converts a String into an Option[String], returning None if it is empty or null.

    implicit def emptyToOption(value: String): Option[String] = {
      if (value == null) return None
      val clean = value.trim
      if (clean.isEmpty) {
        None
      } else {
        Some(clean)
      }
    }

During processing, an integer value such as the elevation can be null, so we need a function that fixes this value and sets it to 0. To do this we define the function fixNullInt:

    def fixNullInt(value: Any): Int = {
      if (value == null) 0
      else {
        Try(value.asInstanceOf[Int]).toOption.getOrElse(0)
      }
    }

Populating our classes

We can populate the records that we need to store in Elasticsearch via a map on the geonames DataFrame.
    val records = geonames.map { row =>
      val id = row.getInt(0)
      val lat = row.getFloat(4)
      val lon = row.getFloat(5)
      Geoname(id, row.getString(1), row.getString(2),
        Option(row.getString(3)).map(_.split(",").map(_.trim).filterNot(_.isEmpty).toList).getOrElse(Nil),
        lat, lon, GeoPoint(lat, lon),
        row.getString(6), row.getString(7), row.getString(8), row.getString(9),
        row.getString(10), row.getString(11), row.getString(12), row.getString(13),
        row.getDouble(14), fixNullInt(row.get(15)), row.getInt(16), row.getString(17),
        row.getDate(18).toString
      )
    }

Writing the RDD in Elasticsearch

The final step is to store our newly built records in Elasticsearch via:

    EsSpark.saveToEs(records.toJavaRDD, "geonames/geoname", Map("es.mapping.id" -> "geonameid"))

The value "geonames/geoname" is the index/type used to store the records in Elasticsearch. To keep the same ID for the geonames in both the CSV and Elasticsearch, we pass an additional parameter, es.mapping.id, which names the field that holds the ID to be used in Elasticsearch (geonameid in the above example).

Executing the Spark Job

To execute a Spark job you need to build a JAR with all the required libraries and then execute it on Spark. The first step is done via the sbt assembly command, which generates a fat JAR containing only the required libraries. To submit the Spark job in the JAR, we can use the spark-submit command:

    spark-submit --class GeonameIngester target/scala-2.11/elasticsearch-geonames-locator-assembly-1.0.jar

Now you need to wait (about 20 minutes on my machine) for Spark to send all the documents to Elasticsearch and for them to be indexed.

Searching for a location name

After having indexed all the geonames, you can search for them.
If we want to search for Moscow, we need a complex query because:

Cities in GeoNames are entities with fclass="P"
We want to skip unpopulated cities
We sort by population in descending order so that the most populated places come first
The city name can be in the name, alternatenames, or asciiname field

To achieve this kind of query in Elasticsearch we can use a simple Boolean query with several should clauses to match the names and some filters to remove unwanted results.

We can execute it via curl:

    curl -XPOST 'http://localhost:9200/geonames/geoname/_search' -d '{
      "query": {
        "bool": {
          "minimum_should_match": 1,
          "should": [
            { "term": { "name": "moscow" }},
            { "term": { "alternatenames": "moscow" }},
            { "term": { "asciiname": "moscow" }}
          ],
          "filter": [
            { "term": { "fclass": "P" }},
            { "range": { "population": { "gt": 0 }}}
          ]
        }
      },
      "sort": [ { "population": { "order": "desc" }} ]
    }'

We used "moscow" in lowercase because that is the standard token generated for a tokenized string (Elasticsearch text type).

The result will be similar to this one:

    {
      "took": 14,
      "timed_out": false,
      "_shards": { "total": 2, "successful": 2, "failed": 0 },
      "hits": {
        "total": 9,
        "max_score": null,
        "hits": [
          {
            "_index": "geonames",
            "_type": "geoname",
            "_id": "524901",
            "_score": null,
            "_source": {
              "name": "Moscow",
              "location": { "lat": 55.752220153808594, "lon": 37.61555862426758 },
              "latitude": 55.75222,
              "population": 10381222,
              "moddate": "2016-04-13",
              "timezone": "Europe/Moscow",
              "alternatenames": [ "Gorad Maskva", "MOW", "Maeskuy", .... ],
              "country": "RU",
              "admin1": "48",
              "longitude": 37.61556,
              "admin3": null,
              "gtopo30": 144,
              "asciiname": "Moscow",
              "admin4": null,
              "elevation": 0,
              "admin2": null,
              "fcode": "PPLC",
              "fclass": "P",
              "geonameid": 524901,
              "cc2": null
            },
            "sort": [ 10381222 ]
          },

Searching for cities given a location position

We have processed the geonames so that Elasticsearch has a GeoPoint field for each of them. The Elasticsearch GeoPoint field enables a lot of geolocation queries.
One of the most common searches is to find cities near me via a Geo Distance Query. This can be achieved by modifying the above search as follows:

    curl -XPOST 'http://localhost:9200/geonames/geoname/_search' -d '{
      "query": {
        "bool": {
          "filter": [
            {
              "geo_distance": {
                "distance": "100km",
                "location": { "lat": 55.7522201, "lon": 36.6155586 }
              }
            },
            { "term": { "fclass": "P" }},
            { "range": { "population": { "gt": 0 }}}
          ]
        }
      },
      "sort": [ { "population": { "order": "desc" }} ]
    }'

Executing an analytic on the dataset

Having indexed all the geonames, we can check the completeness of our dataset and execute analytics on it. For example, it's useful to check how many geonames there are for a single country, and the feature classes for each of the top countries, to evaluate their distribution. This can be easily achieved using an Elasticsearch aggregation in a single query:

    curl -XPOST 'http://localhost:9200/geonames/geoname/_search' -d '{
      "size": 0,
      "aggs": {
        "geoname_by_country": {
          "terms": { "field": "country", "size": 5 },
          "aggs": {
            "feature_by_country": {
              "terms": { "field": "fclass", "size": 5 }
            }
          }
        }
      }
    }'

The result will be something similar to:

    {
      "took": 477,
      "timed_out": false,
      "_shards": { "total": 2, "successful": 2, "failed": 0 },
      "hits": { "total": 11301974, "max_score": 0, "hits": [ ] },
      "aggregations": {
        "geoname_by_country": {
          "doc_count_error_upper_bound": 113415,
          "sum_other_doc_count": 6787106,
          "buckets": [
            {
              "key": "US",
              "doc_count": 2229464,
              "feature_by_country": {
                "doc_count_error_upper_bound": 0,
                "sum_other_doc_count": 82076,
                "buckets": [
                  { "key": "S", "doc_count": 1140332 },
                  { "key": "H", "doc_count": 506875 },
                  { "key": "T", "doc_count": 225276 },
                  { "key": "P", "doc_count": 192697 },
                  { "key": "L", "doc_count": 79544 }
                ]
              }
            },
    …truncated…

These are simple examples of how to easily index and search data with Elasticsearch. Integrating Elasticsearch with Apache Spark is straightforward: the core part is designing your index and your data model so that you can use them efficiently.
After having correctly indexed your data to cover your use case, Elasticsearch is able to provide your results or analytics within milliseconds.

Summary

In this article, we learned how to easily build a simple search geolocator with Elasticsearch using Apache Spark for ingestion.

Packt
15 Oct 2009
4 min read

Deployment of Reports with BIRT

Everything in this article uses utilities from the BIRT Runtime installation package, available from the BIRT homepage at http://www.eclipse.org/birt.

BIRT Viewer

The BIRT Viewer is a J2EE application that is designed to demonstrate how to implement the Report Engine API to execute reports in an online web application. For most basic uses, such as small to medium size Intranet applications, this is an appropriate approach. The point to keep in mind about the BIRT Web Viewer is that it is an example application. It can be used as a baseline for more sophisticated web applications that will implement the BIRT Report Engine API.

Installation of the BIRT Viewer is documented in a number of places. The Eclipse BIRT website has some great tutorials at:

http://www.eclipse.org/birt/phoenix/deploy/viewerSetup.php
http://wiki.eclipse.org/BIRT/FAQ/Deployment

This is also documented on my website in a series of articles introducing people to BIRT:

http://digiassn.blogspot.com/2005/10/birt-report-server-pt-2.html

I won't go into the details of installing Apache Tomcat as this is covered in depth in other locations, but I will cover how to install the Viewer in a Tomcat environment. For the most part these instructions can be used in other J2EE containers, such as WebSphere. In some cases a WAR package is used instead. I prefer Tomcat because it is a widely used open-source J2EE environment.

Under the BIRT Runtime package is a folder containing an example Web Viewer application. The Web Viewer is a useful application if you require basic report viewing capabilities, such as parameter passing, pagination, and export to formats such as Word, Excel, RTF, and CSV.

For this example, I have Apache Tomcat 5.5 installed into a folder at C:\apache-tomcat-5.5.25. To install the Web Viewer, I simply need to copy the WebViewerExample folder from the BIRT Runtime to the web application folder at C:\apache-tomcat-5.5.25\webapps.
Accessing the BIRT Web Viewer is as simple as calling the WebViewerExample Context. When copying the WebViewerExample folder, you can rename this folder to anything you want. Obviously WebViewerExample is not a good name for an online web application. So in the following screenshot, I renamed the WebViewerExample folder to birtViewer, and am accessing the BIRT Web Viewer test report.

Installing Reports into the Web Viewer

Once the BIRT Viewer is set up, deploying reports is as simple as copying the report design files, Libraries, or report documents into the application's Context, and calling it with the appropriate URL parameters. For example, we will install the reports from the Classic Cars – With Library folder into the BIRT Web Viewer at birtViewer. In order for these reports to work, all dependent Libraries need to be installed with the reports. In the case of the example application, we currently have the report folder set to the Root of the web application folder.

Accessing Reports in the Web Viewer

Accessing reports is as simple as passing the correct parameters to the Web Viewer. In the BIRT Web Viewer, there are seven servlets that you can call to run reports, which are as follows:

frameset
run
preview
download
parameter
document
output

Out of these, you will only need frameset and run, as the other servlets are for Engine-related purposes, such as the preview for the Eclipse designer, the parameter Dialog, and the download of report documents. Of these two servlets, frameset is the one that is typically used for user interaction with reports, as it provides the pagination options, parameter Dialogs, table of contents viewing, and export and print Dialogs. The run servlet only provides report output.

There are a few URL parameters for the BIRT Web Viewer, such as:

__format: the output format, either HTML or PDF.
__isnull: sets a Report Parameter to null; its value is the name of the parameter.
__locale: the report's locale.
__report: the report design file to run.
__document: the report document file to open.

Any remaining URL parameter will be treated as a Report Parameter. In the following image, I am running the Employee_Sales_Percentage.rptdesign file with the startDate and endDate parameters set.
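As a small illustration of how such a URL is put together, here is a Python sketch that assembles one from the servlet name and the parameters described above. The host, port, and context name (`http://localhost:8080/birtViewer`), the `viewer_url` helper, and the two date values are assumptions made for this example; substitute whatever your own deployment and report expect.

```python
from urllib.parse import urlencode


def viewer_url(base, servlet, report, report_params=None, output_format=None):
    """Build a BIRT Web Viewer URL (hypothetical helper for this sketch)."""
    query = [("__report", report)]
    if output_format is not None:
        query.append(("__format", output_format))
    # Anything that is not a viewer parameter is passed as a Report Parameter.
    query.extend((report_params or {}).items())
    return "%s/%s?%s" % (base.rstrip("/"), servlet, urlencode(query))


url = viewer_url(
    "http://localhost:8080/birtViewer",   # assumed host, port, and context name
    "frameset",
    "Employee_Sales_Percentage.rptdesign",
    report_params={"startDate": "2005-01-01", "endDate": "2005-12-31"},  # made-up dates
)
print(url)
# http://localhost:8080/birtViewer/frameset?__report=Employee_Sales_Percentage.rptdesign&startDate=2005-01-01&endDate=2005-12-31
```

Swapping `"frameset"` for `"run"` would request the bare report output instead of the interactive viewer.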

Packt
13 Nov 2013
7 min read

Iteration and Searching Keys

Introducing Sample04 to show you loops and searches

Sample04 uses the same LevelDbHelpers.h as before. Please download the entire sample and look at main04.cpp to see the code in context. Running Sample04 starts by printing the output from the entire database, as shown in the following screenshot:

[Screenshot: Console output of listing keys]

Creating test records with a loop

The test data being used here was created with a simple loop and forms a linked list as well. It is explained in more detail in the Simple Relational Style section. The loop creating the test data uses the C++11 range-based for loop:

    vector<string> words {"Packt", "Packer", "Unpack", "packing", "Packt2", "Alpacca"};
    string prevKey;
    WriteOptions syncW;
    syncW.sync = true;
    WriteBatch wb;
    for (auto key : words) {
      wb.Put(key, prevKey + "\tsome more content");
      prevKey = key;
    }
    // Call Write() outside assert() so it still runs when NDEBUG disables assertions.
    Status s = db->Write(syncW, &wb);
    assert(s.ok());

Note how we're using a string to hang onto the prevKey. There may be a temptation to use a Slice here to refer to the previous value of key, but remember the warning about a Slice only having a data pointer. This would be a classic bug introduced with a Slice pointing to a value that can be changed underneath it!

We're adding all the keys using a WriteBatch not just for consistency, but also so that the storage engine knows it's getting a bunch of updates in one go and can optimize the file writing.

I will be using the term Record regularly from now on. It's easier to say than Key-value Pair and is also indicative of the richer, multi-value data we're storing.

Stepping through all the records with iterators

The model for multiple record reading in LevelDB is a simple iteration. Find a starting point and then step forwards or backwards. This is done with an Iterator object that manages the order and starting point of your stepping through keys and values.
You call methods on Iterator to choose where to start, to step, and to get back the key and value. Each Iterator gets a consistent snapshot of the database, ignoring updates during iteration. Create a new Iterator to see changes.

If you have used declarative database APIs such as SQL-based databases, you would be used to performing a query and then operating on the results. Many of these APIs and older, record-oriented databases have a concept of a cursor, which maintains the current position in the results and which you can only move forward. Some of them allow you to move the cursor to the previous records.

Iterating through individual records may seem clunky and old-fashioned if you are used to getting collections from servers. However, remember LevelDB is a local database. Each step doesn't represent a network operation!

The iterable cursor approach, called an Iterator, is all that LevelDB offers. If you want some way of mapping a collected set of results directly to a listbox or other containers, you will have to implement it on top of the Iterator, as we will see later.

Iterating forwards, we just get an Iterator from our database and jump to the first record with SeekToFirst():

    Iterator* idb = db->NewIterator(ropt);
    for (idb->SeekToFirst(); idb->Valid(); idb->Next())
      cout << idb->key() << endl;

Going backwards is very similar, but inherently less efficient as a storage trade-off:

    for (idb->SeekToLast(); idb->Valid(); idb->Prev())
      cout << idb->key() << endl;

If you wanted to see the value as well as the keys, just use the value() method on the iterator (the test data in Sample04 would make it look a bit confusing so it isn't being done here):

    cout << idb->key() << " " << idb->value() << endl;

Unlike some other programming iterators, there's no concept of a special forward or backward iterator and no obligation to keep going in the same direction. Consider searching an HR database for the ten highest-paid managers.
With a key of Job+Salary, you would iterate through a range until you know you have hit the end of the managers, then iterate backwards to get the last ten.

An iterator is created by NewIterator(), so you have to remember to delete it or it will leak memory. Iteration is over a consistent snapshot of the data, and any data changes made through Put or Delete operations won't show until another NewIterator() is created.

Searching for ranges of keys

The second half of the console output is from our examples of iterating through partial keys, which are case-sensitive by default, with the default BytewiseComparator:

[Screenshot: Console output of searches]

As we've seen many times, the Get function looks for an exact match for a key. However, if you have an Iterator, you can use Seek and it will jump to the first key that either matches exactly or is immediately after the partial key you specify.

If we are just looking for keys with a common prefix, the optimal comparison is using the starts_with method of the Slice class:

    void listKeysStarting(Iterator* idb, const Slice& prefix)
    {
      cout << "List all keys starting with " << prefix.ToString() << endl;
      for (idb->Seek(prefix);
           idb->Valid() && idb->key().starts_with(prefix);
           idb->Next())
        cout << idb->key() << endl;
    }

Going backwards is a little bit more complicated. We seek with a key that is guaranteed not to match. You could think of it as being between the last key starting with our prefix and the next key out of the desired range. When we Seek to that key, we need to step once to the previous key. If that's valid and matching, it's the last key in our range:

    void listBackwardsKeysStarting(Iterator* idb, const Slice& prefix)
    {
      cout << "List all keys starting with " << prefix.ToString() << " backwards " << endl;
      const string keyAfter = prefix.ToString() + "\xFF";
      idb->Seek(keyAfter);
      if (idb->Valid())
        idb->Prev();        // step to last key with actual prefix
      else                  // no key just after our range, but
        idb->SeekToLast();  // maybe the last key matches?
      for (; idb->Valid() && idb->key().starts_with(prefix); idb->Prev())
        cout << idb->key() << endl;
    }

What if you want to get keys within a range? For the first time, I disagree with the documentation included with LevelDB. Their iteration example shows a similar loop to that shown in the following code, but checks the key values with idb->key().ToString() < limit. That is a more expensive way to iterate keys, as it generates a temporary string object for every key being checked, which adds up if there are thousands in the range:

    void listKeysBetween(Iterator* idb, const Slice& startKey, const Slice& endKey)
    {
      cout << "List all keys >= " << startKey.ToString() << " and < " << endKey.ToString() << endl;
      for (idb->Seek(startKey);
           idb->Valid() && idb->key().compare(endKey) < 0;
           idb->Next())
        cout << idb->key() << endl;
    }

We can use another built-in method of Slice, the compare() method, which returns a result <0, 0, or >0 to indicate if the Slice is less than, equal to, or greater than the other Slice it is being compared to. These are the same semantics as the standard C memcmp. The code shown in the previous snippet will find keys that are the same as or after the startKey and are before the endKey. If you want the range to include the endKey, change the comparison to compare(endKey) <= 0.

Summary

In this article, we learned the concept of an iterator in LevelDB as a way to step through records sorted by their keys. The database became far more useful with searches to get the starting point for the iterator, and samples showing how to efficiently check keys as you step through a range.
Packt
08 Oct 2013
10 min read

Creating Network Graphs with Gephi

(For more resources related to this topic, see here.) Basic network graph terminology Network graphs are essentially based on the construct of nodes and edges. Nodes represent points or entities within the data, while edges refer to the connections or lines between nodes. Individual nodes might be students in a school, or schools within an educational system, or perhaps agencies within a government structure. Individual nodes may be represented through equal sizes, but can also be depicted as smaller or larger based on the magnitude of a selected measure. For example, a node with many connections may be portrayed as far larger and thus more influential than a sparsely connected node. This approach will provide viewers with a visual cue that shows them where the highest (and lowest) levels of activity occur within a graph. Nodes will generally be positioned based on the similarity of their individual connections, leading to clusters of individual nodes within a larger network. In most network algorithms, nodes with higher levels of connections will also tend to be positioned near the center of the graph, while those with few connections will move toward the perimeter of the display. Edges are the connections between nodes, and may be displayed as undirected or directed paths. Undirected relationships indicate a connection that flows in both directions, while a directed relationship moves in a single direction that always originates in one node and moves toward another node. Undirected connections will tend to predominate in most cases, such as in social networks where participant activity flows in both directions. On occasion, we will see directed connections, as in the case of some transportation or technology networks where there are connections that flow in a single direction. Edges may also be weighted, to show the strength of the connection between nodes. 
In the case where University A has performed research with both University B and University C, the strength (width) of the edge will show viewers where the stronger relationship exists. If A and B have combined for three projects, while A and C have collaborated on 9 projects, we should weight the A to C connection three times that of the A to B connection. Another commonly used term is neighbors, which is nothing more than a node that is directly connected to a second node. Neighbors can be stated to be one degree apart. Degrees is the term used to refer to the number of connections flowing into (or away from) a node (also known as Degree Centrality), as well as to the number of connections required to connect to another node via the shortest possible path. In complex graphs, you may find nodes that are four, five, or even more degrees away from a distant node, and in some cases two nodes may not be connected at all. Now that you have a very basic understanding of network graph theory, let's learn about some of the common network graph algorithms. Common network graph algorithms Before we introduce you to some specific graph algorithms, we'll briefly discuss some of the theory behind network graphs and introduce you to a few of the terms you will frequently encounter. Network graphs are drawn through positioning nodes and their respective connections relative to one another. In the case of a graph with 8 or 10 nodes, this is a rather simple exercise, and could probably be drawn rather accurately without the help of complex methodologies. However, in the typical case where we have hundreds of nodes with thousands of edges, the task becomes far more complex. 
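To make the terms above concrete, here is a small Python sketch, independent of Gephi, that stores a weighted undirected graph and computes a node's degree and the number of degrees separating two nodes. The node names, the weights (taken from the university example), and the function names are illustrative assumptions only.

```python
from collections import deque

def build_graph(edges, extra_nodes=()):
    """Build an undirected, weighted adjacency map from (a, b, weight) tuples."""
    graph = {node: {} for node in extra_nodes}
    for a, b, weight in edges:
        graph.setdefault(a, {})[b] = weight
        graph.setdefault(b, {})[a] = weight
    return graph

def degree(graph, node):
    """Number of direct connections (neighbors) of a node."""
    return len(graph[node])

def degrees_apart(graph, start, goal):
    """Length of the shortest path between two nodes, or None if unconnected."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph[node]:
            if neighbor == goal:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None

# A-B share 3 projects, A-C share 9; D hangs off C; E is isolated.
GRAPH = build_graph([("A", "B", 3), ("A", "C", 9), ("C", "D", 1)],
                    extra_nodes=["E"])
```

In this toy graph A has two neighbors, B and D are three degrees apart, the A to C edge carries three times the weight of the A to B edge, and E is not connected to anything.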
Some of the more prominent graph classes in Gephi include the following:

- Force-directed algorithms refer to a category of graphs that position elements based on the principles of attraction, repulsion, and gravity
- Circular algorithms position graph elements around the perimeter of one or more circles, and may allow the user to dictate the order of the elements based on data properties
- Dual circle layouts position a subset of nodes in the interior of the graph with the remaining nodes around the perimeter, similar to a single circular graph
- Concentric layouts arrange the graph nodes using an approximately circular graph design, with less connected nodes at the perimeter of the graph and highly connected nodes typically near the center
- Radial axis layouts provide the user with the ability to determine some portion of the graph layout by defining graph attributes

The type of graph you select may well be dictated by the sort of results you seek. If you wish to feature certain groups within your dataset, one of the layouts that allows you to segment the graph by groups will provide a potentially quick solution to your needs. In this instance, one of the circular or radial axis graphs may prove ideal. On the other hand, if you are attempting to discover relationships in a complex new dataset, one of the several available Force-directed layouts is likely a better choice. These algorithms will rely on the values in your dataset to determine the positioning within the graph. When choosing one of these approaches, please note that there will often be an extensive runtime to calculate the graph layout, especially as the data becomes more complex. Even on a powerful computer, examples may run for minutes or hours in an attempt to fully optimize the graph. Fortunately, you will have the ability in Gephi to stop these algorithms at any given point, and you will still have a very reasonable, albeit imperfect graph.
In the next section, we'll look at a few of the standard layouts that are part of the Gephi base package. Standard network graph layouts Now that you are somewhat familiar with the types of layout algorithms, we'll take a look at what Gephi offers within the Layout tab. We'll begin with some common Force-directed approaches, and then examine some of the other choices. One of the best known force algorithms is Force Atlas, which in Gephi provides users with multiple options for drawing the graph. Foremost among these settings are Repulsion, Attraction, and Gravity settings. Repulsion strength adjustments will make individual nodes either more or less sensitive to other nodes they differ from. A higher repulsion level, for example, will push these nodes further apart. Conversely, setting the Attraction strength higher will force related nodes closer together. Finally, the Gravity setting will draw nodes closer to the center of the graph if it is set to a high level, and disperse them toward the edges if a very low value is set. Force Atlas 2 is another layout option that employs slightly different parameters than the original Force Atlas method. You may wish to compare these methods and determine which one gives you better results. Fruchterman Reingold is one more Force method; albeit one that provides you with just three parameters – Area, Gravity, and Speed. While the general approach is not unlike the Force Atlas algorithms, your results will appear different in a visual sense. Finally, Gephi provides three Yifan Hu methods. Each of these models—Yifan Hu, Yifan Hu Proportional, and Yifan Hu Multilevel, are likely to run much more rapidly than the methods discussed earlier, while providing generally similar results. Gephi also provides a variety of methods that do not employ the force approach. Some of the models, as we noted earlier in this article, provide you with more control over the final result. 
This may be the result of selecting how to order the nodes, or of which attributes to use in grouping nodes together, either through color or location. In the section above, I referenced several layout options, but in the interest of space we'll take a closer look at two of them—the Circular and Radial Axis layouts. Circular layouts are well suited to relatively small networks, given the limited flexibility of their fixed layout. We can adjust this to some degree by specifying the diameter of the graph, but anything more than a few dozen well-connected nodes often becomes difficult to manage. However, with smaller networks, these layouts can be intriguing, providing us with the ability to see patterns within and between specific groups more easily than we might see them in some other layouts. While this article will not cover any filtering options, those too can be used to help us better utilize the circular layouts, by providing us with the ability to highlight specific groups and their resulting connections. Think of the circle resembling a giant spider web filled with connections, and the filters as tools that help us see specific threads within the web. Our final notes are on Radial Axis layouts, which can provide us with fascinating looks at our data, especially if there are natural groups within the network. Think of a school with several classrooms full of students, for example. Each of these classrooms can be easily identified and grouped, perhaps by color. In a complex force directed graph we may be able to spot each of these groups, but it may become difficult due to the interactions with other classes. In a Radial Axis layout we can dictate the group definitions, forcing each group to be bound together, apart from any other groups. There are pros and cons to this approach, of course, as there are with any of the other methods. 
If we wish to understand how a specific group interacts with another group, this method can prove beneficial, as it isolates these groups visually, making it easier to see connections between them. On the negative side, it is often quite difficult to see connections between members within the group, due to the nature of the layout. As with any layout, it is critical to look at the results and see how they apply to our original need. Always test your data using multiple layout algorithms, so that you wind up with the best possible approach.

Summary
Gephi is an ideal tool for users new to network graph analysis and visualization, as it provides a rich set of tools to create and customize network graphs. The user interface makes it easy to understand basic concepts such as nodes and edges, as well as descriptive terminology such as neighbors, degrees, repulsion, and attraction. New users can move as slowly or as rapidly as they wish, given Gephi's gentle learning curve. Gephi can also help you see and understand patterns within your data through a variety of sophisticated graph methods that will appeal to both novice and seasoned users. The variety of layout algorithms gives you the opportunity to experiment with multiple layouts as you search for the best approach to display your data. In short, Gephi provides everything needed to produce first-rate network visualizations.
Packt
25 Sep 2010
10 min read

Disaster Recovery in MySQL for Python

From the book MySQL for Python.

The purpose of the archiving methods covered in this article is to allow you, as the developer, to back up databases that you use for your work without having to rely on the database administrator. As noted later in the article, there are more sophisticated methods for backups than we cover here, but they involve system-administrative tasks that are beyond the remit of any development post and are thus beyond the scope of this article.

Every database needs a backup plan
When archiving a database, one of the critical questions that must be answered is how to take a snapshot backup of the database without having users change the data in the process. If data changes in the midst of the backup, it results in an inconsistent backup and compromises the integrity of the archive. There are two strategies for backing up a database system:

- Offline backups
- Live backups

Which you use depends on the dynamics of the system in question and the importance of the data being stored. In this article, we will look at each in turn and the way to implement them.

Offline backups
Offline backups are done by shutting down the server so the records can be archived without the fear of them being changed by the user. Shutting down first also helps to ensure that the server has stopped gracefully and that errors were avoided.
The problem with using this method on most production systems is that it necessitates a temporary loss of access to the service. For most service providers, such a consequence is anathema to the business model. The value of this method is that one can be certain that the database has not changed at all while the backup is run. Further, in many cases, the backup is performed faster because the processor is not simultaneously serving data. For this reason, offline backups are usually performed in controlled environments or in situations where disruption is not critical to the user. These include internal databases, where administrators can inform all users about the disruption ahead of time, and small business websites that do not receive a lot of traffic. Offline backups also have the benefit that the backup is usually held in a single file. This can then be used to copy a database across hosts with relative ease. Shutting down a server obviously requires system administrator-like authority. So creating an offline backup relies on the system administrator shutting down the server. If your responsibilities include database administration, you will also have sufficient permission to shut down the server. Live backups Live backups occur while the server continues to accept queries from users, while it's still online. It functions by locking down the tables so no new data may be written to them. Users usually do not lose access to the data and the integrity of the archive, for a particular point in time is assured. Live backups are used by large, data-intensive sites such as Nokia's Ovi services and Google's web services. However, because they do not always require administrator access of the server itself, these tend to suit the backup needs of a development project. 
Choosing a backup method
After having determined whether a database can be stopped for the backup, a developer can choose from three methods of archiving:

- Copying the data files (including administrative files such as logs and tablespaces)
- Exporting delimited text files
- Backing up with command-line programs

Which you choose depends on what permissions you have on the server and how you are accessing the data. MySQL also allows for two other forms of backup: using the binary log and setting up replication (using the master and slave servers). To be sure, these are the best ways to back up a MySQL database. However, both of these are administrative tasks and require system-administrator authority; they are not typically available to a developer. You can read more about them in the MySQL documentation. Use of the binary log for incremental backups is documented at: http://dev.mysql.com/doc/refman/5.5/en/point-in-time-recovery.html Setting up replication is further dealt with at: http://dev.mysql.com/doc/refman/5.5/en/replication-solutions-backups.html

Copying the table files
The most direct way to back up database files is to copy from where MySQL stores the database itself. This will naturally vary based on platform. If you are unsure about which directory holds the MySQL database files, you can query MySQL itself to check:

    mysql> SHOW VARIABLES LIKE 'datadir';

Alternatively, the following shell command will give you the same information:

    $ mysqladmin variables | grep datadir
    | datadir | /var/lib/mysql/ |

Note that the locations of administrative files, such as binary logs and InnoDB tablespaces, are customizable and may not be in the data directory.
If you do not have direct access to the MySQL server, you can also write a simple Python program to get the information:

    #!/usr/bin/env python
    import MySQLdb

    mydb = MySQLdb.connect('<hostname>', '<user>', '<password>')
    cursor = mydb.cursor()
    cursor.execute("SHOW VARIABLES LIKE 'datadir'")
    # fetchall() consumes the result set, so print from its return value.
    results = cursor.fetchall()
    for name, value in results:
        print("%s: %s" % (name, value))

Slight alteration of this program will also allow you to query several servers automatically. Simply change the login details and adapt the output to clarify which data is associated with which results.

Locking and flushing
If you are backing up an offline MyISAM system, you can copy any of the files once the server has been stopped. Before backing up a live system, however, you must lock the tables and flush the log files in order to get a consistent backup at a specific point. These tasks are handled by the LOCK TABLES and FLUSH commands respectively. When you use MySQL and its ancillary programs (such as mysqldump) to perform a backup, these tasks are performed automatically. When copying files directly, you must ensure both are done. How you apply them depends on whether you are backing up an entire database or a single table.

LOCK TABLES
The LOCK TABLES command secures a specified table in a designated way. Tables can be referenced with aliases using AS and can be locked for reading or writing. For our purposes, we need only a read lock to create a backup. The syntax looks like this:

    LOCK TABLES <tablename> READ;

This command requires two privileges: LOCK TABLES and SELECT. It must be noted that LOCK TABLES does not lock all tables in a database, only the tables you name. This is useful for performing smaller backups that will not interrupt services or put too severe a strain on the server. However, unless you automate the process, manually locking and unlocking tables as you back up data can be ridiculously inefficient.

FLUSH
The FLUSH command is used to reset MySQL's caches.
By re-initiating the cache at the point of backup, we get a clear point of demarcation for the database backup both in the database itself and in the logs. The basic syntax is straightforward, as follows: FLUSH <the object to be reset>; Use of FLUSH presupposes the RELOAD privilege for all relevant databases. What we reload depends on the process we are performing. For the purpose of backing up, we will always be flushing tables: FLUSH TABLES; How we "flush" the tables will depend on whether we have already used the LOCK TABLES command to lock the table. If we have already locked a given table, we can call FLUSH for that specific table: FLUSH TABLES <tablename>; However, if we want to copy an entire database, we can bypass the LOCK TABLES command by incorporating the same call into FLUSH: FLUSH TABLES WITH READ LOCK; This use of FLUSH applies across the database, and all tables will be subject to the read lock. If the account accessing the database does not have sufficient privileges for all databases, an error will be thrown. Unlocking the tables Once you have copied the files for a backup, you need to remove the read lock you imposed earlier. This is done by releasing all locks for the current session: UNLOCK TABLES; Restoring the data Restoring copies of the actual storage files is as simple as copying them back into place. This is best done when MySQL has stopped, lest you risk corruption. Similarly, if you have a separate MySQL server and want to transfer a database, you simply need to copy the directory structure from the one server to another. On restarting, MySQL will see the new database and treat it as if it had been created natively. When restoring the original data files, it is critical to ensure the permissions on the files and directories are appropriate and match those of the other MySQL databases. Delimited backups within MySQL MySQL allows for exporting of data from the MySQL command line. 
To do so, we simply direct the output from a SELECT statement to an output file. Using SELECT INTO OUTFILE to export data Using sakila, we can save the data from film to a file called film.data as follows: SELECT * INTO OUTFILE 'film.data' FROM film; This results in the data being written in a tab-delimited format. The file will be written to the directory in which MySQL stores the sakila data. Therefore, the account under which the SELECT statement is executed must have the FILE privilege for writing the file as well as login access on the server to view it or retrieve it. The OUTFILE option on SELECT can be used to write to any place on the server that MySQL has write permission to use. One simply needs to prepend that directory location to the file name. For example, to write the same file to the /tmp directory on a Unix system, use: SELECT * INTO OUTFILE '/tmp/film.data' FROM film; Windows simply requires adjustment of the directory structure accordingly. Using LOAD DATA INFILE to import data If you have an output file or similar tab-delimited file and want to load it into MySQL, use the LOAD DATA INFILE command. The basic syntax is: LOAD DATA INFILE '<filename>' INTO TABLE <tablename>; For example, to import the film.data file from the /tmp directory into another table called film2, we would issue this command: LOAD DATA INFILE '/tmp/film.data' INTO TABLE film2; Note that LOAD DATA INFILE presupposes the creation of the table into which the data is being loaded. In the preceding example, if film2 had not been created, we would receive an error. If you are trying to mirror a table, remember to use the SHOW CREATE TABLE query to save yourself time in formulating the CREATE statement. This discussion only touches on how to use LOAD DATA INFILE for inputting data created with the OUTFILE option of SELECT. But, the command handles text files with just about any set of delimiters. 
To read more on how to use it for other file formats, see the MySQL documentation at: http://dev.mysql.com/doc/refman/5.5/en/load-data.html
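As a rough illustration of the default export format, the following Python sketch reads the text that SELECT ... INTO OUTFILE writes when no delimiter options are given: tab-separated fields, one row per line, with NULL written as \N. It assumes that no field contains an escaped tab or newline, and the function names are hypothetical helpers rather than part of any MySQL library.

```python
def parse_outfile_line(line):
    """Split one exported row into fields, mapping \\N to None.

    Assumes the default OUTFILE format and that no field contains an
    escaped tab or newline (this sketch does not undo backslash escapes).
    """
    fields = line.rstrip("\n").split("\t")
    return [None if field == r"\N" else field for field in fields]

def parse_outfile_text(text):
    """Parse the full contents of an exported file into a list of rows."""
    return [parse_outfile_line(line) for line in text.splitlines() if line]
```

For a file such as /tmp/film.data that has been copied somewhere readable, you could pass the result of open(path).read() to parse_outfile_text. All values come back as strings (or None), so numeric columns still need converting.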
Packt
14 Sep 2015
10 min read

Understanding Model-based Clustering

 In this article by Ashish Gupta, author of the book, Rapid – Apache Mahout Clustering Designs, we will discuss a model-based clustering algorithm. Model-based clustering is used to overcome some of the deficiencies that can occur in K-means or Fuzzy K-means algorithms. We will discuss the following topics in this article: Learning model-based clustering Understanding Dirichlet clustering Understanding topic modeling (For more resources related to this topic, see here.) Learning model-based clustering In model-based clustering, we assume that data is generated by a model and try to get the model from the data. The right model will fit the data better than other models. In the K-means algorithm, we provide the initial set of cluster, and K-means provides us with the data points in the clusters. Think about a case where clusters are not distributed normally, then the improvement of a cluster will not be good using K-means. In this scenario, the model-based clustering algorithm will do the job. Another idea you can think of when dividing the clusters is—hierarchical clustering—and we need to find out the overlapping information. This situation will also be covered by model-based clustering algorithms. If all components are not well separated, a cluster can consist of multiple mixture components. In simple terms, in model-based clustering, data is a mixture of two or more components. Each component has an associated probability and is described by a density function. Model-based clustering can capture the hierarchy and the overlap of the clusters at the same time. Partitions are determined by an EM (expectation-maximization) algorithm for maximum likelihood. The generated models are compared by a Bayesian Information criterion (BIC). The model with the lowest BIC is preferred. In the equation BIC = -2 log(L) + mlog(n), L is the likelihood function and m is the number of free parameters to be estimated. n is the number of data points. 
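The BIC formula above is simple to compute directly. The following Python sketch evaluates it for two hypothetical fitted models; the log-likelihood values and parameter counts are made-up numbers for illustration, not results from any real dataset.

```python
import math

def bic(log_likelihood, num_params, num_points):
    """BIC = -2 log(L) + m log(n), using the natural logarithm."""
    return -2.0 * log_likelihood + num_params * math.log(num_points)

# Two hypothetical models fitted to the same 100 data points.
simple_model = bic(log_likelihood=-120.0, num_params=3, num_points=100)
complex_model = bic(log_likelihood=-115.0, num_params=8, num_points=100)

# The model with the lowest BIC is preferred.
preferred = "simple" if simple_model < complex_model else "complex"
```

Here the more complex model fits slightly better (higher likelihood), but the m log(n) penalty for its five extra parameters outweighs that gain, so the simpler model has the lower BIC.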
Understanding Dirichlet clustering
Dirichlet clustering is a model-based clustering method. This algorithm is used to understand the data and cluster the data. Dirichlet clustering is a process of nonparametric and Bayesian modeling. It is nonparametric because it can have an infinite number of parameters. Dirichlet clustering is based on the Dirichlet distribution. For this algorithm, we have a probabilistic mixture of a number of models that are used to explain data. Each data point comes from one of the available models. The models are sampled from a prior distribution of models, and points are assigned to these models iteratively. In each iteration, the probability that a point was generated by a particular model is calculated. After the points are assigned to a model, new parameters for each of the models are sampled. This sample is from the posterior distribution of the model parameters, and it considers all the observed data points assigned to the model. This sampling provides more information than normal clustering:

- As we are assigning points to different models, we can find out how many models are supported by the data.
- We can also see how well the data is described by a model and how two points are explained by the same model.

Topic modeling
In machine learning, topic modeling means finding the topics of a text document using a statistical model. A document on a particular topic tends to contain particular words. For example, if you are reading an article on sports, there is a high chance that you will find words such as football, baseball, Formula One, and Olympics. So a topic model uncovers the hidden sense of an article or a document. Topic models are algorithms that can discover the main themes in a large set of unstructured documents. They uncover the semantic structure of the text. Topic modeling enables us to organize large-scale electronic archives.
Mahout has an implementation of one of the topic modeling algorithms: Latent Dirichlet Allocation (LDA). LDA is a statistical model of a document collection that tries to capture the intuition of the documents. In normal clustering algorithms, if words having the same meaning don't occur together, then the algorithm will not associate them, but LDA can find out which two words are used in a similar context, and LDA is better than other algorithms at finding associations in this way. LDA is a generative, probabilistic model. It is generative because the model is tweaked to fit the data, and using the parameters of the model, we can generate the data on which it fits. It is probabilistic because each topic is modeled as an infinite mixture over an underlying set of topic probabilities. The topic probabilities provide an explicit representation of a document. Graphically, an LDA model can be represented as follows:

[Figure: plate diagram of the LDA model]

The notation used in this image represents the following:

- M, N, and K represent the number of documents, the number of words in the document, and the number of topics respectively.
- α is the prior weight of topic k in a document.
- β is the prior weight of word w in a topic.
- φ is the probability of a word occurring in a topic.
- Θ is the topic distribution.
- z is the identity of the topic of all the words in all the documents.
- w is the identity of all the words in all the documents.

How LDA works in a map-reduce mode
These are the steps that LDA follows in the mapper and reducer phases:

- Mapper phase: The program starts with an empty topic model. All the documents are read by different mappers. The probabilities of each topic for each word in the document are calculated.
- Reducer phase: The reducer receives the counts of probabilities. These counts are summed and the model is normalized.

This process is iterative: in each iteration the sum of the probabilities is calculated, and the process stops when it stops changing.
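The shape of the mapper and reducer phases can be sketched in a few lines of Python. This is a deliberately simplified toy and not Mahout's CVB implementation: it ignores the document-topic proportions and the priors, starts from a hypothetical non-empty model, and only shows per-word topic probabilities being computed per document, then summed and normalized. All names and numbers are illustrative assumptions.

```python
from collections import defaultdict

def mapper(doc_words, model):
    """For each word, split one count across topics in proportion to the
    word's current probability under each topic (toy expectation step)."""
    counts = defaultdict(float)
    for word in doc_words:
        # A tiny default keeps unseen words from producing a zero total.
        weights = [topic.get(word, 1e-9) for topic in model]
        total = sum(weights)
        for k, weight in enumerate(weights):
            counts[(k, word)] += weight / total
    return counts

def reducer(all_counts, num_topics):
    """Sum the per-document counts and normalize each topic to sum to 1."""
    summed = [defaultdict(float) for _ in range(num_topics)]
    for counts in all_counts:
        for (k, word), count in counts.items():
            summed[k][word] += count
    model = []
    for topic in summed:
        total = sum(topic.values())
        model.append({w: c / total for w, c in topic.items()} if total else {})
    return model

# Hypothetical starting model (two topics) and two tiny documents.
START_MODEL = [{"ball": 0.6, "vote": 0.4}, {"ball": 0.3, "vote": 0.7}]
DOCS = [["ball", "ball", "vote"], ["vote", "vote"]]
NEW_MODEL = reducer([mapper(doc, START_MODEL) for doc in DOCS], num_topics=2)
```

Repeating the mapper and reducer calls with NEW_MODEL as the input, and stopping once the change between successive models is small enough, gives the iterative loop described above.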
A parameter, similar to the convergence threshold in K-means, is set to check for changes. In the end, LDA estimates how well the model fits the data. In Mahout, the Collapsed Variational Bayes (CVB) algorithm is implemented for LDA. LDA uses a term frequency vector as an input and not tf-idf vectors. We need to take care of two parameters while running the LDA algorithm: the number of topics and the number of words in the documents. A higher number of topics will provide very low-level topics, while a lower number will provide a generalized topic at a high level, such as sports. In Mahout, mean field variational inference is used to estimate the model. It is similar to expectation-maximization of hierarchical Bayesian models. An expectation step reads each document and calculates the probability of each topic for each word in every document. The maximization step takes the counts, sums all the probabilities, and normalizes them.

Running LDA using Mahout
To run LDA using Mahout, we will use the 20 Newsgroups dataset. We will convert the corpus to vectors, run LDA on these vectors, and get the resultant topics. Let's run this example to view how topic modeling works in Mahout.

Dataset selection
We will use the 20 Newsgroups dataset for this exercise. Download the 20news-bydate.tar.gz dataset from http://qwone.com/~jason/20Newsgroups/.

Steps to execute CVB (LDA)
Perform the following steps to execute the CVB algorithm:

Create a 20newsdata directory and unzip the data there:

    mkdir /tmp/20newsdata
    cd /tmp/20newsdata
    tar -xzvf /tmp/20news-bydate.tar.gz

There are two folders under 20newsdata: 20news-bydate-test and 20news-bydate-train. Now, create another directory, 20newsdataall, and merge both the training and test data of the group:

    mkdir /tmp/20newsdataall
    cp -R /tmp/20newsdata/*/* /tmp/20newsdataall

Create a directory in Hadoop and save this data in HDFS:

    hadoop fs -mkdir /user/hue/20newsdata
    hadoop fs -put /tmp/20newsdataall /user/hue/20newsdata

Mahout CVB accepts data in the vector format. For this, first generate a sequence file from the directory as follows:

    bin/mahout seqdirectory -i /user/hue/20newsdata/20newsdataall -o /user/hue/20newsdataseq-out

Convert the sequence file to a sparse vector but, as discussed earlier, using the term frequency weight:

    bin/mahout seq2sparse -i /user/hue/20newsdataseq-out/part-m-00000 -o /user/hue/20newsdatavec -lnorm -nv -wt tf

Convert the sparse vector to the input form required by the CVB algorithm:

    bin/mahout rowid -i /user/hue/20newsdatavec/tf-vectors -o /user/hue/20newsmatrix

Run the CVB algorithm on the converted input:

    bin/mahout cvb -i /user/hue/20newsmatrix/matrix -o /user/hue/ldaoutput -k 10 -x 20 -dict /user/hue/20newsdatavec/dictionary.file-0 -dt /user/hue/ldatopics -mt /user/hue/ldamodel

The parameters used in the preceding command can be explained as follows:

    -i: This is the input path of the document vector
    -o: This is the output path of the topic term distribution
    -k: This is the number of latent topics
    -x: This is the maximum number of iterations
    -dict: This is the term dictionary file
    -dt: This is the output path of the document-topic distribution
    -mt: This is the model state path after each iteration

[Figure: console output while the cvb command runs and after it finishes]

To view the output, run the following command:

    bin/mahout vectordump -i /user/hue/ldaoutput/ -d /user/hue/20newsdatavec/dictionary.file-0 -dt sequencefile -vs 10 -sort true -o /tmp/lda-output.txt

The parameters used in the preceding command can be explained as follows:

    -i: This is the input location of the CVB output
    -d: This is the dictionary file location created during vector creation
    -dt: This is the dictionary file type (sequence or text)
    -vs: This is the vector size
    -sort: This is the flag to sort the output (true or false)
    -o: This is the output location on the local filesystem

Now your output will be saved in the local filesystem. Open the file and you will see each term listed with its probability:

[Figure: sample contents of /tmp/lda-output.txt]

Summary
In this article, we learned about model-based clustering, the Dirichlet process, and topic modeling. In model-based clustering, we tried to obtain the model from the data, while the Dirichlet process is used to understand the data. Topic modeling helps us to identify the topics in an article or in a set of documents. We discussed how Mahout has implemented topic modeling using Latent Dirichlet Allocation and how it is implemented in map reduce. We discussed how to use Mahout to find out the topic distribution on a set of documents.

Packt
13 Jan 2011
17 min read

Regular Expressions in Python 2.6 Text Processing

Python 2.6 Text Processing: Beginners Guide

Simple string matching

Regular expressions are notoriously hard to read, especially if you're not familiar with the obscure syntax. For that reason, let's start simple and look at some easy regular expressions at the most basic level. Before we begin, remember that Python raw strings allow us to include backslashes without the need for additional escaping. Whenever you define regular expressions, you should do so using the raw string syntax.

Time for action – testing an HTTP URL

In this example, we'll check values as they're entered via the command line as a means to introduce the technology. We'll dive deeper into regular expressions as we move forward. We'll be scanning URLs to ensure our end users entered valid data.

Create a new file and name it url_regex.py. Enter the following code:

import sys
import re

# Make sure we have a single URL argument.
if len(sys.argv) != 2:
    print >>sys.stderr, "URL Required"
    sys.exit(-1)

# Easier access.
url = sys.argv[1]

# Ensure we were passed a somewhat valid URL.
# This is a superficial test.
if re.match(r'^https?:/{2}\w.+$', url):
    print "This looks valid"
else:
    print "This looks invalid"

Now, run the example script on the command line a few times, passing a different value to it each time.

(text_processing)$ python url_regex.py http://www.jmcneil.net
This looks valid
(text_processing)$ python url_regex.py http://intranet
This looks valid
(text_processing)$ python url_regex.py http://www.packtpub.com
This looks valid
(text_processing)$ python url_regex.py https://store
This looks valid
(text_processing)$ python url_regex.py httpsstore
This looks invalid
(text_processing)$ python url_regex.py https:??store
This looks invalid
(text_processing)$

What just happened?

We took a look at a very simple pattern and introduced you to the plumbing needed to perform a match test. Let's walk through this little example, skipping the boilerplate code.
First of all, we imported the re module. The re module, as you probably inferred from the name, contains all of Python's regular expression support. Any time you need to work with regular expressions, you'll need to import the re module.

Next, we read a URL from the command line and bind it to a temporary attribute, which makes for cleaner code. Directly below that, you should notice a line that reads re.match(r'^https?:/{2}\w.+$', url). This line checks whether the string referenced by the url attribute matches the ^https?:/{2}\w.+$ pattern. If a match is found, we print a success message; otherwise, the end user receives some negative feedback indicating that the input value is incorrect.

This example leaves out a lot of details regarding HTTP URL formats. If you were performing validation on user input, one place to look would be http://formencode.org/. FormEncode is an HTML form-processing and data-validation framework written by Ian Bicking.

Understanding the match function

The most basic method of testing for a match is via the re.match function, as we did in the previous example. The match function takes a regular expression pattern and a string value. For example, consider the following snippet of code:

Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> import re
>>> re.match(r'pattern', 'pattern')
<_sre.SRE_Match object at 0x1004811d0>
>>>

Here, we simply passed a regular expression of "pattern" and a string literal of "pattern" to the re.match function. As they were identical, the result was a match. The returned Match object indicates the match was successful. The re.match function returns None otherwise.

>>> re.match(r'pattern', 'failure')
>>>

Learning basic syntax

A regular expression is generally a collection of literal string data and special metacharacters that represents a pattern of text.
The simplest regular expression is just literal text that only matches itself. In addition to literal text, there are a series of special characters that can be used to convey additional meaning, such as repetition, sets, wildcards, and anchors. Generally, the punctuation characters field this responsibility.

Detecting repetition

When building up expressions, it's useful to be able to match certain repeating patterns without needing to duplicate values. It's also beneficial to perform conditional matches. This lets us check for content such as "match the letter a, followed by the number one at least three times, but no more than seven times." For example, the code below does just that:

Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> import re
>>> re.match(r'^a1{3,7}$', 'a1111111')
<_sre.SRE_Match object at 0x100481648>
>>> re.match(r'^a1{3,7}$', '1111111')
>>>

If the repetition operator follows a valid regular expression enclosed in parentheses, it will perform repetition on that entire expression. For example:

>>> re.match(r'^(a1){3,7}$', 'a1a1a1')
<_sre.SRE_Match object at 0x100493918>
>>> re.match(r'^(a1){3,7}$', 'a11111')
>>>

The following table details all of the special characters that can be used for marking repeating values within a regular expression.

Specifying character sets and classes

In some circumstances, it's useful to collect groups of characters into a set such that any of the values in the set will trigger a match. It's also useful to match any character at all. The dot operator does just that.

A character set is enclosed within standard square brackets. A set defines a series of alternating (or) entities that will match a given text value. If the first character within a set is a caret (^), then a negation is performed. All characters not defined by that set would then match.
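As a small illustration of sets, negation, and the dot, here is a minimal sketch. It is written for Python 3 rather than the Python 2.6 used in the listings, and the sample strings and the helper name matches are made up for this example.

```python
import re

def matches(pattern, text):
    """Hypothetical helper: True if the pattern matches at the start of text."""
    return re.match(pattern, text) is not None

# A set matches any single character listed inside the brackets.
print(matches(r'^gr[ae]y$', 'grey'))    # True
print(matches(r'^gr[ae]y$', 'groy'))    # False

# A leading caret negates the set: any character except a or e.
print(matches(r'^gr[^ae]y$', 'groy'))   # True
print(matches(r'^gr[^ae]y$', 'gray'))   # False

# The dot matches any single character except a newline.
print(matches(r'^gr.y$', 'gr#y'))       # True
print(matches(r'^gr.y$', 'gr\ny'))      # False
```

Wrapping re.match in a boolean helper keeps the comparison readable; in real code you would usually keep the Match object so that its groups remain available.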
There are a couple of additional interesting set properties.

For ranged values, it's possible to specify an entire selection using a hyphen. For example, '[0-6a-d]' would match all values between 0 and 6, and a and d.
Special characters listed within brackets lose their special meaning. The exceptions to this rule are the hyphen and the closing bracket.

If you need to include a closing bracket or a hyphen within a set, you can either place them as the first elements in the set or escape them by preceding them with a backslash.

As an example, consider the following snippet, which matches a string containing a hexadecimal number.

Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> import re
>>> re.match(r'^0x[a-f0-9]+$', '0xff')
<_sre.SRE_Match object at 0x100481648>
>>> re.match(r'^0x[a-f0-9]+$', '0x01')
<_sre.SRE_Match object at 0x1004816b0>
>>> re.match(r'^0x[a-f0-9]+$', '0xz')
>>>

In addition to the bracket notation, Python ships with some predefined classes. Generally, these are letter values prefixed with a backslash escape. When they appear within a set, the set includes all values that they match. The \d escape matches all digit values. It would have been possible to write the above example in a slightly more compact manner.

>>> re.match(r'^0x[a-f\d]+$', '0x33')
<_sre.SRE_Match object at 0x100481648>
>>> re.match(r'^0x[a-f\d]+$', '0x3f')
<_sre.SRE_Match object at 0x1004816b0>
>>>

The following table outlines all of the character sets and classes available:

One thing that should become apparent is that lowercase classes are matches whereas their uppercase counterparts are the inverse.

Applying anchors to restrict matches

There are times when it's important that patterns match at a certain position within a string of text. Why is this important? Consider a simple number validation test.
If a user enters a digit, but mistakenly includes a trailing letter, an expression checking for the existence of a digit alone will pass.

Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> import re
>>> re.match(r'\d', '1f')
<_sre.SRE_Match object at 0x1004811d0>
>>>

Well, that's unexpected. The regular expression engine sees the leading '1' and considers it a match. It disregards the rest of the string as we've not instructed it to do anything else with it. To fix the problem that we have just seen, we need to apply anchors.

>>> re.match(r'^\d$', '6')
<_sre.SRE_Match object at 0x100481648>
>>> re.match(r'^\d$', '6f')
>>>

Now, attempting to sneak in a non-digit character results in no match. By preceding our expression with a caret (^) and terminating it with a dollar sign ($), we effectively said "between the start and the end of this string, there can only be one digit."

Anchors, among various other metacharacters, are considered zero-width matches. Basically, this means that a match doesn't advance the regular expression engine within the test string. We're not limited to either end of a string, though. Here's a collection of all of the available anchors provided by Python.

Wrapping it up

Now that we've covered the basics of regular expression syntax, let's double back and take a look at the expression we used in our first example. It might be a bit easier if we break it down a bit more with a diagram.

Now that we've provided a bit of background, this pattern should make sense. We begin the regular expression with a caret, which matches the beginning of the string. The very next element is the literal http. As our caret matches the start of a string and must be immediately followed by http, this is equivalent to saying that our string must start with http. Next, we include a question mark after the s in https.
The question mark states that the previous entity should be matched either zero times or one time. By default, the evaluation engine is looking character-by-character, so the previous entity in this case is simply "s." We do this so our test passes for both secure and non-secure addresses.

As we advance forward in our string, the next special term we run into is {2}, and it follows a simple forward slash. This says that the forward slash should appear exactly two times. Now, in the real world, it would probably make more sense to simply type the second slash. Using the repetition check like this not only requires more typing, but it also causes the regular expression engine to work harder.

Immediately after the repetition match, we include a \w. The \w, if you'll remember from the previous tables, expands to [0-9a-zA-Z_], or any word character. This is to ensure that our URL doesn't begin with a special character.

The dot character after the \w matches anything except a newline. Essentially, we're saying "match anything else, we don't so much care." The plus sign states that the preceding wildcard should match at least once.

Finally, we're anchoring the end of the string. However, in this example, this isn't really necessary.

Have a go hero – tidying up our URL test

There are a few intentional inconsistencies and problems with this regular expression as designed. To name a few:

Properly formatted URLs should only contain a few special characters. Other values should be URL-encoded using percent escapes. This regular expression doesn't check for that.
It's possible to include newline characters towards the end of the URL, which is clearly not supported by any browsers!
The \w followed by the .+ implicitly sets a minimum limit of two characters after the protocol specification. A single letter is perfectly valid.

You guessed it. Using what we've covered thus far, it should be possible for you to backtrack and update our regular expression in order to fix these flaws.
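If you would like something to compare your own attempt against, here is one possible tightening of the pattern. It is only a sketch under stated assumptions: it targets Python 3, the names URL_RE and looks_like_url are hypothetical, and the set of permitted characters is an illustrative subset rather than a complete URL grammar.

```python
import re

# Hypothetical tightened pattern (an assumption, not the book's solution):
#   \w                 first character after the scheme is a word character
#   (?:...)*           zero or more further characters, so one letter passes
#   %[0-9A-Fa-f]{2}    a percent sign must be followed by two hex digits
#   \Z                 matches only at the very end, so a trailing newline fails
URL_RE = re.compile(
    r"https?://\w(?:[\w\-.~/:?#@!$&'()*+,;=]|%[0-9A-Fa-f]{2})*\Z",
    re.ASCII,
)

def looks_like_url(value):
    """Return True if value passes this superficial URL check."""
    return URL_RE.match(value) is not None

for candidate in ['http://a', 'https://store', 'http://www.packtpub.com\n',
                  'http://bad host', 'http://host/a%20b', 'http://host/a%2']:
    print(repr(candidate), looks_like_url(candidate))
```

The \Z anchor is used instead of $ because $ also matches just before a trailing newline, which is exactly the loophole described in the list above.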
For more information on what characters are allowed, have a look at http://www.w3schools.com/tags/ref_urlencode.asp.

Advanced pattern matching

In addition to basic pattern matching, regular expressions let us handle some more advanced situations as well. It's possible to group characters for purposes of precedence and reference, perform conditional checks based on what exists later, or previously, in a string, and limit exactly how much of a match actually constitutes a match. Don't worry; we'll clarify that last phrase as we move on. Let's go!

Grouping

When crafting a regular expression string, there are generally two reasons you would wish to group expression components together: to control entity precedence or to enable access to matched parts later in your application.

Time for action – regular expression grouping

In this example, we'll return to our LogProcessing application. Here, we'll update our log split routines to divide lines up via a regular expression as opposed to simple string manipulation.

In core.py, add an import re statement to the top of the file. This makes the regular expression engine available to us.

Directly above the __init__ method definition for LogProcessor, add the following lines of code. These have been split to avoid wrapping.

_re = re.compile(
    r'^([\d.]+) (\S+) (\S+) \[([\w/:+ ]+)\] "(.+?)" '
    r'(?P<rcode>\d{3}) (\S+) "(\S+)" "(.+)"')

Now, we're going to replace the split method with one that takes advantage of the new regular expression:

def split(self, line):
    """
    Split a logfile.

    Uses a simple regular expression to parse out the Apache logfile
    entries.
    """
    line = line.strip()
    match = re.match(self._re, line)
    if not match:
        raise ParsingError("Malformed line: " + line)
    return {
        # Group 7 is the response size; group 6 is the named rcode group.
        'size': 0 if match.group(7) == '-' else int(match.group(7)),
        'status': match.group('rcode'),
        'file_requested': match.group(5).split()[1]
    }

Running the logscan application should now produce the same output as it did when we were using a more basic, split-based approach.
(text_processing)$ cat example3.log | logscan -c logscan.cfg

What just happened?

First of all, we imported the re module so that we have access to Python's regular expression services.

Next, at the LogProcessor class level, we defined a regular expression. This time, though, we did so via re.compile rather than a simple string. Regular expressions that are used more than a handful of times should be "prepared" by running them through re.compile first. This eases the load placed on the system by frequently used patterns. The re.compile function returns an SRE_Pattern object that can be passed in just about anywhere you can pass in a regular expression.

We then replaced our split method to take advantage of regular expressions. As you can see, we simply pass self._re in as opposed to a string-based regular expression. If we don't have a match, we raise a ParsingError, which bubbles up and generates an appropriate error message, much like we would see on an invalid split case.

Now, the end of the split method probably looks somewhat peculiar to you. Here, we've referenced our matched values via group identification mechanisms rather than by their list index into the split results. Regular expression components surrounded by parentheses create a group, which can be accessed via the group method on the Match object later down the road. It's also possible to access a previously matched group from within the same regular expression. Let's look at a somewhat smaller example.

>>> match = re.match(r'(0x[0-9a-f]+) (?P<two>\1)', '0xff 0xff')
>>> match.group(1)
'0xff'
>>> match.group(2)
'0xff'
>>> match.group('two')
'0xff'
>>> match.group('failure')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: no such group
>>>

Here, we surround two distinct regular expression components with parentheses: (0x[0-9a-f]+) and (?P<two>\1). The first regular expression matches a hexadecimal number. This becomes group ID 1.
The second expression matches whatever was found by the first, via the use of the \1. The "backslash-one" syntax references the first match. So, this entire regular expression only matches when we repeat the same hexadecimal number twice, separated with a space. The ?P<two> syntax is detailed below.

As you can see, the match is referenced after-the-fact using the match.group method, which takes a numeric index as its argument. Using standard regular expressions, you'll need to refer to a matched group using its index number. However, if you look at the second group, we added a (?P<name>...) construct. This is a Python extension that lets us refer to groupings by name, rather than by numeric group ID. The result is that we can reference groups of this type by name as opposed to simple numbers.

Finally, if an invalid group ID is passed in, an IndexError exception is thrown.

The following table outlines the characters used for building groups within a Python regular expression:

Finally, it's worth pointing out that parentheses can also be used to alter priority. For example, consider this code.

>>> re.match(r'abc{2}', 'abcc')
<_sre.SRE_Match object at 0x1004818b8>
>>> re.match(r'a(bc){2}', 'abcc')
>>> re.match(r'a(bc){2}', 'abcbc')
<_sre.SRE_Match object at 0x1004937b0>
>>>

Whereas the first example matches c exactly two times, the second and third lines require us to repeat bc twice. This changes the meaning of the regular expression from "repeat the previous character twice" to "repeat the previous match within parentheses twice." The value within the group could have been its own complex regular expression, such as a([b-c]){2}.

Have a go hero – updating our stats processor to use named groups

Spend a couple of minutes and update our statistics processor to use named groups rather than integer-based references. This makes it slightly easier to read the assignment code in the split method.
You do not need to create names for all of the groups; naming the ones we're actually using will do.

Using greedy versus non-greedy operators

Regular expressions generally like to match as much text as possible before giving up or yielding to the next token in a pattern string. If that behavior is unexpected and not fully understood, it can be difficult to get your regular expression correct. Let's take a look at a small code sample to illustrate the point.

Suppose that with your newfound knowledge of regular expressions, you decided to write a small script to remove the angled brackets surrounding HTML tags. You might be tempted to do it like this:

>>> match = re.match(r'(?P<tag><.+>)', '<title>Web Page</title>')
>>> match.group('tag')
'<title>Web Page</title>'
>>>

The result is probably not what you expected. We got this result because regular expressions are greedy by nature. That is, they'll attempt to match as much as possible. If you look closely, <title> is a match for the supplied regular expression, as is the entire <title>Web Page</title> string. Both start with an angled bracket, contain at least one character, and end with an angled bracket.

The fix is to insert the question mark character, or the non-greedy operator, directly after the repetition specification. So, the following code snippet fixes the problem.

>>> match = re.match(r'(?P<tag><.+?>)', '<title>Web Page</title>')
>>> match.group('tag')
'<title>'
>>>

The question mark changes our meaning from "match as much as you possibly can" to "match only the minimum required to actually match."
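To see the difference across a whole string rather than a single match, the following sketch uses re.findall. It assumes Python 3, and the variable names are chosen only for this illustration; the negated-set variant is shown as a common alternative, not as something the text above prescribes.

```python
import re

html = '<title>Web Page</title>'

# Greedy: .+ grabs as much as possible, so one match spans both tags.
greedy = re.findall(r'<.+>', html)

# Non-greedy: .+? stops at the first closing bracket, yielding each tag.
lazy = re.findall(r'<.+?>', html)

# Alternative: a negated set can never run past a closing bracket.
negated = re.findall(r'<[^>]+>', html)

print(greedy)   # ['<title>Web Page</title>']
print(lazy)     # ['<title>', '</title>']
print(negated)  # ['<title>', '</title>']
```

Both the non-greedy form and the negated set give the same result here; which one reads better is largely a matter of taste.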
Packt
24 Feb 2016
30 min read

Make Things Pretty with ggplot2

The objective of this article is to provide you with a general overview of the plotting environments in R and of the most efficient way of coding your graphs in it. We will go through the most important Integrated Development Environments (IDEs) available for R as well as the most important packages available for plotting data; this will help you to get an overview of what is available in R and how those packages compare with ggplot2. Finally, we will dig deeper into the grammar of graphics, which represents the basic concepts on which ggplot2 was designed. But first, let's make sure that you have a working version of R on your computer.

Getting ggplot2 up and running

You can download the most up-to-date version of R from the R project website (http://www.r-project.org/). There, you will find a direct connection to the Comprehensive R Archive Network (CRAN), a network of FTP and web servers around the world that store identical, up-to-date versions of code and documentation for R. In addition to access to the CRAN servers, on the website of the R project, you may also find information about R, a few technical manuals, the R journal, and details about the packages developed for R and stored in the CRAN repositories.

At the time of writing, the current version of R is 3.1.2. If you have already installed R on your computer, you can check the installed version with the R.Version() function or, for a more concise result, with R.version.string, which returns only part of the output of the previous function.

Packages in R

In the next few pages of this article, we will quickly go through the most important visualization packages available in R, so in order to try the code, you will also need to have additional packages as well as ggplot2 up and running in your R installation.
In the basic R installation, you will already have the graphics package available and loaded in the session; the lattice package is already available among the standard packages delivered with the basic installation, but it is not loaded by default. ggplot2, on the other hand, will need to be installed. You can install and load a package with the following code:

> install.packages("ggplot2")
> library(ggplot2)

Keep in mind that every time R is started, you will need to load the package you need with the library(name_of_the_package) command to be able to use the functions contained in the package. In order to get a list of all the packages installed on your computer, you can call the library() function without arguments. If, on the other hand, you would like to have a list of the packages currently loaded in the workspace, you can use the search() command. One more function that can turn out to be useful when managing your library of packages is .libPaths(), which provides you with the location of your R libraries. This function is very useful to trace back the package libraries you are currently using, if any, in addition to the standard library of packages, which on Windows is located by default in a path of the kind C:/Program Files/R/R-3.1.2/library.

The following list is a short recap of the functions just discussed:

.libPaths()   # get library location
library()     # see all the packages installed
search()      # see the packages currently loaded

Integrated Development Environment (IDE)

You will definitely be able to run the code and the examples explained in the article directly from the standard R Graphical User Interface (GUI). However, if you frequently work with R on more complex projects, or simply like to keep an eye on the different components of your code, such as scripts, plots, and help pages, you may well consider using an IDE.
The number of IDEs that integrate with R is still limited, but some of them are quite efficient, well-designed, and open source.

RStudio

RStudio (http://www.rstudio.com/) is a very nice and advanced programming environment developed specifically for R, and this would be my recommended choice of IDE as the R programming environment in most cases. It is available for all the major platforms (Windows, Linux, and Mac OS X), and it can be run on a local machine, such as your computer, or even over the Web, using RStudio Server. With RStudio Server, you can connect a browser-based interface (the RStudio IDE) to a version of R running on a remote Linux server.

RStudio integrates several useful functionalities, in particular if you use R for a more complex project. The way the software interface is organized allows you to keep an eye on the different activities you very often deal with in R, such as working on different scripts, overviewing the installed packages, as well as having easy access to the help pages and the plots generated. This last feature is particularly interesting for ggplot2 since in RStudio, you will be able to easily access the history of the plots created instead of visualizing only the last created plot, as is the case in the default R GUI. One other very useful feature of RStudio is code completion. You can, in fact, start typing a command, and upon pressing the Tab key, the interface will provide you with functions matching what you have written. This feature will turn out to be very useful in ggplot2, so you will not necessarily need to remember all the functions, and you will also have guidance for the arguments of the functions.

In Figure 1.1, you can see a screenshot from the current version of RStudio (v 0.98.1091):

Figure 1.1: This is a screenshot of RStudio on Windows 8

The environment is composed of four different areas:

Scripting area: In this area you can open, create, and write the scripts.
Console area: This area is the actual R console in which the commands are executed. It is possible to type commands directly here in the console or write them in a script and then run them on the console (I would recommend the latter option).
Workspace/History area: In this area, you can find a practical summary of all the objects created in the workspace in which you are working and the history of the typed commands.
Visualization area: Here, you can easily load packages, open R help files, and, even more importantly, visualize plots.

The RStudio website provides a lot of material on how to use the program, such as manuals, tutorials, and videos, so if you are interested, refer to the website for more details.

Eclipse and StatET

Eclipse (http://www.eclipse.org/) is a very powerful IDE that was mainly developed in Java and initially intended for Java programming. Subsequently, several extension packages were also developed to optimize the programming environment for other programming languages, such as C++ and Python. Thanks to its original objective of being a tool for advanced programming, this IDE is particularly intended to deal with very complex programming projects, for instance, if you are working on a big project folder with many different scripts. In these circumstances, Eclipse could help you to keep your programming scripts in order and have easy access to them. One drawback of such a development environment is probably its large size (around 200 MB) and its slightly slow startup. Eclipse does not support interaction with R natively, so in order to be able to write your code and execute it directly in the R console, you need to add StatET to your basic Eclipse installation. StatET (http://www.walware.de/goto/statet) is a plugin for the Eclipse IDE, and it offers a set of tools for R coding and package building.
More detailed information on how to install Eclipse and StatET and how to configure the connections between R and Eclipse/StatET can be found on the websites of the related projects.

Emacs and ESS

Emacs (http://www.gnu.org/software/emacs/) is a customizable text editor and is very popular, particularly in the Linux environment. Although this text editor appears with a very simple GUI, it is an extremely powerful environment, particularly thanks to the numerous keyboard shortcuts that, with some practice, allow you to interact with the environment very efficiently. Although the user interface of a typical IDE, such as RStudio, is more sophisticated and advanced, Emacs may be useful if you need to work with R on systems with a poor graphical interface, such as servers and terminal windows. Like Eclipse, Emacs does not support interfacing with R by default, so you will need to install an add-on package that enables such a connection: Emacs Speaks Statistics (ESS). ESS (http://ess.r-project.org/) is designed to support the editing of scripts and interaction with various statistical analysis programs, including R. The objective of the ESS project is to provide efficient text editor support for statistical software, which in some cases comes with a more or less defined GUI, but for which the real power of the language is only accessible through the original scripting language.

The plotting environments in R

R provides a complete series of options to realize graphics, which makes it quite advanced with regard to data visualization. Over the next few sections of this article, we will go through the most important R packages for data visualization by quickly discussing some high-level differences and analogies.
If you already have some experience with other R packages for data visualization, in particular graphics or lattice, the following sections will provide you with some references and examples of how the code used in such packages appears in comparison with that used in ggplot2. Moreover, you will also have an idea of the typical layout of the plots created with a certain package, so you will be able to identify the tool used to realize the plots you will come across. The core of graphics visualization in R is within the grDevices package, which provides the basic structure of data plotting, such as the colors and fonts used in the plots. Such a graphic engine was then used as the starting point in the development of more advanced and sophisticated packages for data visualization, the most commonly used being graphics and grid. The graphics package is often referred to as the base or traditional graphics environment since, historically, it was the first package for data visualization available in R, and it provides functions that allow the generation of complete plots. The grid package, on the other hand, provides an alternative set of graphics tools. This package does not directly provide functions that generate complete plots, so it is not frequently used directly to generate graphics, but it is used in the development of advanced data visualization packages. Among the grid-based packages, the most widely used are lattice and ggplot2, although they are built by implementing different visualization approaches—Trellis plots in the case of lattice and the grammar of graphics in the case of ggplot2. We will describe these principles in more detail in the coming sections. A diagram representing the connections between the tools just mentioned is shown in Figure 1.2. Just keep in mind that this is not a complete overview of the packages available but simply a small snapshot of the packages we will discuss. 
Many other packages are built on top of the tools just mentioned, but in the following sections, we will focus on the most relevant packages used in data visualization, namely graphics, lattice, and, of course, ggplot2. If you would like to get a more complete overview of the graphics tools available in R, you can have a look at the web page of the R project summarizing such tools, http://cran.r-project.org/web/views/Graphics.html.

Figure 1.2: This is an overview of the most widely used R packages for graphics

In order to see some examples of plots in graphics, lattice, and ggplot2, we will go through a few examples of different plots over the following pages. The objective of providing these examples is not to do an exhaustive comparison of the three packages but simply to show how the code and the default plot layouts differ among these plotting tools. For these examples, we will use the Orange dataset available in R; to load it in the workspace, simply write the following code:

    data(Orange)

This dataset contains records of the growth of orange trees. You can have a look at the data by recalling its first lines with the following code:

    head(Orange)

You will see that the dataset contains three columns. The first one, Tree, is an ID number indicating the tree on which the measurement was taken, while age and circumference refer to the age in days and the size of the tree in millimeters, respectively. If you want to have more information about this data, you can have a look at the help page of the dataset by typing the following code:

    ?Orange

Here, you will find the reference of the data as well as a more detailed description of the variables included.

Standard graphics and grid-based graphics

The existence of these two different graphics environments brings these questions to most users' minds: which package to use and under which circumstances?
For simple and basic plots, where the data simply needs to be represented in a standard plot type (such as a scatter plot, histogram, or boxplot) without any additional manipulation, all the plotting environments are fairly equivalent. In fact, it would probably be possible to produce the same type of plot with graphics as well as with lattice or ggplot2. Nevertheless, in general, the default graphic output of ggplot2 or lattice will most likely be superior to that of graphics, since both these packages are designed with the principles of human perception in mind, to make the evaluation of data contained in plots easier. When more complex data needs to be analyzed, the grid-based packages, lattice and ggplot2, offer more sophisticated support for the analysis of multivariate data. On the other hand, these tools require greater effort to become proficient because of their flexibility and advanced functionalities. Both lattice and ggplot2 provide a full set of tools for data visualization, so you will not need to use grid directly in most cases, but you will be able to do all your work directly with one of those packages.

Graphics and standard plots

The graphics package was originally developed based on the experience of the graphics environment in S. The approach implemented in this package is based on the pen-on-paper model, where the plot is drawn in the first function call and, once content is added, it cannot be deleted or modified. In general, the functions available in this package can be divided into high-level and low-level functions. High-level functions are functions capable of drawing the actual plot, while low-level functions are functions used to add content to a graph that was already created with a high-level function.
Let's assume that we would like to have a look at how age is related to the circumference of the trees in our dataset Orange; we could simply plot the data on a scatter plot using the high-level function plot() as shown in the following code:

    plot(age~circumference, data=Orange)

This code creates the graph in Figure 1.3. As you may have noticed, we obtained the graph directly with a call to a function that contains the variables to plot in the form of y~x, and the dataset in which to locate them. As an alternative, instead of using a formula expression, you can use a direct reference to x and y, using code in the form of plot(x,y). In this case, you will have to use a direct reference to the data instead of using the data argument of the function. Type in the following code:

    plot(Orange$circumference, Orange$age)

The preceding code results in the following output:

Figure 1.3: Simple scatterplot of the dataset Orange using graphics

For the time being, we are not interested in the plot's details, such as the title or the axis, but we will simply focus on how to add elements to the plot we just created. For instance, if we want to include a regression line as well as a smooth line to have an idea of the relation between the data, we should use low-level functions to add these lines to the plot we just created; this is done with the abline() and lines() functions:

    plot(age~circumference, data=Orange)   ### Create basic plot
    abline(lm(Orange$age~Orange$circumference), col="blue")
    lines(loess.smooth(Orange$circumference,Orange$age), col="red")

The graph generated as the output of this code is shown in Figure 1.4:

Figure 1.4: This is a scatterplot of the Orange data with a regression line (in blue) and a smooth line (in red) realized with graphics

As illustrated, with this package, we have built a graph by first calling one function, which draws the main plot frame, and then additional elements were included using other functions.
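To make the pen-on-paper idea more concrete, here is a minimal sketch that keeps adding low-level elements to the same scatterplot. The legend labels and the title text are illustrative choices of ours, not part of the original example, and the fitted model is stored in a variable named fit only for convenience.

```r
data(Orange)

# High-level call: draws the plot frame and the points
plot(age ~ circumference, data = Orange)

# Low-level calls: each one only adds ink to the existing plot
fit <- lm(age ~ circumference, data = Orange)      # model behind the regression line
abline(fit, col = "blue")                          # regression line
lines(loess.smooth(Orange$circumference, Orange$age), col = "red")  # smooth line
legend("topleft", legend = c("Linear fit", "Loess smooth"),
       col = c("blue", "red"), lty = 1)            # illustrative legend
title(main = "Orange trees: age vs. circumference") # illustrative title
```

Note that none of the low-level calls can remove or change what is already drawn; to alter the frame, you would call plot() again and rebuild the figure.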
With graphics, only additional elements can be included in the graph without changing the overall plot frame defined by the plot() function. This ability to add several graphical elements together to create a complex plot is one of the fundamental elements of R, and you will notice how all the different graphical packages rely on this principle. If you are interested in getting other code examples of plots in graphics, there is also some demo code available in R for this package, and it can be visualized with demo(graphics). In the coming sections, you will find a quick reference to how you can generate a similar plot using graphics and ggplot2. As will be described in more detail later on, in ggplot2, there are two main functions to realize plots, ggplot() and qplot(). The function qplot() is a wrapper function that is designed to easily create basic plots with ggplot2, and it has a similar code to the plot() function of graphics. Due to its simplicity, this function is the easiest way to start working with ggplot2, so we will use this function in the examples in the following sections. The code in these sections also uses our example dataset Orange; in this way, you can run the code directly on your console and see the resulting output. 
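As a side note for readers on a recent installation: qplot() has been marked as deprecated since ggplot2 3.4.0, so the qplot() calls in the following sections may produce a deprecation warning. The sketch below shows how a few of them could be written with ggplot() instead; the variable names p_scatter, p_tree1, and p_box are our own and are used only for illustration.

```r
library(ggplot2)
data(Orange)

# Roughly equivalent to: qplot(circumference, age, data = Orange)
p_scatter <- ggplot(Orange, aes(x = circumference, y = age)) +
  geom_point()

# Roughly equivalent to: qplot(..., data = Orange[Orange$Tree == 1, ],
#                              geom = c("line", "point"))
p_tree1 <- ggplot(Orange[Orange$Tree == 1, ], aes(x = circumference, y = age)) +
  geom_line() +
  geom_point()

# Roughly equivalent to: qplot(Tree, circumference, data = Orange, geom = "boxplot")
p_box <- ggplot(Orange, aes(x = Tree, y = circumference)) +
  geom_boxplot()

print(p_scatter)   # a plot object is drawn when it is printed
```

Storing the plots in variables is optional; typing the ggplot() expression directly at the console draws the plot just as qplot() does.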
Scatterplot with individual data points

To generate the plot using graphics, use the following code:

    plot(age~circumference, data=Orange)

To generate the plot using ggplot2, use the following code:

    qplot(circumference,age, data=Orange)

Scatterplots with the line of one tree

To generate the plot using graphics, use the following code:

    plot(age~circumference, data=Orange[Orange$Tree==1,], type="l")

To generate the plot using ggplot2, use the following code:

    qplot(circumference,age, data=Orange[Orange$Tree==1,], geom="line")

Scatterplots with the line and points of one tree

To generate the plot using graphics, use the following code:

    plot(age~circumference, data=Orange[Orange$Tree==1,], type="b")

To generate the plot using ggplot2, use the following code:

    qplot(circumference,age, data=Orange[Orange$Tree==1,], geom=c("line","point"))

Boxplot of Orange dataset

To generate the plot using graphics, use the following code:

    boxplot(circumference~Tree, data=Orange)

To generate the plot using ggplot2, use the following code:

    qplot(Tree,circumference, data=Orange, geom="boxplot")

Boxplot with individual observations

To generate the plot using graphics, use the following code:

    boxplot(circumference~Tree, data=Orange)
    points(circumference~Tree, data=Orange)

To generate the plot using ggplot2, use the following code:

    qplot(Tree,circumference, data=Orange, geom=c("boxplot","point"))

Histogram of Orange dataset

To generate the plot using graphics, use the following code:

    hist(Orange$circumference)

To generate the plot using ggplot2, use the following code:

    qplot(circumference, data=Orange, geom="histogram")

Histogram with reference line at median value in red

To generate the plot using graphics, use the following code:

    hist(Orange$circumference)
    abline(v=median(Orange$circumference), col="red")

To generate the plot using ggplot2, use the following code:

    qplot(circumference, data=Orange, geom="histogram")+geom_vline(xintercept = median(Orange$circumference), colour="red")

Lattice and the Trellis plots

Along with graphics, the base R installation also includes the lattice package. This package implements a family of techniques known as Trellis graphics, proposed by William Cleveland to visualize complex datasets with multiple variables. The objective of those design principles was to ensure the accurate and faithful communication of data information. These principles are embedded into the package and are already evident in the default plot design settings. One interesting feature of Trellis plots is the option of multipanel conditioning, which creates multiple plots by splitting the data on the basis of one variable. A similar option is also available in ggplot2, but in that case, it is called faceting. In lattice, we also have functions that are able to generate a plot with one single call, but once the plot is drawn, it is already final. Consequently, plot details, as well as additional elements that need to be included in the graph, must be specified within the call to the main function. This is done by including all the specifications in the panel function argument.
These specifications can be included directly in the main body of the function or specified in an independent function, which is then called; this last option usually generates more readable code, so this will be the approach used in the following examples. For instance, if we want to draw the same plot we just generated in the previous section with graphics, containing the age and circumference of trees and also the regression and smooth lines, we need to specify such elements within the function call. You may see an example of the code here; remember that lattice needs to be loaded in the workspace:

    require(lattice)                  ## Load lattice if needed
    myPanel <- function(x,y){
      panel.xyplot(x,y)               # Add the observations
      panel.lmline(x,y,col="blue")    # Add the regression
      panel.loess(x,y,col="red")      # Add the smooth line
    }
    xyplot(age~circumference, data=Orange, panel=myPanel)

This code produces the plot in Figure 1.5:

Figure 1.5: This is a scatter plot of the Orange data with the regression line (in blue) and the smooth line (in red) realized with lattice

As you may have noticed, setting aside the code differences, the plot generated does not look very different from the one obtained with graphics. This is because we are not using any special visualization feature of lattice. As mentioned earlier, with this package, we have the option of multipanel conditioning, so let's take a look at this. Let's assume that we want to have the same plot but for the different trees in the dataset. Of course, in this case, you would not need the regression or the smooth line, since there will only be one tree in each plot window, but it could be nice to have the different observations connected.
This is shown in the following code:

    myPanel <- function(x,y){
      panel.xyplot(x,y, type="b")   # the observations
    }
    xyplot(age~circumference | Tree, data=Orange, panel=myPanel)

This code generates the graph shown in Figure 1.6:

Figure 1.6: This is a scatterplot of the Orange data realized with lattice, with one subpanel representing the individual data of each tree. The ID of the tree shown in each panel is reported in the upper part of the panel

As illustrated, using the vertical bar |, we are able to obtain the plot conditional on the value of the variable Tree. In the upper part of the panels, you will notice the reference to the value of the conditioning variable, which, in this case, is the column Tree. As mentioned before, ggplot2 offers this option too; we will see one example of that in the next section. In the next section, you will find a quick reference to how to convert a typical plot type from lattice to ggplot2. In this case, the examples are adapted to the typical plotting style of the lattice plots.
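Conceptually, conditioning on Tree amounts to splitting the data by that variable and drawing one panel per subset. The following sketch, which uses only base R, illustrates that idea; it is our own imitation and not how lattice is implemented internally. The names by_tree and old_par are hypothetical.

```r
data(Orange)

# Conditioning on Tree is conceptually a split of the data by that variable
by_tree <- split(Orange, Orange$Tree)

length(by_tree)          # number of subsets, one per tree
sapply(by_tree, nrow)    # observations available for each tree

# Rough base-graphics imitation of the multipanel layout
old_par <- par(mfrow = c(2, 3))
for (id in names(by_tree)) {
  plot(age ~ circumference, data = by_tree[[id]], type = "b",
       main = paste("Tree", id))
}
par(old_par)             # restore the previous layout settings
```

With lattice, the single formula age~circumference | Tree replaces the explicit split and loop, and the package also takes care of shared axes and panel strips.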
Scatterplot with individual observations

To plot the graph using lattice, use the following code:

    xyplot(age~circumference, data=Orange)

To plot the graph using ggplot2, use the following code:

    qplot(circumference,age, data=Orange)

Scatterplot of Orange dataset with faceting

To plot the graph using lattice, use the following code:

    xyplot(age~circumference|Tree, data=Orange)

To plot the graph using ggplot2, use the following code:

    qplot(circumference,age, data=Orange, facets=~Tree)

Faceting scatterplot with line and points

To plot the graph using lattice, use the following code:

    xyplot(age~circumference|Tree, data=Orange, type="b")

To plot the graph using ggplot2, use the following code:

    qplot(circumference,age, data=Orange, geom=c("line","point"), facets=~Tree)

Scatterplots with grouping data

To plot the graph using lattice, use the following code:

    xyplot(age~circumference, data=Orange, groups=Tree, type="b")

To plot the graph using ggplot2, use the following code:

    qplot(circumference,age, data=Orange, color=Tree, geom=c("line","point"))

Boxplot of Orange dataset

To plot the graph using lattice, use the following code:

    bwplot(circumference~Tree, data=Orange)

To plot the graph using ggplot2, use the following code:

    qplot(Tree,circumference, data=Orange, geom="boxplot")

Histogram of Orange dataset

To plot the graph using lattice, use the following code:

    histogram(Orange$circumference, type = "count")

To plot the graph using ggplot2, use the following code:

    qplot(circumference, data=Orange, geom="histogram")

Histogram with reference line at median value in red

To plot the graph using lattice, use the following code:

    histogram(~circumference, data=Orange, type = "count",
      panel=function(x,...){panel.histogram(x, ...);panel.abline(v=median(x), col="red")})

To plot the graph using ggplot2, use the following code:

    qplot(circumference, data=Orange, geom="histogram")+geom_vline(xintercept = median(Orange$circumference), colour="red")

ggplot2 and the grammar of graphics

The ggplot2 package was developed by Hadley Wickham by implementing a completely different approach to statistical plots. As is the case with lattice, this package is also based on grid, providing a series of high-level functions that allow the creation of complete plots. It builds on the principles described in the book The Grammar of Graphics by Leland Wilkinson. Briefly, The Grammar of Graphics assumes that a statistical graphic is a mapping of data to the aesthetic attributes and geometric objects used to represent data, such as points, lines, bars, and so on. Besides the aesthetic attributes, the plot can also contain statistical transformation or grouping of data. As in lattice, in ggplot2, we have the possibility of splitting data by a certain variable, obtaining a representation of each subset of data in an independent subplot; such representation in ggplot2 is called faceting.
In a more formal way, the main components of the grammar of graphics are the data and its mapping, aesthetics, geometric objects, statistical transformations, scales, coordinates, and faceting:

• The data that must be visualized is mapped to aesthetic attributes, which define how the data should be perceived
• Geometric objects describe what is actually displayed on the plot, such as lines, points, or bars; the geometric objects basically define which kind of plot you are going to draw
• Statistical transformations are applied to the data to group them; examples of statistical transformations would be the smooth line or the regression lines of the previous examples or the binning of the histograms
• Scales represent the connection between the aesthetic spaces and the actual values that should be represented. Scales may also be used to draw legends
• Coordinates represent the coordinate system in which the data is drawn
• Faceting, which we have already mentioned, is the grouping of data in subsets defined by a value of one variable

In ggplot2, there are two main high-level functions capable of directly creating a plot, qplot() and ggplot(); qplot() stands for quick plot, and it is a simple function that serves a purpose similar to that served by the plot() function in graphics. The ggplot() function, on the other hand, is a much more advanced function that allows the user to have more control of the plot layout and details. In our journey into the world of ggplot2, we will see some examples of qplot(), in particular when we go through the different kinds of graphs, but we will dig a lot deeper into ggplot() since this last function is more suited to advanced examples. If you have a look at the different forums based on R programming, there is quite a bit of discussion as to which of these two functions would be more convenient to use. My general recommendation would be that it depends on the type of graph you are drawing more frequently.
For simple and standard plots, where only the data should be represented and only the minor modification of standard layouts are required, the qplot() function will do the job. On the other hand, if you need to apply particular transformations to the data or if you would just like to keep the freedom of controlling and defining the different details of the plot layout, I would recommend that you focus on ggplot(). As you will see, the code between these functions is not completely different since they are both based on the same underlying philosophy, but the way in which the options are set is quite different, so if you want to adapt a plot from one function to the other, you will essentially need to rewrite your code. If you just want to focus on learning only one of them, I would definitely recommend that you learn ggplot(). In the following code, you will see an example of a plot realized with ggplot2, where you can identify some of the components of the grammar of graphics. The example is realized with the ggplot() function, which allows a more direct comparison with the grammar of graphics, but coming just after the following code, you could also find the corresponding qplot() code useful. 
Both code snippets generate the graph depicted in Figure 1.7:

    require(ggplot2)                              ## Load ggplot2
    data(Orange)                                  ## Load the data

    ggplot(data=Orange,                           ## Data used
      aes(x=circumference, y=age, color=Tree)) +  ## Aesthetic
      geom_point() +                              ## Geometry
      stat_smooth(method="lm", se=FALSE)          ## Statistics

    ### Corresponding code with qplot()
    qplot(circumference, age, data=Orange,        ## Data used
      color=Tree,                                 ## Aesthetic mapping
      geom=c("point","smooth"), method="lm", se=FALSE)

This simple example can give you an idea of the role of each portion of code in a ggplot2 graph; you have seen how the main function body creates the connection between the data and the aesthetics we are interested in representing and how, on top of this, you add the components of the plot; in this case, we added the geometry element of points and the statistical element of regression. You can also notice how the components that need to be added to the main function call are included using the + sign. One more thing worth mentioning at this point is that if you run just the main ggplot() call on its own, you will not get a usable plot: depending on the ggplot2 version, you will get either an error message or an empty plot. This is because this call alone is not able to generate an actual plot. The step during which the plot is actually created is when you include the geometric attribute, which, in this case, is geom_point(). This is perfectly in line with the grammar of graphics since, as we have seen, the geometry represents the actual connection between the data and what is represented on the plot. This is the stage where we specify that the data should be represented as points; before that, nothing was specified about which plot we were interested in drawing.
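The role of each layer can be inspected by building the plot step by step and storing the intermediate objects. The sketch below is a minimal illustration; the variable names p, p_points, and p_full are ours, and it assumes that the layers of a plot object can be read with p$layers, which holds for the ggplot2 versions we are aware of but is an implementation detail rather than a documented guarantee.

```r
library(ggplot2)
data(Orange)

# Data and aesthetic mapping only: no geometry has been specified yet
p <- ggplot(Orange, aes(x = circumference, y = age, colour = Tree))
length(p$layers)          # number of layers attached so far

# Adding a geometry creates the first layer
p_points <- p + geom_point()

# Adding a statistic creates a second layer on top of the points
p_full <- p_points + stat_smooth(method = "lm", formula = y ~ x, se = FALSE)
length(p_full$layers)

# Faceting is added with the same + syntax
p_full + facet_wrap(~ Tree)
```

Keeping the intermediate objects makes it easy to reuse the same data and aesthetic mapping with different geometries.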
Figure 1.7: This is an example of plotting the Orange dataset with ggplot2

Summary

To learn more about similar technologies, the following books and videos published by Packt Publishing (https://www.packtpub.com/) are recommended:

• ggplot2 Essentials (https://www.packtpub.com/big-data-and-business-intelligence/ggplot2-essentials)
• Video: Building Interactive Graphs with ggplot2 and Shiny (https://www.packtpub.com/big-data-and-business-intelligence/building-interactive-graphs-ggplot2-and-shiny-video)

Packt
29 Oct 2015
32 min read

Protecting Your Bitcoins

In this article by Richard Caetano, author of the book Learning Bitcoin, we will explore ways to safely hold your own bitcoin. We will cover the following topics:

• Storing your bitcoins
• Working with brainwallet
• Understanding deterministic wallets
• Storing Bitcoins in cold storage
• Good housekeeping with Bitcoin

Storing your bitcoins

The banking system has a legacy of offering various financial services to its customers. Banks offer convenient ways to spend money, such as cheques and credit cards, but the storage of money is their base service. For many centuries, banks have been a safe place to keep money. Customers rely on the interest paid on their deposits, as well as on the government insurance against theft and insolvency. Savings accounts have helped make preserving wealth easy and accessible to a large population in the western world. Yet, some people still save a portion of their wealth as cash or precious metals, usually in a personal safe at home or in a safety deposit box. They may be those who have, over the years, experienced or witnessed the downsides of banking: government confiscation, out-of-control inflation, or runs on the bank.

Furthermore, a large part of the world's population does not have access to the western banking system. For those who live in remote areas or for those without credit, opening a bank account is virtually impossible. They must handle their own money properly to prevent loss or theft. In some parts of the world, there can be great risk involved. These groups of people, who have little or no access to banking, are called the "underbanked". For the underbanked population, Bitcoin offers immediate access to a global financial system. Anyone with access to the internet, or who carries a mobile phone with the ability to send and receive SMS messages, can hold his or her own bitcoin and make global payments. They can essentially become their own bank.
However, you must understand that Bitcoin is still in its infancy as a technology. Similar to the Internet of circa 1995, it has demonstrated enormous potential, yet lacks usability for a mainstream audience. As a parallel, e-mail in its early days was a challenge for most users to set up and use, yet today it's as simple as entering your e-mail address and password on your smartphone. Bitcoin has yet to develop through these stages. Yet, with some simple guidance, we can already start realizing its potential. Let's discuss some general guidelines for understanding how to become your own bank. Bitcoin savings In most normal cases, we only keep a small amount of cash in our hand wallets to protect ourselves from theft or accidental loss. Much of our money is kept in checking or savings accounts with easy access to pay our bills. Checking accounts are used to cover our rent, utility bills, and other payments, while our savings accounts hold money for longer-term goals, such as a down payment on buying a house. It's highly advisable to develop a similar system for managing your Bitcoin money. Both local and online wallets provide a convenient way to access your bitcoins for day-to-day transactions. Yet there is the unlikely risk that one could lose his or her Bitcoin wallet due to an accidental computer crash or faulty backup. With online wallets, we run the risk of the website or the company becoming insolvent, or falling victim to cybercrime. By developing a reliable system, we can adopt our own personal 'Bitcoin Savings' account to hold our funds for long-term storage. Usually, these savings are kept offline to protect them from any kind of computer hacking. With protected access to our offline storage, we can periodically transfer money to and from our savings. Thus, we can arrange our Bitcoin funds much as we manage our money with our hand wallets and checking/savings accounts. 
Paper wallets

As explained, a private key is a large random number that acts as the key to spend your bitcoins. A cryptographic algorithm is used to generate a private key and, from it, a public address. We can share the public address to receive bitcoins, and, with the private key, spend the funds sent to the address. Generally, we rely on our Bitcoin wallet software to handle the creation and management of our private keys and public addresses. As these keys are stored on our computers and networks, they are vulnerable to hacking, hardware failures, and accidental loss.

Private keys and public addresses are, in fact, just strings of letters and numbers. This format makes it easy to move the keys offline for physical storage. Keys printed on paper are called a "paper wallet" and are highly portable and convenient to store in a physical safe or a bank safety deposit box. With the private key generated and stored offline, we can safely send bitcoin to its public address. A paper wallet must include at least one private key and its computed public address. Additionally, the paper wallet can include a QR code to make it convenient to retrieve the key and address. Figure 3.1 is an example of a paper wallet generated by Coinbase:

Figure 3.1 - Paper wallet generated from Coinbase

The paper wallet includes both the public address (labeled Public key) and the private key, both with QR codes to easily transfer them back to your online wallet. Also included on the paper wallet is a place for notes. This type of wallet is easy to print for safe storage. It is recommended that copies are stored securely in multiple locations in case the paper is destroyed. As the private key is shown in plain text, anyone who has access to this wallet has access to the funds.

Do not store your paper wallet on your computer. Loss of the paper wallet due to hardware failure, hacking, spyware, or accidental loss can result in complete loss of your bitcoin.
Make sure you have multiple copies of your wallet printed and securely stored before transferring your money.

One-time use paper wallets

Transactions from bitcoin addresses must include the full amount. When sending a partial amount to a recipient, the remaining balance must be sent to a change address. Paper wallets that include only one private key are considered to be "one-time use" paper wallets. While you can always send multiple transfers of bitcoin to the wallet, it is highly recommended that you spend the coins only once. Therefore, you shouldn't move a large number of bitcoins to the wallet expecting to spend a partial amount. With this in mind, when using a one-time use paper wallet, it's recommended that you only save a usable amount to each wallet. This amount could be a block of coins that you'd like to fully redeem to your online wallet.

Creating a paper wallet

To create a paper wallet in Coinbase, simply log in with your username and password. Click on the Tools link on the left-hand side menu. Next, click on the Paper Wallets link from the top menu. Coinbase will prompt you to Generate a paper wallet and Import a paper wallet. Follow the links to generate a paper wallet. You can expect to see the paper wallet rendered, as shown in figure 3.2:

Figure 3.2 - Creating a paper wallet with Coinbase

Coinbase generates your paper wallet completely from your browser, without sending the private key back to its server. This is important to protect your private key from exposure to the network. You are generating the only copy of your private key. Make sure that you print and securely store multiple copies of your paper wallet before transferring any money to it. Loss of your wallet and private key will result in the loss of your bitcoin. By clicking the Regenerate button, you can generate multiple paper wallets and store various amounts of bitcoin on each wallet.
Each wallet is easily redeemable in full at Coinbase or with other bitcoin wallet services.

Verifying your wallet's balance

After generating and printing multiple copies of your paper wallet, you're ready to transfer your funds. Coinbase will prompt you with an easy option to transfer the funds from your Coinbase wallet to your paper wallet:

Figure 3.3 - Transferring funds to your paper wallet

Figure 3.3 shows Coinbase's prompt to transfer your funds. It provides options to enter your amount in BTC or USD. Simply specify your amount and click Send. Note that Coinbase only keeps a copy of your public address. You can continue to send additional amounts to your paper wallet using the same public address. For your first time working with paper wallets, it's advisable that you only send small amounts of bitcoin, to learn and experiment with the process. Once you feel comfortable with creating and redeeming paper wallets, you can feel secure with transferring larger amounts.

To verify that the funds have been moved to your paper wallet, we can use a blockchain explorer to check that the transfer has been confirmed by the network. Blockchain explorers make all the transaction data from the Bitcoin network available for public review. We'll use a service called Blockchain.info to verify our paper wallet. Simply open www.blockchain.info in your browser and enter the public address from your paper wallet (labeled Public key) in the search box. If found, Blockchain.info will display a list of the transaction activities on that address:

Figure 3.4 - Blockchain.info showing transaction activity

Figure 3.4 shows the transaction activity for the address starting with 16p9Lt. You can quickly see the total bitcoin received and the current balance. Under the Transactions section, you can find the details of the transactions recorded by the network. Also listed are the public addresses that were combined by the wallet software, as well as the change address used to complete the transfer.
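The role of the change address can be illustrated with a small calculation. The sketch below is only a worked arithmetic example: the amounts are made up, the helper change_amount() is hypothetical, and real wallet software selects inputs and fees on its own. Amounts are expressed in satoshis (1 BTC = 100,000,000 satoshis) to avoid rounding issues with decimal fractions.

```r
SATOSHI_PER_BTC <- 1e8          # 1 BTC = 100,000,000 satoshis

# Hypothetical helper: amount returned to the sender's change address
change_amount <- function(input_total, payment, fee) {
  change <- input_total - payment - fee
  if (change < 0) stop("inputs do not cover the payment plus the fee")
  change
}

input_total <- 50000000         # 0.5 BTC held on the spending address (made up)
payment     <- 20000000         # 0.2 BTC sent to the recipient (made up)
fee         <- 10000            # 0.0001 BTC miner's fee (illustrative)

change <- change_amount(input_total, payment, fee)
change                          # satoshis sent to the change address
change / SATOSHI_PER_BTC        # the same amount expressed in BTC
```

This is why a blockchain explorer typically lists two outputs for a partial payment: one for the recipient and one for the change.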
Note that at least six confirmations are required before the transaction is considered confirmed.

Importing versus sweeping

When importing your private key, the wallet software will simply add the key to its list of private keys. Your bitcoin wallet will manage your list of private keys. When sending money, it will combine the balances from multiple addresses to make the transfer. Any remaining amount will be sent back to the change address. The wallet software will automatically manage your change addresses.

Some Bitcoin wallets offer the ability to sweep your private key. This involves a second step. After importing your private key, the wallet software will make a transaction to move the full balance of your funds to a new address. This process will empty your paper wallet completely. The step to transfer the funds may require additional time to allow the network to confirm your transaction. This process could take up to one hour. In addition to the confirmation time, a small miner's fee may be applied. This fee could be around 0.0001 BTC.

If you are certain that you are the only one with access to the private key, it is safe to use the import feature. However, if you believe someone else may have access to the private key, sweeping is highly recommended.

Listed in the following table are some common bitcoin wallets which support importing a private key:

Bitcoin Wallet | Comments | Sweeping
Coinbase (https://www.coinbase.com/) | This provides direct integration between your online wallet and your paper wallet. | No
Electrum (https://electrum.org) | This provides the ability to import and see your private key for easy access to your wallet's funds. | Yes
Armory (https://bitcoinarmory.com/) | This provides the ability to import your private key or "sweep" the entire balance. | Yes
Multibit (https://multibit.org/) | This directly imports your private key. It may use a built-in address generator for change addresses. | No

Table 1 - Wallets that support importing private keys

Importing your paper wallet

To import your wallet, simply log into your Coinbase account. Click on Tools from the left-hand side menu, followed by Paper Wallet from the top menu. Then, click on the Import a paper wallet button. You will be prompted to enter the private key of your paper wallet, as shown in figure 3.5:

Figure 3.5 - Coinbase importing from a paper wallet

Simply enter the private key from your paper wallet. Coinbase will validate the key and ask you to confirm your import. If accepted, Coinbase will import your key and sweep your balance. The full amount will be transferred to your bitcoin wallet and become available after six confirmations.

Paper wallet guidelines

Paper wallets display your public and private keys in plain text. Make sure that you keep these documents secure.

While you can send funds to your wallet multiple times, it is highly recommended that you spend your balance only once and in full.

Before sending large amounts of bitcoin to a paper wallet, test your ability to generate and import a paper wallet with small amounts of bitcoin. When you're comfortable with the process, you can rely on paper wallets for larger amounts.

As paper is easily destroyed or ruined, make sure that you keep multiple copies of your paper wallet in different locations. Make sure each location is secure from unwanted access.

Be careful with online wallet generators. A malicious site operator can obtain the private key from your web browser. Only use trusted paper wallet generators. You can test an online paper wallet generator by opening the page in your browser while online, and then disconnecting your computer from the network. You should be able to generate your paper wallet when completely disconnected from the network, ensuring that your private keys are never sent back to the network.
Coinbase is an exception in that it only sends the public address back to the server for reference. This public address is saved to make it easy to transfer funds to your paper wallet. The private key is never saved by Coinbase when generating a paper wallet.

Paper wallet services

In addition to the services mentioned, there are other services that make paper wallets easy to generate and print. Listed next in Table 2 are just a few:

Service | Notes
BitAddress (bitaddress.org) | This offers the ability to generate single wallets, bulk wallets, brainwallets, and more.
Bitcoin Paper Wallet (bitcoinpaperwallet.com) | This offers a nice, stylish design and easy-to-use features. Users can purchase holographic stickers for securing the paper wallets.
Wallet Generator (walletgenerator.net) | This offers printable paper wallets that fold nicely to conceal the private keys.

Table 2 - Services for generating paper wallets and brainwallets

Brainwallets

Storing our private keys offline by using a paper wallet is one way we can protect our coins from attacks on the network. Yet, having a physical copy of our keys is similar to holding a gold bar: it's still vulnerable to theft if the attacker can physically obtain the wallet. One way to protect bitcoins from online or offline theft is to have the codes recallable by memory. As holding long random private keys in memory is quite difficult, even for the best of minds, we'll have to use another method to generate our private keys.

Creating a brainwallet

A brainwallet is a way to create one or more private keys from a long phrase of random words. From the phrase, called a passphrase, we're able to generate a private key, along with its public addresses, to store bitcoin. We can create any passphrase we'd like. The longer the phrase and the more random the characters, the more secure it will be. Brainwallet phrases should contain at least 12 words.
It is very important that the phrase never comes from anything published, such as a book or a song. Hackers actively search for possible brainwallets by performing brute force attacks on commonly published phrases. Here is an example of a brainwallet passphrase:

gently continue prepare history bowl shy dog accident forgive strain dirt consume

Note that the phrase is composed of 12 seemingly random words. One could use an easy-to-remember sentence rather than 12 words. Regardless of whether you record your passphrase on paper or memorize it, the idea is to use a passphrase that's easy to recall and type, yet difficult to crack. Don't let this happen to you:

"Just lost 4 BTC out of a hacked brain wallet. The pass phrase was a line from an obscure poem in Afrikaans. Somebody out there has a really comprehensive dictionary attack program running." Reddit Thread (http://redd.it/1ptuf3)

Unfortunately, this user lost their bitcoin because they chose a published line from a poem. Make sure that you choose a passphrase that is composed of multiple components of non-published text. Sadly, although warned, some users may resort to simple phrases that are easy to crack. Simple passwords such as 123456, password1, and iloveyou are still commonly used for e-mail and login accounts, and are routinely cracked. Do not use simple passwords for your brainwallet passphrase. Make sure that you use at least 12 words with additional characters and numbers.

Using the preceding passphrase, we can generate our private key and public address using the many tools available online. We'll use an online service called BitAddress to generate the actual brainwallet from the passphrase. Simply open www.bitaddress.org in your browser. At first, BitAddress will ask you to move your mouse cursor around to collect enough random points to generate a seed for generating random numbers. This process could take a minute or two. Once opened, select the option Brain Wallet from the top menu.
In the form presented, enter the passphrase and then enter it again to confirm. Click on View to see your private key and public address. For the example shown in figure 3.6, we'll use the preceding passphrase example:

Figure 3.6 - BitAddress's brainwallet feature

From the page, you can easily copy and paste the public address and use it for receiving Bitcoin. Later, when you're ready to spend the coins, enter the exact same passphrase to generate the same private key and public address. Referring to our Coinbase example from earlier in the article, we can then import the private key into our wallet.

Increasing brainwallet security

As an early attempt to give people a way to "memorize" their Bitcoin wallet, brainwallets have become a target for hackers. Some users have chosen phrases or sentences from common books as their brainwallet. Unfortunately, hackers with access to large amounts of computing power were able to search for these phrases and crack some brainwallets.

To improve the security of brainwallets, other methods have been developed. One service, called brainwallet.io, executes a time-intensive cryptographic function over the brainwallet phrase to create a seed that is very difficult to crack. It's important to know that the passphrases used with BitAddress are not compatible with brainwallet.io.

To use brainwallet.io to generate a more secure brainwallet, open http://brainwallet.io:

Figure 3.7 - brainwallet.io, a more secure brainwallet generator

Brainwallet.io needs a sufficient amount of entropy to generate a private key which is difficult to reproduce. Entropy, in computer science, can describe data in terms of its predictability. When data has high entropy, it could mean that it's difficult to reproduce from known sources. When generating private keys, it's very important to use data that has high entropy.
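As a rough illustration of what "high entropy" means for a passphrase, the sketch below estimates entropy in bits, assuming every word (or character) is picked uniformly at random and independently of the others. The function name and the 2,048-word list size are assumptions made for the example. A phrase taken from a book or a song has far less entropy than this estimate suggests, because it is not chosen at random.

```python
import math


def uniform_entropy_bits(choices_per_symbol: int, symbols: int) -> float:
    """Entropy in bits of `symbols` independent picks, each uniform over
    `choices_per_symbol` possibilities."""
    return symbols * math.log2(choices_per_symbol)


# 12 words drawn at random from an assumed list of 2,048 words.
twelve_words = uniform_entropy_bits(2048, 12)

# For comparison: a 6-character password of random lowercase letters.
short_password = uniform_entropy_bits(26, 6)

print(f"12 random words:   {twelve_words:.1f} bits")
print(f"6 random letters:  {short_password:.1f} bits")
```

Each extra bit doubles the number of guesses an attacker needs, which is why a long random phrase is so much harder to brute force than a short password.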
For generating brainwallet keys, we need data with high entropy, yet it should be easy for us to duplicate. To meet this requirement, brainwallet.io accepts your random passphrase, or can generate one from a list of random words. Additionally, it can use data from a file of your choice. Either way, the more entropy given, the stronger your passphrase will be. If you specify a passphrase, choosing at least 12 words is recommended. Next, brainwallet.io prompts you for salt, available in several forms: login info, personal info, or generic. Salts are used to add additional entropy to the generation of your private key. Their purpose is to prevent standard dictionary attacks against your passphrase. While using brainwallet.io, this information is never sent to the server. When ready, click the generate button, and the page will begin computing a scrypt function over your passphrase. Scrypt is a cryptographic function that requires computing time to execute. Due to the time required for each pass, it makes brute force attacks very difficult. brainwallet.io makes many thousands of passes to ensure that a strong seed is generated for the private key. Once finished, your new private key and public address, along with their QR codes, will be displayed for easy printing. As an alternative, WarpWallet is also available at https://keybase.io/warp. WarpWallet also computes a private key based on many thousands of scrypt passes over a passphrase and salt combination. Remember that brainwallet.io passphrases are not compatible with WarpWallet passphrases. Deterministic wallets We have introduced brainwallets that yield one private key and public address. They are designed for one time use and are practical for holding a fixed amount of bitcoin for a period of time. 
Yet, if we're making lots of transactions, it would be convenient to have the ability to generate unlimited public addresses so that we can use them to receive bitcoin from different transactions or to generate change addresses. A Type 1 Deterministic Wallet is a simple wallet schema based on a passphrase with an index appended. By incrementing the index, an unlimited number of addresses can be created. Each new address is indexed so that its private key can be quickly retrieved. Creating a deterministic wallet To create a deterministic wallet, simply choose a strong passphrase, as previously described, and then append a number to represent an individual private key and public address. It's practical to do this with a spreadsheet so that you can keep a list of public addresses on file. Then, when you want to spend the bitcoin, you simply regenerate the private key using the index. Let's walk through an example. First, we choose the passphrase: "dress retreat save scratch decide simple army piece scent ocean hand become" Then, we append an index, sequential number, to the passphrase: "dress retreat save scratch decide simple army piece scent ocean hand become0" "dress retreat save scratch decide simple army piece scent ocean hand become1" "dress retreat save scratch decide simple army piece scent ocean hand become2" "dress retreat save scratch decide simple army piece scent ocean hand become3" "dress retreat save scratch decide simple army piece scent ocean hand become4" Then, we take each passphrase, with the corresponding index, and run it through brainwallet.io, or any other brainwallet service, to generate the public address. Using a table or a spreadsheet, we can pre-generate a list of public addresses to receive bitcoin. 
Additionally, we can add a balance column to help track our money:

Index | Public Address | Balance
0 | 1Bc2WZ2tiodYwYZCXRRrvzivKmrGKg2Ub9 | 0.00
1 | 1PXRtWnNYTXKQqgcxPDpXEvEDpkPKvKB82 | 0.00
2 | 1KdRGNADn7ipGdKb8VNcsk4exrHZZ7FuF2 | 0.00
3 | 1DNfd491t3ABLzFkYNRv8BWh8suJC9k6n2 | 0.00
4 | 17pZHju3KL4vVd2KRDDcoRdCs2RjyahXwt | 0.00

Table 3 - Using a spreadsheet to track deterministic wallet addresses

Spending from a deterministic wallet

When we have money available in our wallet to spend, we can simply regenerate the private key for the index matching the public address. For example, let's say we have received 2 BTC on the address starting with 1KdRGN in the preceding table. Since we know it belongs to index #2, we can reopen the brainwallet from the passphrase:

"dress retreat save scratch decide simple army piece scent ocean hand become2"

Using brainwallet.io as our brainwallet service, we quickly regenerate the original private key and public address:

Figure 3.8 - Private key re-generated from a deterministic wallet

Finally, we import the private key into our Bitcoin wallet, as described earlier in the article. If we don't want to keep the change in our online wallet, we can simply send the change back to the next available public address in our deterministic wallet.

Pre-generating public addresses with deterministic wallets can be useful in many situations. Perhaps you want to do business with a partner and want to receive 12 payments over the course of one year. You can simply pre-generate the 12 addresses and keep track of each payment using a spreadsheet. Another example could apply to an e-commerce site. If you'd like to receive payment for the goods or services being sold, you can pre-generate a long list of addresses. Storing only the public addresses on your website protects your funds from a malicious attack on your web server.

While Type 1 deterministic wallets are very useful, we'll introduce a more advanced version called the Type 2 Hierarchical Deterministic Wallet next.
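The Type 1 scheme above (one passphrase, one appended index per key) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the salt, and the scrypt parameters are made up for the example, so the output is not compatible with brainwallet.io, WarpWallet, or BitAddress. The sketch also stops at the 32 bytes of key material; turning them into a Bitcoin address needs elliptic-curve and Base58Check steps that are omitted here.

```python
import hashlib


def derive_key_material(passphrase: str, index: int,
                        salt: bytes = b"example-salt") -> bytes:
    """Stretch passphrase + index into 32 bytes using scrypt.

    The salt and the cost parameters (n, r, p) are illustrative assumptions,
    not the values used by any particular brainwallet service.
    """
    secret = f"{passphrase}{index}".encode("utf-8")
    return hashlib.scrypt(secret, salt=salt, n=2**14, r=8, p=1, dklen=32)


passphrase = ("dress retreat save scratch decide simple army "
              "piece scent ocean hand become")

# Pre-generate key material for indexes 0 to 4, as in the spreadsheet example.
keys = {index: derive_key_material(passphrase, index) for index in range(5)}

for index, key in keys.items():
    print(index, key.hex())

# Regenerating index 2 later yields exactly the same bytes.
regenerated = derive_key_material(passphrase, 2)
print("index 2 reproducible:", regenerated == keys[2])
```

Because the derivation is deterministic, only the passphrase (and here the salt) has to be kept safe; the list of indexes and public addresses can live in an ordinary spreadsheet.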
Type 2 Hierarchical Deterministic wallets Type 2 Hierarchical Deterministic (HD) wallets function similarly to Type 1 deterministic wallets, as they are able to generate an unlimited amount of private keys from a single passphrase, but they offer more advanced features. HD wallets are used by desktop, mobile, and hardware wallets as a way of securing an unlimited number of keys by a single passphrase. HD wallets are secured by a root seed. The root seed, generated from entropy, can be a number up to 64 bytes long. To make the root seed easier to save and recover, a phrase consisting of a list of mnemonic code words is rendered. The following is an example of a root seed: 01bd4085622ab35e0cd934adbdcce6ca To render the mnemonic code words, the root seed number plus its checksum is combined and then divided into groups of 11 bits. Each group of bits represents an index between 0 and 2047. The index is then mapped to a list of 2,048 words. For each group of bits, one word is listed, as shown in the following example, which generates the following phrase: essence forehead possess embarrass giggle spirit further understand fade appreciate angel suffocate BIP-0039 details the specifications for creating mnemonic code words to generate a deterministic key, and is available at https://en.bitcoin.it/wiki/BIP_0039. In the HD wallet, the root seed is used to generate a master private key and a master chain code. The master private key is used to generate a master public key, as with normal Bitcoin private keys and public keys. These keys are then used to generate additional children keys in a tree-like structure. Figure 3.9 illustrates the process of creating the master keys and chain code from a root seed: Figure 3.9 - Generating an HD Wallet's root seed, code words, and master keys Using a child key derivation function, children keys can be generated from the master or parent keys. 
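The two steps described above, splitting the seed plus checksum into 11-bit groups and deriving a master private key with a chain code, can be sketched as follows. This is a minimal sketch based on our reading of BIP-0039 and BIP-0032, not the code of any particular wallet. The function names are our own, the word list is not reproduced (so the sketch stops at the word indexes), and the 16-byte example value is fed directly into the master key step for illustration, whereas a BIP-0039 wallet would first stretch the mnemonic into a 64-byte seed.

```python
import hashlib
import hmac
from typing import List, Tuple


def mnemonic_indices(entropy: bytes) -> List[int]:
    """Split entropy plus checksum into 11-bit word indexes (0 to 2047).

    The checksum is the first len(entropy) * 8 / 32 bits of SHA-256(entropy).
    Intended for entropy of 16 to 32 bytes.
    """
    entropy_bits = len(entropy) * 8
    checksum_bits = entropy_bits // 32
    checksum = hashlib.sha256(entropy).digest()[0] >> (8 - checksum_bits)
    combined = (int.from_bytes(entropy, "big") << checksum_bits) | checksum
    total_bits = entropy_bits + checksum_bits
    return [
        (combined >> (total_bits - 11 * (i + 1))) & 0x7FF
        for i in range(total_bits // 11)
    ]


def master_key_and_chain_code(seed: bytes) -> Tuple[bytes, bytes]:
    """HMAC-SHA512 keyed with "Bitcoin seed": the left 32 bytes are the master
    private key, the right 32 bytes are the master chain code."""
    digest = hmac.new(b"Bitcoin seed", seed, hashlib.sha512).digest()
    return digest[:32], digest[32:]


root_seed = bytes.fromhex("01bd4085622ab35e0cd934adbdcce6ca")

indices = mnemonic_indices(root_seed)
print("word indexes:", indices)  # 12 values, one per mnemonic word

master_key, chain_code = master_key_and_chain_code(root_seed)
print("master private key:", master_key.hex())
print("master chain code: ", chain_code.hex())
```

A 128-bit seed gets a 4-bit checksum, giving 132 bits, which is exactly twelve 11-bit groups and therefore twelve words.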
An index is then combined with the keys and the chain code to generate and organize parent/child relationships. From each parent, two billion child keys can be created, and from each child's private key, the public key and public address can be created. In addition to generating a private key and a public address, each child can be used as a parent to generate its own list of child keys. This allows the organization of the derived keys in a tree-like structure. Hierarchically, an unlimited number of keys can be created in this way.

Figure 3.10 - The relationship between master seed, parent/child chains, and public addresses

HD wallets are very practical as thousands of keys and public addresses can be managed by one seed. The entire tree of keys can be backed up and restored simply by the passphrase.

HD wallets can be organized and shared in various useful ways. For example, in a company or organization, a parent key and chain code could be issued to generate a list of keys for each department. Each department would then have the ability to render its own set of private/public keys. Alternatively, a public parent key can be given to generate child public keys, but not the private keys. This can be useful in the example of an audit. The organization may want the auditor to prepare a balance sheet for a set of public keys, but without access to the private keys for spending.

Another use case for generating public keys from a parent public key is e-commerce. As in the example mentioned previously, you may have a website and would like to generate an unlimited number of public addresses. By generating a public parent key for the website, the shopping cart can create new public addresses in real time.

HD wallets are very useful for Bitcoin wallet applications. Next, we'll look at a software package called Electrum for setting up an HD wallet to protect your bitcoins.

Installing an HD wallet

HD wallets are very convenient and practical.
To show how we can manage an unlimited number of addresses by a single passphrase, we'll install an HD wallet software package called Electrum. Electrum is an easy-to-use desktop wallet that runs on Windows, OS/X, and Linux. It implements a secure HD wallet that is protected by a 12-word passphrase. It is able to synchronize with the blockchain, using servers that index all the Bitcoin transactions, to provide quick updates to your balances. Electrum has some nice features to help protect your bitcoins. It supports multi-signature transactions, that is transactions that require more than one key to spend coins. Multi-signature transactions are useful when you want to share the responsibility of a Bitcoin address between two or more parties, or to add an extra layer of protection to your Bitcoins. Additionally, Electrum has the ability to create a watching-only version of your wallet. This allows you to give access to your public keys to another party without releasing the private keys. This can be very useful for auditing or accounting purposes. To install Electrum, simply open the url https://electrum.org/#download and follow the instructions for your operating system. On first installation, Electrum will create for you a new wallet identified by a passphrase. Make sure that you protect this passphrase offline! Figure 3.11 - Recording the passphrase from an Electrum wallet Electrum will proceed by asking you to re-enter the passphrase to confirm you have it recorded. Finally, it will ask you for a password. This password is used to encrypt your wallet's seed and any private keys imported into your wallet on-disk. You will need this password any time you send bitcoins from your account. Bitcoins in cold storage If you are responsible for a large amount of bitcoin which can be exposed to online hacking or hardware failure, it is important to minimize your risk. 
A common schema for minimizing the risk is to split your online wallet between Hot wallet and Cold Storage. A hot wallet refers to your online wallet used for everyday deposits and withdrawals. Based on your customers' needs, you can store the minimum needed to cover the daily business. For example, Coinbase claims to hold approximately five percent of the total bitcoins on deposit in their hot wallet. The remaining amount is stored in cold storage. Cold storage is an offline wallet for bitcoin. Addresses are generated, typically from a deterministic wallet, with their passphrase and private keys stored offline. Periodically, depending on their day-to-day needs, bitcoins are transferred to and from the cold storage. Additionally, bitcoins may be moved to Deep cold storage. These bitcoins are generally more difficult to retrieve. While cold storage transfer may easily be done to cover the needs of the hot wallet, a deep cold storage schema may involve physically accessing the passphrase / private keys from a safe, a safety deposit box, or a bank vault. The reasoning is to slow down the access as much as possible. Cold storage with Electrum We can use Electrum to create a hot wallet and a cold storage wallet. To exemplify, let's imagine a business owner who wants to accept bitcoin from his PC cash register. For security reasons, he may want to allow access to the generation of new addresses to receive Bitcoin, but not access to spending them. Spending bitcoins from this wallet will be secured by a protected computer. To start, create a normal Electrum wallet on the protected computer. Secure the passphrase and assign a strong password to the wallet. Then, from the menu, select Wallet | Master Public Keys. The key will be displayed as shown in figure 3.12. Copy this number and save it to a USB key. Figure 3.12 - Your Electrum wallet's public master key Your master public key can be used to generate new public keys, but without access to the private keys. 
As mentioned in the previous examples, this has many practical uses, as in our example with the cash register. Next, from your cash register, install Electrum. On setup, or from File | New/Restore, choose Restore a wallet or import keys and the Standard wallet type: Figure 3.13 - Setting up a cash register wallet with Electrum On the next screen, Electrum will prompt you to enter your public master key. Once accepted, Electrum will generate your wallet from the public master key. When ready, your new wallet will be ready to accept bitcoin without access to the private keys. WARNING: If you import private keys into your Electrum wallet, they cannot be restored from your passphrase or public master key. They have not been generated by the root seed and exist independently in the wallet. If you import private keys, make sure to back up the wallet file after every import. Verifying access to a private key When working with public addresses, it may be important to prove that you have access to a private key. By using Bitcoin's cryptographic ability to sign a message, you can verify that you have access to the key without revealing it. This can be offered as proof from a trustee that they control the keys. Using Electrum's built-in message signing feature, we can use the private key in our wallet to sign a message. The message, combined with the digital signature and public address, can later be used to verify that it was signed with the original private key. To begin, choose an address from your wallet. In Electrum, your addresses can be found under the Addresses tab. Next, right click on an address and choose Sign/verify Message. A dialog box allowing you to sign a message will appear: Figure 3.14 - Electrum's Sign/Verify Message features As shown in figure 3.14, you can enter any message you like and sign it with the private key of the address shown. 
This process will produce a digital signature that can be shared with others to prove that you have access to the private key. To verify the signature on another computer, simply open Electrum and choose Tools | Sign/verify Message from the menu. You will be prompted with the same dialog as shown in figure 3.14. Copy and paste the message, the address, and the digital signature, and click Verify. The results will be displayed.

By requesting a signed message from someone, you can verify that they do, in fact, have control of the private key. This is useful for making sure that the trustee of a cold storage wallet has access to the private keys without releasing or sharing them. Another good use of message signing is to prove that someone has control of some quantity of bitcoin. By signing a message that includes the public address with funds, one can see that the party is the owner of the funds. Finally, signing and verifying a message can be useful for testing your backups. You can test your private key and public address completely offline, without actually sending bitcoin to the address.

Good housekeeping with Bitcoin

To ensure the safe-keeping of your bitcoin, it's important to protect your private keys by following a short list of best practices:

Never store your private keys unencrypted on your hard drive or in the cloud: Unencrypted wallets can easily be stolen by hackers, viruses, or malware. Make sure your keys are always encrypted before being saved to disk.

Never send money to a Bitcoin address without a backup of the private keys: It's really important that you have a backup of your private key before sending money to its public address. There are stories of early adopters who have lost significant amounts of bitcoin because of hardware failures or inadvertent mistakes.

Always test your backup process by repeating the recovery steps: When setting up a backup plan, make sure to test your plan by backing up your keys, sending a small amount to the address, and recovering the amount from the backup. Message signing and verification is also a useful way to test your private key backups offline.

Ensure that you have a secure location for your paper wallets: Unauthorized access to your paper wallets can result in the loss of your bitcoin. Make sure that you keep your wallets in a secure safe, in a bank safety deposit box, or in a vault. It's advisable to keep copies of your wallets in multiple locations.

Keep multiple copies of your paper wallets: Paper can easily be damaged by water or direct sunlight. Make sure that you keep multiple copies of your paper wallets in plastic bags, protected from direct light with a cover.

Consider writing a testament or will for your Bitcoin wallets: The testament should name who has access to the bitcoin and how they will be distributed. Make sure that you include instructions on how to recover the coins.

Never forget your wallet's password or passphrase: This sounds obvious, but it must be emphasized. There is no way to recover a lost password or passphrase.

Always use a strong passphrase: A strong passphrase should meet the following requirements:
It should be long and difficult to guess
It should not be from a famous publication: literature, holy books, and so on
It should not contain personal information
It should be easy to remember and type accurately
It should not be reused between sites and applications

Summary

So far, we've covered the basics of how to get started with Bitcoin. We've provided a tutorial for setting up an online wallet and for how to buy Bitcoin in 15 minutes. We've covered online exchanges and marketplaces, and how to safely store and protect your bitcoin.
Packt
25 Jun 2013
3 min read

Linking Section Access to multiple dimensions

Getting ready

Load the following script:

Product:
LOAD * INLINE [
    ProductID, ProductGroup, ProductName
    1, GroupA, Great As
    2, GroupC, Super Cs
    3, GroupC, Mega Cs
    4, GroupB, Good Bs
    5, GroupB, Busy Bs
];

Customer:
LOAD * INLINE [
    CustomerID, CustomerName, Country
    1, Gatsby Gang, USA
    2, Charly Choc, USA
    3, Donnie Drake, USA
    4, London Lamps, UK
    5, Shylock Homes, UK
];

Sales:
LOAD * INLINE [
    CustomerID, ProductID, Sales
    1, 2, 3536
    1, 3, 4333
    1, 5, 2123
    2, 2, 4556
    2, 4, 1223
    2, 5, 6789
    3, 2, 1323
    3, 3, 3245
    3, 4, 6789
    4, 2, 2311
    4, 3, 1333
    5, 1, 7654
    5, 2, 3455
    5, 3, 6547
    5, 4, 2854
    5, 5, 9877
];

CountryLink:
Load Distinct Country, Upper(Country) As COUNTRY_LINK
Resident Customer;
Load Distinct Country, 'ALL' As COUNTRY_LINK
Resident Customer;

ProductLink:
Load Distinct ProductGroup, Upper(ProductGroup) As PRODUCT_LINK
Resident Product;
Load Distinct ProductGroup, 'ALL' As PRODUCT_LINK
Resident Product;

//Section Access;
Access:
LOAD * INLINE [
    ACCESS, USERID, PRODUCT_LINK, COUNTRY_LINK
    ADMIN, ADMIN, *, *
    USER, GM, ALL, ALL
    USER, CM1, ALL, USA
    USER, CM2, ALL, UK
    USER, PM1, GROUPA, ALL
    USER, PM2, GROUPB, ALL
    USER, PM3, GROUPC, ALL
    USER, SM1, GROUPB, UK
    USER, SM2, GROUPA, USA
];

Section Application;

Note that a loop error is generated on reload because there is a loop in the data structure.

How to do it…

Follow these steps to link Section Access to multiple dimensions:

1. Add list boxes to the layout for ProductGroup and Country. Add a statistics box for Sales.
2. Remove // to uncomment the Section Access statement.
3. From the Settings menu, open Document Properties and select the Opening tab. Turn on the Initial Data Reduction Based on Section Access option.
4. Reload and save the document. Close QlikView.
5. Re-open QlikView and open the document. Log in as the Country Manager user, CM1. Note that USA is the only country. Also, the product group GroupA is missing, because there are no sales of this product group in the USA.
6. Close QlikView and then re-open it. This time, log in as the Sales Manager, SM2. You will not be allowed access to the document.
7. Log into the document as the ADMIN user. Edit the script. Add a second entry for the SM2 user in the Access table as follows:
    USER, SM2, GROUPA, USA
    USER, SM2, GROUPB, UK
8. Reload, save, and close the document and QlikView. Re-open and log in as SM2. Note the selections.

How it works…

Section Access is really quite simple. The user is connected to the data and the data is reduced accordingly. QlikView allows Section Access tables to be connected to multiple dimensions in the main data structure without causing issues with loops. Each associated field acts in the same way as a selection in the layout.

The initial setting for the SM2 user contained values that were mutually exclusive. Because of the default Strict Exclusion setting, the SM2 user cannot log in.

We changed the script and included multiple rows for the SM2 user. Intuitively, we might expect that, as the first row did not connect to the data, only the second row would connect to the data. However, each field value is treated as an individual selection and all of the values are included.

There's more…

If we wanted to include solely the composite association of Country and ProductGroup, we would need to derive a composite key in the data set and connect the user to that.

In this example, we used the USERID field to test using QlikView logins. However, we would normally use NTNAME to link the user to either a Windows login or a custom login.
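The difference between per-field reduction and a composite key, as described in the recipe above, can be modelled outside QlikView. The Python sketch below is only a model of the behaviour the recipe describes, not QlikView's engine: the function names are hypothetical, the 'ALL' link values are left out for brevity, and the Sales row "2, 2, 4556" follows our reading of a garbled line in the inline table.

```python
# Data copied from the recipe's inline tables (product groups in upper case,
# matching the PRODUCT_LINK values).
PRODUCT_GROUP = {1: "GROUPA", 2: "GROUPC", 3: "GROUPC", 4: "GROUPB", 5: "GROUPB"}
CUSTOMER_COUNTRY = {1: "USA", 2: "USA", 3: "USA", 4: "UK", 5: "UK"}
SALES = [  # (CustomerID, ProductID, Sales)
    (1, 2, 3536), (1, 3, 4333), (1, 5, 2123),
    (2, 2, 4556), (2, 4, 1223), (2, 5, 6789),  # 4556: assumed reading
    (3, 2, 1323), (3, 3, 3245), (3, 4, 6789),
    (4, 2, 2311), (4, 3, 1333),
    (5, 1, 7654), (5, 2, 3455), (5, 3, 6547), (5, 4, 2854), (5, 5, 9877),
]


def per_field_reduction(access_rows):
    """Treat every field value as its own selection: keep sales whose product
    group is in ANY access row and whose country is in ANY access row."""
    groups = {group for group, _ in access_rows}
    countries = {country for _, country in access_rows}
    return [s for s in SALES
            if PRODUCT_GROUP[s[1]] in groups and CUSTOMER_COUNTRY[s[0]] in countries]


def composite_key_reduction(access_rows):
    """Keep only sales whose (group, country) PAIR appears in an access row."""
    allowed = set(access_rows)
    return [s for s in SALES
            if (PRODUCT_GROUP[s[1]], CUSTOMER_COUNTRY[s[0]]) in allowed]


sm2_initial = [("GROUPA", "USA")]
sm2_updated = [("GROUPA", "USA"), ("GROUPB", "UK")]

first_try = per_field_reduction(sm2_initial)         # nothing left: login denied
after_change = per_field_reduction(sm2_updated)      # GroupA and GroupB, both countries
composite_only = composite_key_reduction(sm2_updated)  # only GroupB sales in the UK

print(len(first_try), len(after_change), len(composite_only))
```

With the two SM2 rows, per-field reduction also exposes GroupB sales in the USA and GroupA sales in the UK, which neither row names as a pair. The composite key restricts SM2 to exactly the listed combinations.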
Packt
21 Jun 2011
7 min read

Segmenting images in OpenCV

OpenCV (Open Source Computer Vision) is an open source library containing more than 500 optimized algorithms for image and video analysis. Since its introduction in 1999, it has been largely adopted as the primary development tool by the community of researchers and developers in computer vision. OpenCV was originally developed at Intel by a team led by Gary Bradski as an initiative to advance research in vision and promote the development of rich, vision-based CPU-intensive applications.

In the previous article by Robert Laganière, author of OpenCV 2 Computer Vision Application Programming Cookbook, we took a look at image processing using morphological filters. In this article we will see how to segment images using watersheds and the GrabCut algorithm.

Segmenting images using watersheds

The watershed transformation is a popular image processing algorithm that is used to quickly segment an image into homogeneous regions. It relies on the idea that when the image is seen as a topographic relief, homogeneous regions correspond to relatively flat basins delimited by steep edges. As a result of its simplicity, the original version of this algorithm tends to over-segment the image, which produces multiple small regions. This is why OpenCV proposes a variant of this algorithm that uses a set of predefined markers which guide the definition of the image segments.

How to do it...

The watershed segmentation is obtained through the use of the cv::watershed function. The input to this function is a 32-bit signed integer marker image in which each non-zero pixel represents a label. The idea is to mark some pixels of the image that are known to certainly belong to a given region.
From this initial labeling, the watershed algorithm will determine the regions to which the other pixels belong. In this recipe, we will first create the marker image as a gray-level image, and then convert it into an image of integers. We conveniently encapsulated this step into a WatershedSegmenter class:

    #include <opencv2/core/core.hpp>
    #include <opencv2/imgproc/imgproc.hpp>

    class WatershedSegmenter {
      private:
        cv::Mat markers;
      public:
        void setMarkers(const cv::Mat& markerImage) {
            // Convert to image of ints
            markerImage.convertTo(markers,CV_32S);
        }
        cv::Mat process(const cv::Mat &image) {
            // Apply watershed
            cv::watershed(image,markers);
            return markers;
        }
    };

The way these markers are obtained depends on the application. For example, some preprocessing steps might have resulted in the identification of some pixels belonging to an object of interest. The watershed would then be used to delimit the complete object from that initial detection. In this recipe, we will simply use the binary image used in the previous article (OpenCV: Image Processing using Morphological Filters) in order to identify the animals of the corresponding original image.

Therefore, from our binary image, we need to identify pixels that certainly belong to the foreground (the animals) and pixels that certainly belong to the background (mainly the grass). Here, we will mark foreground pixels with label 255 and background pixels with label 128 (this choice is totally arbitrary; any non-zero label other than 255 would work). The other pixels, that is, the ones for which the labeling is unknown, are assigned value 0. As it is now, the binary image includes too many white pixels belonging to various parts of the image. We will then severely erode this image in order to retain only pixels belonging to the important objects:

    // Eliminate noise and smaller objects
    cv::Mat fg;
    cv::erode(binary,fg,cv::Mat(),cv::Point(-1,-1),6);

The result is the following image:

Note that a few pixels belonging to the background forest are still present. Let's simply keep them.
Therefore, they will be considered to correspond to an object of interest. Similarly, we also select a few pixels of the background by a large dilation of the original binary image:

    // Identify image pixels without objects
    cv::Mat bg;
    cv::dilate(binary,bg,cv::Mat(),cv::Point(-1,-1),6);
    cv::threshold(bg,bg,1,128,cv::THRESH_BINARY_INV);

The resulting black pixels correspond to background pixels. This is why the thresholding operation immediately after the dilation assigns to these pixels the value 128. The following image is then obtained:

These images are combined to form the marker image:

    // Create markers image
    cv::Mat markers(binary.size(),CV_8U,cv::Scalar(0));
    markers= fg+bg;

Note how we used the overloaded operator+ here in order to combine the images. This is the image that will be used as input to the watershed algorithm:

The segmentation is then obtained as follows:

    // Create watershed segmentation object
    WatershedSegmenter segmenter;
    // Set markers and process
    segmenter.setMarkers(markers);
    segmenter.process(image);

The marker image is then updated such that each zero pixel is assigned one of the input labels, while the pixels belonging to the found boundaries have value -1. The resulting image of labels is then:

The boundary image is:

How it works...

As we did in the preceding recipe, we will use the topographic map analogy in the description of the watershed algorithm. In order to create a watershed segmentation, the idea is to progressively flood the image starting at level 0. As the level of "water" progressively increases (to levels 1, 2, 3, and so on), catchment basins are formed. The size of these basins also gradually increases and, consequently, the water of two different basins will eventually merge. When this happens, a watershed is created in order to keep the two basins separated. Once the level of water has reached its maximal level, the sets of these created basins and watersheds form the watershed segmentation.
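The flooding idea can be made concrete with a small, self-contained sketch. The Python code below is not OpenCV's implementation of cv::watershed; it is a simplified priority-flood written only for illustration, and the function name marker_watershed is hypothetical. It starts the flood from pixels that already carry a positive label (the marker-based variant that OpenCV uses), raises the water level by always processing the lowest unlabeled pixel next, and writes -1 where two differently labeled basins meet.

```python
import heapq


def marker_watershed(height, markers):
    """Flood a 2D height map from labeled seeds (simplified illustration).

    height  -- list of rows of numbers (the "relief")
    markers -- list of rows of ints: >0 for seed labels, 0 for unknown
    Returns a new label grid in which basin boundaries are set to -1.
    """
    rows, cols = len(height), len(height[0])
    labels = [row[:] for row in markers]
    heap, queued, counter = [], set(), 0

    def neighbours(r, c):
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols:
                yield rr, cc

    def push_unlabeled_neighbours(r, c):
        nonlocal counter
        for rr, cc in neighbours(r, c):
            if labels[rr][cc] == 0 and (rr, cc) not in queued:
                # The counter keeps insertion order for equal heights.
                heapq.heappush(heap, (height[rr][cc], counter, rr, cc))
                queued.add((rr, cc))
                counter += 1

    # Start the flood at the border of every seed region.
    for r in range(rows):
        for c in range(cols):
            if labels[r][c] > 0:
                push_unlabeled_neighbours(r, c)

    # Raise the water level: the lowest pending pixel is handled first.
    while heap:
        _, _, r, c = heapq.heappop(heap)
        touching = {labels[rr][cc] for rr, cc in neighbours(r, c)
                    if labels[rr][cc] > 0}
        if len(touching) == 1:
            labels[r][c] = touching.pop()      # the basin grows
            push_unlabeled_neighbours(r, c)
        else:
            labels[r][c] = -1                  # two basins meet: watershed
    return labels


relief = [[0, 1, 5, 1, 0]]
seeds = [[1, 0, 0, 0, 2]]
print(marker_watershed(relief, seeds))
```

On this one-row relief the two seeds grow towards the ridge of height 5, which ends up as the watershed pixel separating label 1 from label 2.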
As one can expect, the flooding process initially creates many small individual basins. When all of these are merged, many watershed lines are created, which results in an over-segmented image. To overcome this problem, a modification to this algorithm has been proposed in which the flooding process starts from a predefined set of marked pixels. The basins created from these markers are labeled in accordance with the values assigned to the initial marks. When two basins having the same label merge, no watersheds are created, thus preventing the oversegmentation.

This is what happens when the cv::watershed function is called. The input marker image is updated to produce the final watershed segmentation. Users can input a marker image with any number of labels, with pixels of unknown labeling left at value 0. The marker image has been chosen to be an image of 32-bit signed integers in order to be able to define more than 255 labels. It also allows the special value -1 to be assigned to pixels associated with a watershed. This is what is returned by the cv::watershed function.

To facilitate the displaying of the result, we have introduced two special methods. The first one returns an image of the labels (with watersheds at value 0). This is easily done through thresholding:

    // Return result in the form of an image
    cv::Mat getSegmentation() {
        cv::Mat tmp;
        // all segments with a label higher than 255
        // will be assigned value 255
        markers.convertTo(tmp,CV_8U);
        return tmp;
    }

Similarly, the second method returns an image in which the watershed lines are assigned value 0, and the rest of the image is at 255.
This time, the linear transformation offered by the cv::Mat::convertTo method is used to achieve this result:

    // Return watershed in the form of an image
    cv::Mat getWatersheds() {
        cv::Mat tmp;
        // Each pixel p is transformed into
        // 255p+255 before conversion
        markers.convertTo(tmp,CV_8U,255,255);
        return tmp;
    }

The linear transformation that is applied before the conversion allows -1 pixels to be converted into 0 (since -1*255+255=0). Every other pixel yields a transformed value of 255 or more and is therefore assigned the value 255. This is due to the saturation operation that is applied when signed integers are converted into unsigned chars.

See also

The article The viscous watershed transform by C. Vachier, F. Meyer, Journal of Mathematical Imaging and Vision, volume 22, issue 2-3, May 2005, for more information on the watershed transform.

The next recipe, which presents another image segmentation algorithm that can also segment an image into background and foreground objects.
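The saturation arithmetic behind getSegmentation and getWatersheds can be checked by hand with a few lines of Python. This is a sketch of the per-pixel rule only (scale, shift, then clamp to the 0..255 range of an unsigned char); the helper names saturate_u8 and convert_to_u8 are hypothetical and the snippet does not call OpenCV.

```python
def saturate_u8(value):
    """Clamp a number to the range of an 8-bit unsigned pixel."""
    return max(0, min(255, int(round(value))))


def convert_to_u8(pixel, alpha=1.0, beta=0.0):
    """Per-pixel model of a scaled conversion: alpha * pixel + beta, clamped."""
    return saturate_u8(pixel * alpha + beta)


# getSegmentation(): plain conversion, so watersheds (-1) become 0
# and any label above 255 is capped at 255.
for label in (-1, 0, 128, 255, 300):
    print(label, "->", convert_to_u8(label))

# getWatersheds(): 255*p + 255, so only the -1 pixels stay at 0.
for label in (-1, 0, 128, 255):
    print(label, "->", convert_to_u8(label, 255, 255))
```

With the two labels used in this recipe (128 and 255), the first loop leaves both labels unchanged, while the second maps everything except the watershed pixels to 255.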
Packt
01 Mar 2011
12 min read

Oracle GoldenGate 11g: Performance Tuning

Oracle states that GoldenGate can achieve near real-time data replication. However, out of the box, GoldenGate may not meet your performance requirements. Here we focus on the main areas that lend themselves to tuning, especially parallel processing and load balancing, enabling high data throughput and very low latency.

Let's start by taking a look at some of the considerations before we start tuning Oracle GoldenGate.

Before tuning GoldenGate

There are a number of considerations we need to be aware of before we start the tuning process. For one, we must consider the underlying system and its ability to perform. Let's start by looking at the source of data that GoldenGate needs for replication to work: the online redo logs.

Online redo

Before we start tuning GoldenGate, we must look at both the source and target databases and their ability to read/write data. Data replication is I/O intensive, so fast disks are important, particularly for the online redo logs. Redo logs play an important role in GoldenGate: they are constantly being written to by the database and concurrently being read by the Extract process.
Furthermore, adding supplemental logging to a database can increase their size by a factor of 4!

Firstly, ensure that only the necessary amount of supplemental logging is enabled on the database. In the case of GoldenGate, the logging of the Primary Key is all that is required.

Next, take a look at the database wait events, in particular the ones that relate to redo. For example, if you are seeing "Log File Sync" waits, this is an indicator that either your disk writes are too slow or your application is committing too frequently, or a combination of both.

RAID5 is another common problem for redo log writes. Ideally, these files should be placed on their own mirrored storage such as RAID1+0 (mirrored striped sets) or Flash disks. Many argue this to be a misconception with modern high speed disk arrays, but some production systems are still known to be suffering from redo I/O contention on RAID5.

An adequate number (and size) of redo groups must be configured to prevent "checkpoint not complete" or "cannot allocate new log" warnings appearing in the database instance alert log. This occurs when Oracle attempts to reuse a log file before the checkpoint that flushes the corresponding blocks from the DB buffer cache to disk has completed, so the log is still required for crash recovery. The database must wait until that checkpoint completes before the online redo log file can be reused, effectively stalling the database and any redo generation.

Large objects (LOBs)

Know your data. LOBs can be a problem in data replication by virtue of their size and the ability to extract, transmit, and deliver the data from source to target. Tables containing LOB datatypes should be isolated from regular data to use a dedicated Extract, Data Pump, and Replicat process group to enhance throughput. Also ensure that the target table has a primary key to avoid Full Table Scans (FTS), an Oracle GoldenGate best practice. LOB INSERT operations can insert an empty (null) LOB into a row before updating it with the data.
This is because a LOB (depending on its size) can spread its data across multiple Logical Change Records, resulting in multiple DML operations required at the target database.

Baselining

Before we can start tuning, we must record our baseline. This will provide a reference point to tune from. We can later look back at our baseline and calculate the percentage improvement made from deploying new configurations.

An ideal baseline is to find the "breaking point" of your application requirements. For example, the following questions must be answered:

What is the maximum acceptable end-to-end latency?
What is the maximum number of application transactions per second we must accommodate?

To answer these questions we must start with a single-threaded data replication configuration having just one Extract, one Data Pump, and one Replicat process. This will provide us with a worst case scenario on which to build improvements.

Ideally, our data source should be the application itself, inserting, deleting, and updating "real data" in the source database. However, simulated data with the ability to provide throughput profiles will allow us to gauge performance accurately. Application vendors can normally provide SQL injector utilities that simulate the user activity on the system.

Balancing the load across parallel process groups

The GoldenGate documentation states "The most basic thing you can do to improve GoldenGate's performance is to divide a large number of tables among parallel processes and trails. For example, you can divide the load by schema". This statement is true, as the bottleneck is largely due to the serial nature of the Replicat process, having to "replay" transactions in commit order. Although this can be a constraining factor due to transaction dependency, increasing the number of Replicat processes increases performance significantly. However, it is highly recommended to group tables with referential constraints together per Replicat.
The number of parallel processes is typically greater on the target system than on the source. The number and ratio of processes will vary across applications and environments. Each configuration should be thoroughly tested to determine the optimal balance, but be careful not to over-allocate, as each parallel process will consume up to 55MB. Increasing the number of processes to an arbitrary value will not necessarily improve performance; in fact, performance may get worse and you will waste CPU and memory resources.

The following data flow diagram shows a load balancing configuration including two Extract processes, three Data Pumps, and five Replicats:

Considerations for using parallel process groups

To maintain data integrity, be sure to include tables with referential constraints between one another in the same parallel process group. It's also worth considering disabling referential constraints on the target database schema to allow child records to be populated before their parents, thus increasing throughput. GoldenGate will always commit transactions in the same order as the source, so data integrity is maintained.

Oracle best practice states no more than 3 Replicat processes should read the same remote trail file. To avoid contention on Trail files, pair each Replicat with its own Trail files and Extract process. Also, remember that it is easier to tune an Extract process than a Replicat process, so concentrate on your source before moving your focus to the target.

Splitting large tables into row ranges across process groups

What if you have some large tables with a high data change rate within a source schema and you cannot logically separate them from the remaining tables due to referential constraints? GoldenGate provides a solution to this problem by "splitting" the data within the same schema via the @RANGE function.
The @RANGE function can be used in the Data Pump and Replicat configuration to "split" the transaction data across a number of parallel processes.

The Replicat process is typically the source of performance bottlenecks because, in its normal mode of operation, it is a single-threaded process that applies operations one at a time by using regular DML. Therefore, to leverage parallel operation and enhance throughput, the more Replicats the better (dependent on the number of CPUs and memory available on the target system).

The RANGE function

The @RANGE function computes a hash value of the columns specified in the input. If no columns are specified, it uses the table's primary key. GoldenGate adjusts the total number of ranges to optimize the even distribution across the number of ranges specified. This concept can be compared to Hash Partitioning in Oracle tables as a means of dividing data.

With any division of data during replication, the integrity is paramount and will have an effect on performance. Therefore, tables having a relationship with other tables in the source schema must be included in the configuration. If all your source schema tables are related, you must include all the tables!

Adding Replicats with @RANGE function

The @RANGE function accepts two numeric arguments, separated by a comma:

Range: The number assigned to a process group, where the first is 1 and the second 2 and so on, up to the total number of ranges.
Total number of ranges: The total number of process groups you wish to divide using the @RANGE function.

The following example includes three related tables in the source schema and walks through the complete configuration from start to finish. For this example, we have an existing Replicat process on the target machine (dbserver2) named ROLAP01 that includes the following three tables:

ORDERS
ORDER_ITEMS
PRODUCTS

We are going to divide the rows of the tables across two Replicat groups.
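The hash-based splitting described under "The RANGE function" can be pictured with a short Python sketch. The hash that GoldenGate applies internally is not given in this article, so the CRC32 used below is purely an assumption for illustration, and the helper names range_of and range_filter are hypothetical. The point is only that a filter of the form (range, total ranges) assigns every key to exactly one process group.

```python
import zlib


def range_of(key_values, total_ranges):
    """Return the 1-based range for a row key.

    CRC32 stands in for GoldenGate's internal hash, which is an
    assumption made for this sketch only.
    """
    text = "|".join(str(value) for value in key_values)
    return zlib.crc32(text.encode("utf-8")) % total_ranges + 1


def range_filter(key_values, range_number, total_ranges):
    """Model of FILTER (@RANGE (range_number, total_ranges)) for one row."""
    return range_of(key_values, total_ranges) == range_number


# Split 1,000 hypothetical ORDERS primary keys across two process groups.
order_ids = range(1, 1001)
group_1 = [oid for oid in order_ids if range_filter((oid,), 1, 2)]
group_2 = [oid for oid in order_ids if range_filter((oid,), 2, 2)]

# Every key is applied by exactly one group, never by both or by none.
print(len(group_1), len(group_2), len(group_1) + len(group_2))
```

Because the assignment depends only on the key values, a given row always lands in the same range, which is what allows each Replicat to apply its share of the changes independently.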
The source database schema name is SRC and target schema TGT. The following steps add a new Replicat named ROLAP02 with the relevant configuration and adjust the Replicat ROLAP01 parameters to suit. Note that before making any changes, you must stop the existing Replicat processes and determine their Relative Byte Address (RBA) and Trail file log sequence number. This is important information that we will use to tell the new Replicat process from which point to start.

First check if the existing Replicat process is running:

    GGSCI (dbserver2) 1> info all

    Program     Status      Group       Lag         Time Since Chkpt
    MANAGER     RUNNING
    REPLICAT    RUNNING     ROLAP01     00:00:00    00:00:02

Stop the existing Replicat process:

    GGSCI (dbserver2) 2> stop REPLICAT ROLAP01
    Sending STOP request to REPLICAT ROLAP01...
    Request processed.

Add the new Replicat process, using the existing trail file:

    GGSCI (dbserver2) 3> add REPLICAT ROLAP02, exttrail ./dirdat/tb
    REPLICAT added.

Now add the configuration by creating a new parameter file for ROLAP02:

    GGSCI (dbserver2) 4> edit params ROLAP02

    --
    -- Example Replicator parameter file to apply changes
    -- to target tables
    --
    REPLICAT ROLAP02
    SOURCEDEFS ./dirdef/mydefs.def
    SETENV (ORACLE_SID=OLAP)
    USERID ggs_admin, PASSWORD ggs_admin
    DISCARDFILE ./dirrpt/rolap02.dsc, PURGE
    ALLOWDUPTARGETMAP
    CHECKPOINTSECS 30
    GROUPTRANSOPS 2000
    MAP SRC.ORDERS, TARGET TGT.ORDERS, FILTER (@RANGE (1,2));
    MAP SRC.ORDER_ITEMS, TARGET TGT.ORDER_ITEMS, FILTER (@RANGE (1,2));
    MAP SRC.PRODUCTS, TARGET TGT.PRODUCTS, FILTER (@RANGE (1,2));

Now edit the configuration of the existing Replicat process, and add the @RANGE function to the FILTER clause of the MAP statement. Note the inclusion of the GROUPTRANSOPS parameter to enhance performance by increasing the number of operations allowed in a Replicat transaction.
    GGSCI (dbserver2) 5> edit params ROLAP01

    --
    -- Example Replicator parameter file to apply changes
    -- to target tables
    --
    REPLICAT ROLAP01
    SOURCEDEFS ./dirdef/mydefs.def
    SETENV (ORACLE_SID=OLAP)
    USERID ggs_admin, PASSWORD ggs_admin
    DISCARDFILE ./dirrpt/rolap01.dsc, PURGE
    ALLOWDUPTARGETMAP
    CHECKPOINTSECS 30
    GROUPTRANSOPS 2000
    MAP SRC.ORDERS, TARGET TGT.ORDERS, FILTER (@RANGE (2,2));
    MAP SRC.ORDER_ITEMS, TARGET TGT.ORDER_ITEMS, FILTER (@RANGE (2,2));
    MAP SRC.PRODUCTS, TARGET TGT.PRODUCTS, FILTER (@RANGE (2,2));

Check that both the Replicat processes exist:

    GGSCI (dbserver2) 6> info all

    Program     Status      Group       Lag         Time Since Chkpt
    MANAGER     RUNNING
    REPLICAT    STOPPED     ROLAP01     00:00:00    00:10:35
    REPLICAT    STOPPED     ROLAP02     00:00:00    00:12:25

Before starting both Replicat processes, obtain the log Sequence Number (SEQNO) and Relative Byte Address (RBA) from the original trail file:

    GGSCI (dbserver2) 7> info REPLICAT ROLAP01, detail

    REPLICAT   ROLAP01   Last Started 2010-04-01 15:35   Status STOPPED
    Checkpoint Lag       00:00:00 (updated 00:12:43 ago)
    Log Read Checkpoint  File ./dirdat/tb000279                    <- SEQNO
                         2010-04-08 12:27:00.001016  RBA 43750979  <- RBA

    Extract Source        Begin               End
    ./dirdat/tb000279     2010-04-01 12:47    2010-04-08 12:27
    ./dirdat/tb000257     2010-04-01 04:30    2010-04-01 12:47
    ./dirdat/tb000255     2010-03-30 13:50    2010-04-01 04:30
    ./dirdat/tb000206     2010-03-30 13:50    First Record
    ./dirdat/tb000206     2010-03-30 04:30    2010-03-30 13:50
    ./dirdat/tb000184     2010-03-30 04:30    First Record
    ./dirdat/tb000184     2010-03-30 00:00    2010-03-30 04:30
    ./dirdat/tb000000     *Initialized*       2010-03-30 00:00
    ./dirdat/tb000000     *Initialized*       First Record

Adjust the new Replicat process ROLAP02 to adopt these values, so that the process knows where to start from on startup:

    GGSCI (dbserver2) 8> alter replicat ROLAP02, extseqno 279
    REPLICAT altered.

    GGSCI (dbserver2) 9> alter replicat ROLAP02, extrba 43750979
    REPLICAT altered.
Failure to complete this step will result in either duplicate data or ORA-00001 errors against the target schema, because GoldenGate will attempt to replicate the data from the beginning of the initial trail file (./dirdat/tb000000) if it exists; otherwise the process will abend.

Start both Replicat processes. Note the use of the wildcard (*):

    GGSCI (dbserver2) 10> start replicat ROLAP*
    Sending START request to MANAGER ...
    REPLICAT ROLAP01 starting
    Sending START request to MANAGER ...
    REPLICAT ROLAP02 starting

Check if both Replicat processes are running:

    GGSCI (dbserver2) 11> info all

    Program     Status      Group       Lag         Time Since Chkpt
    MANAGER     RUNNING
    REPLICAT    RUNNING     ROLAP01     00:00:00    00:00:22
    REPLICAT    RUNNING     ROLAP02     00:00:00    00:00:14

Check the detail of the new Replicat process:

    GGSCI (dbserver2) 12> info REPLICAT ROLAP02, detail

    REPLICAT   ROLAP02   Last Started 2010-04-08 14:18   Status RUNNING
    Checkpoint Lag       00:00:00 (updated 00:00:06 ago)
    Log Read Checkpoint  File ./dirdat/tb000279
                         First Record  RBA 43750979

    Extract Source        Begin               End
    ./dirdat/tb000279     * Initialized *     First Record
    ./dirdat/tb000279     * Initialized *     First Record
    ./dirdat/tb000279     * Initialized *     2010-04-08 12:26
    ./dirdat/tb000279     * Initialized *     First Record

Generate a report for the new Replicat process ROLAP02:

    GGSCI (dbserver2) 13> send REPLICAT ROLAP02, report
    Sending REPORT request to REPLICAT ROLAP02 ...
    Request processed.

Now view the report to confirm the new Replicat process has started from the specified start point (RBA 43750979 and SEQNO 279). The following is an extract from the report:

    GGSCI (dbserver2) 14> view report ROLAP02

    2010-04-08 14:20:18 GGS INFO 379 Positioning with begin time: Apr 08, 2010 14:18:19 PM, starting record time: Apr 08, 2010 14:17:25 PM at extseqno 279, extrba 43750979.
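Because a mistyped sequence number or RBA leads to duplicate data or an abend, it can help to derive the two alter commands from the checkpoint values instead of typing them by hand. The Python sketch below is a hypothetical helper, not part of GoldenGate: it assumes the trail file name ends in a six-digit sequence number, as in the listings above (./dirdat/tb000279), and it only builds the command text shown in this walkthrough, which you would still paste into GGSCI yourself.

```python
import re


def trail_seqno(trail_file):
    """Extract the sequence number from a trail file name.

    Assumes the name ends in six digits, as in ./dirdat/tb000279.
    """
    match = re.search(r"(\d{6})$", trail_file)
    if match is None:
        raise ValueError(f"no six-digit sequence number in {trail_file!r}")
    return int(match.group(1))


def positioning_commands(group, trail_file, rba):
    """Build the GGSCI commands that position a new Replicat group."""
    return [
        f"alter replicat {group}, extseqno {trail_seqno(trail_file)}",
        f"alter replicat {group}, extrba {rba}",
    ]


# Values read from "info REPLICAT ROLAP01, detail" in the walkthrough.
for command in positioning_commands("ROLAP02", "./dirdat/tb000279", 43750979):
    print(command)
```

Keeping the checkpoint file name and RBA together in one call makes it harder to position the new Replicat with a sequence number from one checkpoint and an RBA from another.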