
How-To Tutorials - Data


Big Data Analytics

Packt
03 Nov 2015
10 min read
In this article, Dmitry Anoshin, the author of Learning Hunk, talks about Hadoop and Hunk: how to extract Hunk to a VM, set up a connection with Hadoop, and create dashboards.

We are living in a century of information technology. There are a lot of electronic devices around us that generate a lot of data. For example, you can surf the Internet, visit a couple of news portals, order new Air Max sneakers from a web store, write a couple of messages to your friend, and chat on Facebook. Every action produces data. Multiply these actions by the number of people who have access to the Internet, or who simply use a mobile phone, and we get really big data. How big is it? Nowadays it probably starts at terabytes, or even petabytes. Volume is not the only issue; we also struggle with the variety of data. As a result, it is not enough to analyze only structured data; we should dive deep into unstructured data, such as the machine data generated by various machines.

World-famous enterprises try to collect this extremely big data in order to monetize it and find business insights. Big data offers us new opportunities; for example, we can enrich customer data via social networks using the APIs of Facebook or Twitter. We can build customer profiles and try to predict customer wishes in order to sell our products or improve the customer experience. It is easy to say, but difficult to do. However, organizations try to overcome these challenges and use big data stores, such as Hadoop.

The big problem

Hadoop is a distributed file system and a framework for computation. It is relatively easy to get data into Hadoop; there are plenty of tools to ingest data in different formats. However, it is extremely difficult to get value out of the data you put into Hadoop. Let's look at the path from data to value. First, we start with collecting the data. Then, we spend a lot of time preparing it and making sure it is available for analysis, so that we can ask questions of it. Unfortunately, the questions we ask may not be good, or the answers we get may not be clear, and we have to repeat the cycle all over again, perhaps transforming and reformatting the data. In other words, it is a long and challenging process. What you actually want is to collect the data, spend some time preparing it, and then be able to ask questions and get answers from the data repeatedly. You can then spend your time asking multiple questions and iterating on them to refine the answers you are looking for.

The elegant solution

What if we could take Splunk and put it on top of all the data stored in Hadoop? That is exactly what Splunk did, and the name of the new product, Hunk, comes from combining Hadoop and Splunk. Let's discuss some of the solution goals the Hunk inventors had in mind when they were planning Hunk. Splunk can take data from Hadoop via the Splunk Hadoop Connection app; however, it is a bad idea to copy massive amounts of data from Hadoop to Splunk. It is much better to process the data in place, because Hadoop provides both storage and computation, so why not take advantage of both? Splunk has the extremely powerful Splunk Processing Language (SPL), with its wide range of analytic functions, and this is one of Splunk's main advantages, so it is a good idea to keep SPL in the new product. Splunk has true schema on the fly.
The data that we store in Hadoop changes constantly, so Hunk should be able to build the schema on the fly, independent of the format of the data. It is also very useful to be able to preview results: while a search is running, you get incremental results, which can dramatically reduce waiting time. For example, we don't need to wait until the MapReduce job has finished; we can look at the incremental results and, if they look wrong, restart the search query. Finally, the deployment of Hadoop is not easy, so Splunk tries to make the installation and configuration of Hunk easy for us.

Getting up Hunk

In order to start exploring the Hadoop data, we have to install Hunk on top of our Hadoop cluster. Hunk is easy to install and configure. Let's learn how to deploy Hunk version 6.2.1 on top of an existing CDH cluster. It's assumed that your VM is up and running.

Extracting Hunk to VM

To extract Hunk to the VM, perform the following steps:

1. Open the console application.
2. Run ls -la to see the list of files in your home directory:

[cloudera@quickstart ~]$ cd ~
[cloudera@quickstart ~]$ ls -la | grep hunk
-rw-r--r--   1 root     root     113913609 Mar 23 04:09 hunk-6.2.1-249325-Linux-x86_64.tgz

3. Unpack the archive:

cd /opt
sudo tar xvzf /home/cloudera/hunk-6.2.1-249325-Linux-x86_64.tgz -C /opt

Setting up Hunk variables and configuration files

Perform the following steps to set up the Hunk variables and configuration files:

1. Set the SPLUNK_HOME environment variable. This variable is already added to the profile; it is mentioned here only to stress that it must be set:

export SPLUNK_HOME=/opt/hunk

2. Use the default splunk-launch.conf. This is the basic properties file used by the Hunk service. We don't need to change anything special here, so let's use the default settings:

sudo cp /opt/hunk/etc/splunk-launch.conf.default /opt/hunk/etc/splunk-launch.conf

Running Hunk for the first time

Perform the following steps to run Hunk:

1. Start Hunk:

sudo /opt/hunk/bin/splunk start --accept-license

2. Here is sample output from the first run (some output lines have been removed to reduce the amount of log text):

This appears to be your first time running this version of Splunk.
Copying '/opt/hunk/etc/openldap/ldap.conf.default' to '/opt/hunk/etc/openldap/ldap.conf'.
Generating RSA private key, 1024 bit long modulus
Waiting for web server at http://127.0.0.1:8000 to be available.... Done
If you get stuck, we're here to help. Look for answers here: http://docs.splunk.com
The Splunk web interface is at http://vm-cluster-node1.localdomain:8000

Setting up a data provider and virtual index for the CDR data

We need to accomplish two tasks: provide a technical connector to the underlying data storage, and create a virtual index for the data on that storage. Log in to http://quickstart.cloudera:8000. The system will ask you to change the default admin user password; I set it to admin.

Setting up a connection to Hadoop

Now we are ready to set up the integration between Hadoop and Hunk. First, we need to specify the way Hunk connects to the current Hadoop installation; we are using the most recent way, YARN with MR2. Then, we have to point virtual indexes to the data stored in Hadoop. To do this, perform the following steps:

1. Click on Explore Data.
2. Click on Create a provider and fill in the form to create the data provider:

Property name               Value
Name                        hadoop-hunk-provider
Java home                   /usr/java/jdk1.7.0_67-cloudera
Hadoop home                 /usr/lib/hadoop
Hadoop version              Hadoop 2.x (YARN)
Filesystem                  hdfs://quickstart.cloudera:8020
Resource Manager Address    quickstart.cloudera:8032
Resource Scheduler Address  quickstart.cloudera:8030
HDFS Working Directory      /user/hunk
Job Queue                   default

You don't have to modify any other properties. The HDFS working directory has been created for you in advance; you can create it yourself using the following command:

sudo -u hdfs hadoop fs -mkdir -p /user/hunk

If you did everything correctly, you should see a screen similar to the following screenshot. Let's briefly discuss what we have done. We told Hunk where the Hadoop home and Java home are. Hunk uses Hadoop streaming internally, so it needs to know how to call Java and Hadoop streaming. You can inspect the jobs submitted by Hunk (discussed later) and see lines such as the following:

/opt/hunk/bin/jars/sudobash /usr/bin/hadoop jar "/opt/hunk/bin/jars/SplunkMR-s6.0-hy2.0.jar" "com.splunk.mr.SplunkMR"

This is the MapReduce JAR submitted by Hunk. We also need to tell Hunk where the YARN Resource Manager and Scheduler are located; these services allow us to request cluster resources and run jobs. The job queue can be useful in a production environment, where you might have several queues to distribute cluster resources. We set the queue name to default, since we are not discussing cluster utilization and load balancing here.

Setting up a virtual index for the data stored in Hadoop

Now it's time to create a virtual index. As example data, we are going to add the dataset with the Avro files to the virtual index:

1. Click on Explore Data and then click on Create a virtual index.
2. You'll get a message telling you that there are no indexes; just click on New Virtual Index. A virtual index is metadata: it tells Hunk where the data is located and which provider should be used to read it.

Property name         Value
Name                  milano_cdr_aggregated_10_min_activity
Path to data in HDFS  /masterdata/stream/milano_cdr

Here is an example of the screen you should see after you create your first virtual index.

Accessing data through the virtual index

To access data through the virtual index, perform the following steps:

1. Click on Explore Data and select a provider and virtual index.
2. Select part-m-00000.avro by clicking on it. The Next button will be activated after you pick a file.
3. Preview the data in the Preview Data tab. You should see how Hunk automatically detects the timestamp from our CDR data. Pay attention to the Time column and the field named time_interval in the Event column. The time_interval column holds the time of the record, and Hunk should automatically use that field as the time field.
4. Save the source type by clicking on Save As and then Next.
5. On the Entering Context Settings page, select search in the App context drop-down box. Then, navigate to Sharing context | All apps and click on Next.
6. The last step allows you to review what we've done. Click on Finish to complete the wizard.

Creating a dashboard

Now it's time to see how dashboards work. Let's find the regions where visitors face problems (status = 500) while using our online store:

index="digital_analytics" status=500 | iplocation clientip | geostats latfield=lat longfield=lon count by Country

You should see a map with the proportion of errors for each country. Now let's save it as a dashboard: click on Save as and select Dashboard panel from the drop-down menu.
Name it Web Operations. You should get a new dashboard with a single panel containing our report. We have several previously created reports; let's add them to the newly created dashboard as separate panels:

1. Click on Edit and then Edit panels.
2. Select Add new panel and then New from report, and add one of our previous reports.

Summary

In this article, you learned how to extract Hunk to a VM and how to set up the Hunk variables and configuration files. You learned how to run Hunk and how to set up a data provider and a virtual index for the CDR data. Setting up a connection to Hadoop and a virtual index for the data stored in Hadoop were also covered in detail. Finally, you learned how to create a dashboard.

Further resources on this subject:

- Identifying Big Data Evidence in Hadoop [Article]
- Big Data [Article]
- Understanding Hadoop Backup and Recovery Needs [Article]

Protecting Your Bitcoins

Packt
29 Oct 2015
32 min read
In this article by Richard Caetano, the author of the book Learning Bitcoin, we will explore ways to safely hold your own bitcoin. We will cover the following topics:

- Storing your bitcoins
- Working with brainwallets
- Understanding deterministic wallets
- Storing bitcoins in cold storage
- Good housekeeping with Bitcoin

Storing your bitcoins

The banking system has a legacy of offering various financial services to its customers. Banks offer convenient ways to spend money, such as cheques and credit cards, but the storage of money is their base service. For many centuries, banks have been a safe place to keep money. Customers rely on the interest paid on their deposits, as well as on government insurance against theft and insolvency. Savings accounts have helped make preserving wealth easy and accessible to a large population in the western world.

Yet, some people still save a portion of their wealth as cash or precious metals, usually in a personal safe at home or in a safety deposit box. They may be those who have, over the years, experienced or witnessed the downsides of banking: government confiscation, out-of-control inflation, or runs on the bank. Furthermore, a large population of the world does not have access to the western banking system. For those who live in remote areas or for those without credit, opening a bank account is virtually impossible. They must handle their own money properly to prevent loss or theft, and in some parts of the world, there can be great risk involved. These groups of people, who have little or no access to banking, are called the "underbanked".

For the underbanked population, Bitcoin offers immediate access to a global financial system. Anyone with access to the Internet, or who carries a mobile phone with the ability to send and receive SMS messages, can hold his or her own bitcoin and make global payments. They can essentially become their own bank. However, you must understand that Bitcoin is still in its infancy as a technology. Similar to the Internet circa 1995, it has demonstrated enormous potential, yet it lacks usability for a mainstream audience. As a parallel, e-mail in its early days was a challenge for most users to set up and use, yet today it's as simple as entering your e-mail address and password on your smartphone. Bitcoin has yet to develop through these stages. Yet, with some simple guidance, we can already start realizing its potential. Let's discuss some general guidelines for understanding how to become your own bank.

Bitcoin savings

In most normal cases, we only keep a small amount of cash in our hand wallets to protect ourselves from theft or accidental loss. Much of our money is kept in checking or savings accounts with easy access to pay our bills. Checking accounts are used to cover our rent, utility bills, and other payments, while our savings accounts hold money for longer-term goals, such as a down payment on buying a house.

It's highly advisable to develop a similar system for managing your Bitcoin money. Both local and online wallets provide a convenient way to access your bitcoins for day-to-day transactions. Yet there is the unlikely risk that one could lose his or her Bitcoin wallet due to an accidental computer crash or faulty backup. With online wallets, we run the risk of the website or the company becoming insolvent, or falling victim to cybercrime.
By developing a reliable system, we can adopt our own personal 'Bitcoin savings' account to hold our funds for long-term storage. Usually, these savings are kept offline to protect them from any kind of computer hacking. With protected access to our offline storage, we can periodically transfer money to and from our savings. Thus, we can arrange our Bitcoin funds much as we manage our money with our hand wallets and checking/savings accounts.

Paper wallets

As explained, a private key is a large random number that acts as the key to spend your bitcoins. A cryptographic algorithm is used to generate a private key and, from it, a public address. We can share the public address to receive bitcoins and, with the private key, spend the funds sent to the address. Generally, we rely on our Bitcoin wallet software to handle the creation and management of our private keys and public addresses. As these keys are stored on our computers and networks, they are vulnerable to hacking, hardware failures, and accidental loss.

Private keys and public addresses are, in fact, just strings of letters and numbers. This format makes it easy to move the keys offline for physical storage. Keys printed on paper are called "paper wallets" and are highly portable and convenient to store in a physical safe or a bank safety deposit box. With the private key generated and stored offline, we can safely send bitcoin to its public address.

A paper wallet must include at least one private key and its computed public address. Additionally, the paper wallet can include a QR code to make it convenient to retrieve the key and address. Figure 3.1 is an example of a paper wallet generated by Coinbase:

Figure 3.1 - Paper wallet generated from Coinbase

The paper wallet includes both the public address (labeled Public key) and the private key, both with QR codes to easily transfer them back to your online wallet. Also included on the paper wallet is a place for notes. This type of wallet is easy to print for safe storage. It is recommended that copies are stored securely in multiple locations in case the paper is destroyed. As the private key is shown in plain text, anyone who has access to this wallet has access to the funds. Do not store your paper wallet on your computer. Loss of the paper wallet due to hardware failure, hacking, spyware, or accidental loss can result in the complete loss of your bitcoin. Make sure you have multiple copies of your wallet printed and securely stored before transferring your money.

One time use paper wallets

Transactions from Bitcoin addresses must include the full amount. When sending a partial amount to a recipient, the remaining balance must be sent to a change address. Paper wallets that include only one private key are considered to be "one time use" paper wallets. While you can always send multiple transfers of bitcoin to the wallet, it is highly recommended that you spend the coins only once. Therefore, you shouldn't move a large number of bitcoins to the wallet expecting to spend a partial amount. With this in mind, when using one time use paper wallets, it's recommended that you only save a usable amount to each wallet. This amount could be a block of coins that you'd like to fully redeem to your online wallet.
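To make the earlier point that a private key is just a very large random number concrete, here is a minimal Python sketch (not from the book). It simply draws 256 bits from the operating system's secure random source and prints them as hexadecimal text; real wallet software goes further, encoding the key in Wallet Import Format (Base58Check) and deriving the public address from it with elliptic-curve cryptography, both of which are omitted here.

import secrets

# A Bitcoin private key is, at its core, a 256-bit random number.
# secrets.token_bytes() draws from a cryptographically secure source.
private_key = secrets.token_bytes(32)

# Displayed here as 64 hex characters; wallet software usually shows the key
# in Wallet Import Format (WIF) and derives the public address from it using
# elliptic-curve cryptography, which this sketch does not implement.
print("private key (hex):", private_key.hex())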
Creating a paper wallet

To create a paper wallet in Coinbase, simply log in with your username and password. Click on the Tools link in the left-hand side menu. Next, click on the Paper Wallets link from the top menu. Coinbase will prompt you to Generate a paper wallet or Import a paper wallet. Follow the links to generate a paper wallet. You can expect to see the paper wallet rendered as shown in figure 3.2:

Figure 3.2 - Creating a paper wallet with Coinbase

Coinbase generates your paper wallet entirely in your browser, without sending the private key back to its server. This is important to protect your private key from exposure to the network. You are generating the only copy of your private key. Make sure that you print and securely store multiple copies of your paper wallet before transferring any money to it. Loss of your wallet and private key will result in the loss of your bitcoin.

By clicking the Regenerate button, you can generate multiple paper wallets and store various amounts of bitcoin on each wallet. Each wallet is easily redeemable in full at Coinbase or with other Bitcoin wallet services.

Verifying your wallet's balance

After generating and printing multiple copies of your paper wallet, you're ready to transfer your funds. Coinbase will prompt you with an easy option to transfer the funds from your Coinbase wallet to your paper wallet:

Figure 3.3 - Transferring funds to your paper wallet

Figure 3.3 shows Coinbase's prompt to transfer your funds. It provides options to enter your amount in BTC or USD. Simply specify your amount and click Send. Note that Coinbase only keeps a copy of your public address. You can continue to send additional amounts to your paper wallet using the same public address.

The first time you work with paper wallets, it's advisable to send only small amounts of bitcoin, to learn and experiment with the process. Once you feel comfortable with creating and redeeming paper wallets, you can feel secure transferring larger amounts.

To verify that the funds have been moved to your paper wallet, we can use a blockchain explorer to check that the funds have been confirmed by the network. Blockchain explorers make all the transaction data from the Bitcoin network available for public review. We'll use a service called Blockchain.info to verify our paper wallet. Simply open www.blockchain.info in your browser and enter the public address from your paper wallet in the search box. If found, Blockchain.info will display a list of the transaction activity on that address:

Figure 3.4 - Blockchain.info showing transaction activity

Shown in figure 3.4 is the transaction activity for the address starting with 16p9Lt. You can quickly see the total bitcoin received and the current balance. Under the Transactions section, you can find the details of the transactions recorded by the network. Also listed are the public addresses that were combined by the wallet software, as well as the change address used to complete the transfer. Note that at least six confirmations are required before the transaction is considered confirmed.

Importing versus sweeping

When importing your private key, the wallet software will simply add the key to its list of private keys. Your Bitcoin wallet will manage your list of private keys. When sending money, it will combine the balances from multiple addresses to make the transfer. Any remaining amount will be sent back to a change address. The wallet software will automatically manage your change addresses. Some Bitcoin wallets offer the ability to sweep your private key. This involves a second step.
After importing your private key, the wallet software will make a transaction to move the full balance of your funds to a new address. This process will empty your paper wallet completely. The step to transfer the funds may require additional time to allow the network to confirm your transaction; this could take up to one hour. In addition to the confirmation time, a small miner's fee may be applied, for example in the amount of 0.0001 BTC.

If you are certain that you are the only one with access to the private key, it is safe to use the import feature. However, if you believe someone else may have access to the private key, sweeping is highly recommended. Listed next are some common Bitcoin wallets that support importing a private key:

- Coinbase (https://www.coinbase.com/): provides direct integration between your online wallet and your paper wallet. Sweeping: No.
- Electrum (https://electrum.org): provides the ability to import and see your private key for easy access to your wallet's funds. Sweeping: Yes.
- Armory (https://bitcoinarmory.com/): provides the ability to import your private key or "sweep" the entire balance. Sweeping: Yes.
- Multibit (https://multibit.org/): directly imports your private key. It may use a built-in address generator for change addresses. Sweeping: No.

Table 1 - Wallets that support importing private keys

Importing your paper wallet

To import your wallet, simply log into your Coinbase account. Click on Tools from the left-hand side menu, followed by Paper Wallet from the top menu. Then, click on the Import a paper wallet button. You will be prompted to enter the private key of your paper wallet, as shown in figure 3.5:

Figure 3.5 - Coinbase importing from a paper wallet

Simply enter the private key from your paper wallet. Coinbase will validate the key and ask you to confirm your import. If accepted, Coinbase will import your key and sweep your balance. The full amount will be transferred to your Bitcoin wallet and become available after six confirmations.

Paper wallet guidelines

Paper wallets display your public and private keys in plain text, so make sure that you keep these documents secure. While you can send funds to your wallet multiple times, it is highly recommended that you spend your balance only once and in full. Before sending large amounts of bitcoin to a paper wallet, make sure you test your ability to generate and import a paper wallet with small amounts of bitcoin. When you're comfortable with the process, you can rely on paper wallets for larger amounts.

As paper is easily destroyed or ruined, make sure that you keep multiple copies of your paper wallet in different locations, and make sure each location is secure from unwanted access.

Be careful with online wallet generators: a malicious site operator can obtain the private key from your web browser, so only use trusted paper wallet generators. You can test an online paper wallet generator by opening the page in your browser while online and then disconnecting your computer from the network. You should be able to generate your paper wallet while completely disconnected from the network, ensuring that your private keys are never sent back to the network. Coinbase is an exception in that it sends only the public address back to the server for reference. This public address is saved to make it easy to transfer funds to your paper wallet. The private key is never saved by Coinbase when generating a paper wallet.
Paper wallet services

In addition to the services mentioned, there are other services that make paper wallets easy to generate and print. Listed next in Table 2 are just a few:

- BitAddress (bitaddress.org): offers the ability to generate single wallets, bulk wallets, brainwallets, and more.
- Bitcoin Paper Wallet (bitcoinpaperwallet.com): offers a nice, stylish design and easy-to-use features. Users can purchase holographic stickers for securing the paper wallets.
- Wallet Generator (walletgenerator.net): offers printable paper wallets that fold nicely to conceal the private keys.

Table 2 - Services for generating paper wallets and brainwallets

Brainwallets

Storing our private keys offline by using a paper wallet is one way we can protect our coins from attacks on the network. Yet, having a physical copy of our keys is similar to holding a gold bar: it's still vulnerable to theft if the attacker can physically obtain the wallet. One way to protect bitcoins from online or offline theft is to have the codes recallable by memory. As holding long random private keys in memory is quite difficult, even for the best of minds, we'll have to use another method to generate our private keys.

Creating a brainwallet

A brainwallet is a way to create one or more private keys from a long phrase of random words. From the phrase, called a passphrase, we're able to generate a private key, along with its public address, to store bitcoin. We can create any passphrase we'd like. The longer the phrase and the more random the characters, the more secure it will be. Brainwallet phrases should contain at least 12 words, and it is very important that the phrase never comes from anything published, such as a book or a song. Hackers actively search for possible brainwallets by performing brute-force attacks on commonly published phrases. Here is an example of a brainwallet passphrase:

gently continue prepare history bowl shy dog accident forgive strain dirt consume

Note that the phrase is composed of 12 seemingly random words. One could use an easy-to-remember sentence rather than 12 words. Regardless of whether you record your passphrase on paper or memorize it, the idea is to use a passphrase that's easy to recall and type, yet difficult to crack. Don't let this happen to you:

"Just lost 4 BTC out of a hacked brain wallet. The pass phrase was a line from an obscure poem in Afrikaans. Somebody out there has a really comprehensive dictionary attack program running." - Reddit thread (http://redd.it/1ptuf3)

Unfortunately, this user lost their bitcoin because they chose a published line from a poem. Make sure that you choose a passphrase that is composed of multiple components of non-published text. Sadly, although warned, some users still resort to simple phrases that are easy to crack. Simple passwords such as 123456, password1, and iloveyou are still commonly used for e-mail and login accounts, and are routinely cracked. Do not use simple passwords for your brainwallet passphrase. Make sure that you use at least 12 words with additional characters and numbers.
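To see why 12 random words is a reasonable floor, here is a small back-of-the-envelope calculation (an illustration added here, not from the book). Assuming each word is drawn uniformly from a 2,048-word list, every word contributes log2(2048) = 11 bits of entropy, so 12 words give roughly 132 bits, which is far beyond practical brute force; a line copied from a published poem, by contrast, is effectively a single dictionary entry.

import math

WORDLIST_SIZE = 2048                       # assumed size of the word list
bits_per_word = math.log2(WORDLIST_SIZE)   # 11 bits of entropy per random word

for words in (4, 8, 12):
    print("%2d random words ~ %d bits of entropy" % (words, words * bits_per_word))

# A phrase taken from a book or song has far less entropy, because an
# attacker can enumerate published text instead of guessing random words.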
Using the preceding passphrase, we can generate our private key and public address using one of the many tools available online. We'll use an online service called BitAddress to generate the actual brainwallet from the passphrase. Simply open www.bitaddress.org in your browser. At first, BitAddress will ask you to move your mouse cursor around to collect enough random points to seed its random number generator. This process could take a minute or two. Once opened, select the option Brain Wallet from the top menu. In the form presented, enter the passphrase and then enter it again to confirm. Click on View to see your private key and public address. For the example shown in figure 3.6, we'll use the preceding passphrase:

Figure 3.6 - BitAddress's brainwallet feature

From the page, you can easily copy and paste the public address and use it for receiving bitcoin. Later, when you're ready to spend the coins, enter the same exact passphrase to generate the same private key and public address. Referring to our Coinbase example from earlier in the article, we can then import the private key into our wallet.

Increasing brainwallet security

As an early attempt to give people a way to "memorize" their Bitcoin wallet, brainwallets have become a target for hackers. Some users have chosen phrases or sentences from common books as their brainwallet. Unfortunately, hackers with access to large amounts of computing power were able to search for these phrases and crack some brainwallets. To improve the security of brainwallets, other methods have been developed that make them more secure. One service, called brainwallet.io, executes a time-intensive cryptographic function over the brainwallet phrase to create a seed that is very difficult to crack. It's important to know that the passphrases used with BitAddress are not compatible with brainwallet.io. To use brainwallet.io to generate a more secure brainwallet, open http://brainwallet.io:

Figure 3.7 - brainwallet.io, a more secure brainwallet generator

Brainwallet.io needs a sufficient amount of entropy to generate a private key that is difficult to reproduce. Entropy, in computer science, describes data in terms of its predictability: when data has high entropy, it is difficult to reproduce from known sources. When generating private keys, it's very important to use data that has high entropy. For generating brainwallet keys, we need data with high entropy, yet it should be easy for us to duplicate. To meet this requirement, brainwallet.io accepts your random passphrase, or can generate one from a list of random words. Additionally, it can use data from a file of your choice. Either way, the more entropy given, the stronger your passphrase will be. If you specify a passphrase, choosing at least 12 words is recommended.

Next, brainwallet.io prompts you for a salt, available in several forms: login info, personal info, or generic. Salts are used to add additional entropy to the generation of your private key; their purpose is to prevent standard dictionary attacks against your passphrase. While using brainwallet.io, this information is never sent to the server.

When ready, click the generate button, and the page will begin computing a scrypt function over your passphrase. Scrypt is a cryptographic function that requires significant computing time to execute; due to the time required for each pass, it makes brute-force attacks very difficult. brainwallet.io makes many thousands of passes to ensure that a strong seed is generated for the private key. Once finished, your new private key and public address, along with their QR codes, will be displayed for easy printing.

As an alternative, WarpWallet is also available at https://keybase.io/warp. WarpWallet also computes a private key based on many thousands of scrypt passes over a passphrase and salt combination. Remember that brainwallet.io passphrases are not compatible with WarpWallet passphrases.
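The key-stretching idea behind these services can be sketched in a few lines of Python using the standard library's hashlib.scrypt. This is only an illustration of the concept: the salt and cost parameters below are assumptions for the sketch, and the result does not reproduce the exact seeds generated by brainwallet.io or WarpWallet.

import hashlib

passphrase = "gently continue prepare history bowl shy dog accident forgive strain dirt consume"
salt = "alice@example.com"   # hypothetical salt, for example an e-mail address

# scrypt is deliberately slow and memory-hard, which makes large-scale
# dictionary attacks far more expensive than a single fast hash.
seed = hashlib.scrypt(
    passphrase.encode("utf-8"),
    salt=salt.encode("utf-8"),
    n=2**14, r=8, p=1,        # assumed cost parameters for this sketch
    dklen=32,                 # 32 bytes = a 256-bit seed for a private key
)

print("candidate private key seed:", seed.hex())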
Deterministic wallets

We have introduced brainwallets that yield one private key and public address. They are designed for one-time use and are practical for holding a fixed amount of bitcoin for a period of time. Yet, if we're making lots of transactions, it would be convenient to be able to generate unlimited public addresses, so that we can use them to receive bitcoin from different transactions or to generate change addresses. A Type 1 deterministic wallet is a simple wallet schema based on a passphrase with an index appended. By incrementing the index, an unlimited number of addresses can be created. Each new address is indexed so that its private key can be quickly retrieved.

Creating a deterministic wallet

To create a deterministic wallet, simply choose a strong passphrase, as previously described, and then append a number to represent an individual private key and public address. It's practical to do this with a spreadsheet so that you can keep a list of public addresses on file. Then, when you want to spend the bitcoin, you simply regenerate the private key using the index. Let's walk through an example. First, we choose the passphrase:

"dress retreat save scratch decide simple army piece scent ocean hand become"

Then, we append an index, a sequential number, to the passphrase:

"dress retreat save scratch decide simple army piece scent ocean hand become0"
"dress retreat save scratch decide simple army piece scent ocean hand become1"
"dress retreat save scratch decide simple army piece scent ocean hand become2"
"dress retreat save scratch decide simple army piece scent ocean hand become3"
"dress retreat save scratch decide simple army piece scent ocean hand become4"

Then, we take each passphrase, with its corresponding index, and run it through brainwallet.io, or any other brainwallet service, to generate the public address. Using a table or a spreadsheet, we can pre-generate a list of public addresses to receive bitcoin. Additionally, we can add a balance column to help track our money:

Index  Public Address                      Balance
0      1Bc2WZ2tiodYwYZCXRRrvzivKmrGKg2Ub9  0.00
1      1PXRtWnNYTXKQqgcxPDpXEvEDpkPKvKB82  0.00
2      1KdRGNADn7ipGdKb8VNcsk4exrHZZ7FuF2  0.00
3      1DNfd491t3ABLzFkYNRv8BWh8suJC9k6n2  0.00
4      17pZHju3KL4vVd2KRDDcoRdCs2RjyahXwt  0.00

Table 3 - Using a spreadsheet to track deterministic wallet addresses

Spending from a deterministic wallet

When we have money available in our wallet to spend, we can simply regenerate the private key for the index matching the public address. For example, let's say we have received 2 BTC on the address starting with 1KdRGN in the preceding table. Since we know it belongs to index #2, we can reopen the brainwallet from the passphrase:

"dress retreat save scratch decide simple army piece scent ocean hand become2"

Using brainwallet.io as our brainwallet service, we quickly regenerate the original private key and public address:

Figure 3.8 - Private key re-generated from a deterministic wallet

Finally, we import the private key into our Bitcoin wallet, as described earlier in the article. If we don't want to keep the change in our online wallet, we can simply send the change back to the next available public address in our deterministic wallet.
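The indexing scheme is easy to mimic in Python. The sketch below is an illustration, not the book's code or brainwallet.io's exact algorithm: it appends an index to the base passphrase and stretches each indexed phrase into a 32-byte seed with scrypt (the salt and cost parameters are assumptions). A wallet tool would then turn each seed into a private key and public address, as in Table 3.

import hashlib

BASE_PASSPHRASE = ("dress retreat save scratch decide simple army piece "
                   "scent ocean hand become")

def indexed_seed(index):
    # Derive a per-index seed from the base passphrase plus an appended index.
    phrase = BASE_PASSPHRASE + str(index)          # e.g. "...become2" for index 2
    return hashlib.scrypt(phrase.encode("utf-8"),
                          salt=b"example-salt",    # assumed salt for this sketch
                          n=2**14, r=8, p=1, dklen=32)

# Pre-generate seeds for the first five indices, mirroring Table 3.
for i in range(5):
    print(i, indexed_seed(i).hex()[:16] + "...")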
Pre-generating public addresses with deterministic wallets can be useful in many situations. Perhaps you want to do business with a partner and receive 12 payments over the course of one year; you can simply generate the 12 addresses in advance and keep track of each payment using a spreadsheet. Another example could apply to an e-commerce site: if you'd like to receive payment for the goods or services being sold, you can pre-generate a long list of addresses. Storing only the public addresses on your website protects you from a malicious attack on your web server. While Type 1 deterministic wallets are very useful, we'll introduce a more advanced version, the Type 2 Hierarchical Deterministic wallet, next.

Type 2 Hierarchical Deterministic wallets

Type 2 Hierarchical Deterministic (HD) wallets function similarly to Type 1 deterministic wallets, in that they are able to generate an unlimited number of private keys from a single passphrase, but they offer more advanced features. HD wallets are used by desktop, mobile, and hardware wallets as a way of securing an unlimited number of keys with a single passphrase.

HD wallets are secured by a root seed. The root seed, generated from entropy, can be a number up to 64 bytes long. To make the root seed easier to save and recover, a phrase consisting of a list of mnemonic code words is rendered. The following is an example of a root seed:

01bd4085622ab35e0cd934adbdcce6ca

To render the mnemonic code words, the root seed plus its checksum is divided into groups of 11 bits. Each group of bits represents an index between 0 and 2047, and each index is mapped to a list of 2,048 words. For each group of bits, one word is listed, which in this example generates the following phrase:

essence forehead possess embarrass giggle spirit further understand fade appreciate angel suffocate

BIP-0039 details the specification for creating mnemonic code words to generate a deterministic key, and is available at https://en.bitcoin.it/wiki/BIP_0039.
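As an added illustration of that encoding step (not from the book), the sketch below follows the recipe just described: it appends a checksum to the 16-byte root seed and splits the result into 11-bit groups. It prints the word indices rather than the words themselves, since the official 2,048-word BIP-0039 list is not reproduced here, and it is not guaranteed to reproduce the example phrase above.

import hashlib

# 128 bits of entropy: the example root seed from the text
entropy = bytes.fromhex("01bd4085622ab35e0cd934adbdcce6ca")

# Checksum: the first len(entropy)*8/32 bits of SHA-256(entropy), i.e. 4 bits here
checksum_bits = len(entropy) * 8 // 32
first_hash_byte = hashlib.sha256(entropy).digest()[0]

# Concatenate the entropy bits and the checksum bits into one bit string
bits = bin(int.from_bytes(entropy, "big"))[2:].zfill(len(entropy) * 8)
bits += bin(first_hash_byte)[2:].zfill(8)[:checksum_bits]

# Split into 11-bit groups; each group is an index between 0 and 2047
indices = [int(bits[i:i + 11], 2) for i in range(0, len(bits), 11)]
print(indices)   # a real implementation maps each index to the 2,048-word list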
In an HD wallet, the root seed is used to generate a master private key and a master chain code. The master private key is used to generate a master public key, as with normal Bitcoin private and public keys. These keys are then used to generate additional child keys in a tree-like structure. Figure 3.9 illustrates the process of creating the master keys and chain code from a root seed:

Figure 3.9 - Generating an HD wallet's root seed, code words, and master keys

Using a child key derivation function, child keys can be generated from the master or parent keys. An index is then combined with the keys and the chain code to generate and organize parent/child relationships. From each parent, two billion child keys can be created, and from each child's private key, the public key and public address can be created. In addition to generating a private key and a public address, each child can be used as a parent to generate its own list of child keys. This allows the derived keys to be organized in a tree-like structure; hierarchically, an unlimited number of keys can be created in this way.

Figure 3.10 - The relationship between master seed, parent/child chains, and public addresses

HD wallets are very practical, as thousands of keys and public addresses can be managed from one seed, and the entire tree of keys can be backed up and restored simply from the passphrase. HD wallets can also be organized and shared in various useful ways. For example, in a company or organization, a parent key and chain code could be issued to generate a list of keys for each department; each department would then have the ability to render its own set of private/public keys. Alternatively, a public parent key can be given out to generate child public keys, but not the private keys. This can be useful in the case of an audit: the organization may want the auditor to perform a balance sheet on a set of public keys, but without access to the private keys for spending. Another use case for generating public keys from a parent public key is e-commerce. As mentioned previously, you may have a website and would like to generate an unlimited number of public addresses; by generating a public parent key for the website, the shopping cart can create new public addresses in real time.

HD wallets are very useful for Bitcoin wallet applications. Next, we'll look at a software package called Electrum for setting up an HD wallet to protect your bitcoins.

Installing an HD wallet

HD wallets are very convenient and practical. To show how we can manage an unlimited number of addresses with a single passphrase, we'll install an HD wallet software package called Electrum. Electrum is an easy-to-use desktop wallet that runs on Windows, OS X, and Linux. It implements a secure HD wallet that is protected by a 12-word passphrase. It is able to synchronize with the blockchain, using servers that index all the Bitcoin transactions, to provide quick updates to your balances.

Electrum has some nice features to help protect your bitcoins. It supports multi-signature transactions, that is, transactions that require more than one key to spend coins. Multi-signature transactions are useful when you want to share the responsibility for a Bitcoin address between two or more parties, or to add an extra layer of protection to your bitcoins. Additionally, Electrum has the ability to create a watching-only version of your wallet. This allows you to give access to your public keys to another party without releasing the private keys, which can be very useful for auditing or accounting purposes.

To install Electrum, simply open the URL https://electrum.org/#download and follow the instructions for your operating system. On first installation, Electrum will create a new wallet for you, identified by a passphrase. Make sure that you protect this passphrase offline!

Figure 3.11 - Recording the passphrase from an Electrum wallet

Electrum will proceed by asking you to re-enter the passphrase to confirm that you have it recorded. Finally, it will ask you for a password. This password is used to encrypt your wallet's seed and any private keys imported into your wallet on disk. You will need this password any time you send bitcoins from your account.

Bitcoins in cold storage

If you are responsible for a large amount of bitcoin that could be exposed to online hacking or hardware failure, it is important to minimize your risk. A common schema for minimizing the risk is to split your funds between a hot wallet and cold storage. A hot wallet refers to your online wallet used for everyday deposits and withdrawals. Based on your customers' needs, you can store the minimum needed to cover the daily business; for example, Coinbase claims to hold approximately five percent of the total bitcoins on deposit in their hot wallet. The remaining amount is stored in cold storage.

Cold storage is an offline wallet for bitcoin. Addresses are generated, typically from a deterministic wallet, with their passphrase and private keys stored offline. Periodically, depending on day-to-day needs, bitcoins are transferred to and from cold storage. Additionally, bitcoins may be moved to deep cold storage.
These bitcoins are generally more difficult to retrieve. While a cold storage transfer may easily be done to cover the needs of the hot wallet, a deep cold storage schema may involve physically accessing the passphrase or private keys from a safe, a safety deposit box, or a bank vault. The reasoning is to slow down access as much as possible.

Cold storage with Electrum

We can use Electrum to create a hot wallet and a cold storage wallet. As an example, let's imagine a business owner who wants to accept bitcoin from his PC cash register. For security reasons, he may want to allow the cash register to generate new addresses to receive bitcoin, but not to spend them. Spending bitcoins from this wallet will be secured by a protected computer.

To start, create a normal Electrum wallet on the protected computer. Secure the passphrase and assign a strong password to the wallet. Then, from the menu, select Wallet | Master Public Keys. The key will be displayed as shown in figure 3.12. Copy this number and save it to a USB key.

Figure 3.12 - Your Electrum wallet's public master key

Your master public key can be used to generate new public keys, but without access to the private keys. As mentioned in the previous examples, this has many practical uses, as in our example with the cash register. Next, install Electrum on your cash register. On setup, or from File | New/Restore, choose Restore a wallet or import keys and the Standard wallet type:

Figure 3.13 - Setting up a cash register wallet with Electrum

On the next screen, Electrum will prompt you to enter your public master key. Once accepted, Electrum will generate your wallet from the public master key. When ready, your new wallet will be able to accept bitcoin without access to the private keys.

WARNING: If you import private keys into your Electrum wallet, they cannot be restored from your passphrase or public master key. They have not been generated by the root seed and exist independently in the wallet. If you import private keys, make sure to back up the wallet file after every import.

Verifying access to a private key

When working with public addresses, it may be important to prove that you have access to a private key. By using Bitcoin's cryptographic ability to sign a message, you can verify that you have access to the key without revealing it. This can be offered as proof from a trustee that they control the keys. Using Electrum's built-in message signing feature, we can use the private key in our wallet to sign a message. The message, combined with the digital signature and public address, can later be used to verify that it was signed with the original private key.

To begin, choose an address from your wallet. In Electrum, your addresses can be found under the Addresses tab. Next, right-click on an address and choose Sign/verify Message. A dialog box allowing you to sign a message will appear:

Figure 3.14 - Electrum's Sign/Verify Message features

As shown in figure 3.14, you can enter any message you like and sign it with the private key of the address shown. This process will produce a digital signature that can be shared with others to prove that you have access to the private key. To verify the signature on another computer, simply open Electrum and choose Sign/Verify Message from the Tools menu. You will be prompted with the same dialog as shown in figure 3.14. Copy and paste the message, the address, and the digital signature, and click Verify. The results will be displayed.
By requesting a signed message from someone, you can verify that they do, in fact, have control of the private key. This is useful for making sure that the trustee of a cold storage wallet has access to the private keys without releasing or sharing them. Another good use of message signing is to prove that someone controls some quantity of bitcoin: by signing a message that includes the public address holding the funds, one can see that the party is the owner of the funds. Finally, signing and verifying a message can be useful for testing your backups. You can test your private key and public address completely offline, without actually sending bitcoin to the address.

Good housekeeping with Bitcoin

To ensure the safekeeping of your bitcoin, it's important to protect your private keys by following a short list of best practices:

- Never store your private keys unencrypted on your hard drive or in the cloud: unencrypted wallets can easily be stolen by hackers, viruses, or malware. Make sure your keys are always encrypted before being saved to disk.
- Never send money to a Bitcoin address without a backup of the private keys: it's really important that you have a backup of your private key before sending money to its public address. There are stories of early adopters who have lost significant amounts of bitcoin because of hardware failures or inadvertent mistakes.
- Always test your backup process by repeating the recovery steps: when setting up a backup plan, make sure to test it by backing up your keys, sending a small amount to the address, and recovering the amount from the backup. Message signing and verification is also a useful way to test your private key backups offline.
- Ensure that you have a secure location for your paper wallets: unauthorized access to your paper wallets can result in the loss of your bitcoin. Make sure that you keep your wallets in a secure safe, in a bank safety deposit box, or in a vault. It's advisable to keep copies of your wallets in multiple locations.
- Keep multiple copies of your paper wallets: paper can easily be damaged by water or direct sunlight. Make sure that you keep multiple copies of your paper wallets in plastic bags, protected from direct light with a cover.
- Consider writing a testament or will for your Bitcoin wallets: the testament should name who has access to the bitcoin and how it will be distributed. Make sure that you include instructions on how to recover the coins.
- Never forget your wallet's password or passphrase: this sounds obvious, but it must be emphasized. There is no way to recover a lost password or passphrase.
- Always use a strong passphrase: a strong passphrase should be long and difficult to guess; it should not come from a famous publication (literature, holy books, and so on); it should not contain personal information; it should be easy to remember and type accurately; and it should not be reused between sites and applications.

Summary

So far, we've covered the basics of how to get started with Bitcoin. We've provided a tutorial for setting up an online wallet and for buying bitcoin in 15 minutes. We've covered online exchanges and marketplaces, and how to safely store and protect your bitcoin.

Further resources on this subject:

- Bitcoins – Pools and Mining [article]
- Going Viral [article]
- Introduction to the Nmap Scripting Engine [article]

Rotation Forest - A Classifier Ensemble Based on Feature Extraction

Packt
28 Oct 2015
16 min read
In this article by Gopi Subramanian, the author of the book Python Data Science Cookbook, you will learn about rotation forest, a classifier ensemble based on feature extraction.

Rotation Forest

Bagging methods based on decision tree algorithms are very popular among the data science community. The claim to fame of most of these methods is that they need almost zero data preparation compared to other methods, can obtain very good results, and can be handed to software engineers as a black box of tools. By design, bagging lends itself nicely to parallelization, so these methods can easily be applied to very large datasets in a cluster environment.

Decision tree algorithms split the input data into various regions at each level of the tree; thus, they perform implicit feature selection. Feature selection is one of the most important tasks in building a good model, so by providing implicit feature selection, trees are at an advantageous position compared to other techniques. Hence, bagging with decision trees comes with this advantage. Almost no data preparation is needed for decision trees. For example, consider the scaling of attributes: attribute scaling has no impact on the structure of the trees. Missing values also do not affect decision trees, and the effect of outliers on a decision tree is very minimal. We don't have to do explicit feature transformations to accommodate feature interactions.

One of the major complaints against tree-based methods is the difficulty of pruning the trees to avoid overfitting. Big trees tend to also fit the noise present in the underlying data and hence lead to low bias and high variance. However, when we grow a lot of trees and the final prediction is an average of the output of all the trees in the ensemble, we avoid these problems.

In this article, we will see a powerful tree-based ensemble method called rotation forest. A typical random forest requires a large number of trees in its ensemble in order to achieve good performance. Rotation forest can achieve similar or better performance with fewer trees. Additionally, the authors of this algorithm claim that the underlying estimator can be something other than a tree, so it is projected as a new framework for building ensembles, similar to gradient boosting.

The algorithm

Random forest and bagging give impressive results with very large ensembles; having a large number of estimators improves the accuracy of these methods. On the contrary, rotation forest is designed to work with a smaller number of estimators. Let's write down the steps involved in building a rotation forest. The number of trees required in the forest is typically specified by the user. Let T be the number of trees to be built. We iterate from one through T, that is, we build T trees. For each tree, perform the following steps:

1. Split the attributes in the training set into K nonoverlapping subsets of equal size, giving K smaller datasets, each with a subset of the attributes.
2. For each of the K subsets, proceed as follows:
   - Bootstrap 75% of the data from the subset and use the bootstrapped sample for the further steps.
   - Run a principal component analysis (PCA) on the ith subset and retain all the principal components. For every feature j in the ith subset, we have a principal component; let's denote it as aij, the principal component for the jth attribute in the ith subset.
   - Store the principal components for the subset.
3. Create a rotation matrix of size n x n, where n is the total number of attributes. Arrange the principal components in the matrix such that the components match the positions of the features in the original training dataset.
4. Project the training dataset onto the rotation matrix using matrix multiplication.
5. Build a decision tree with the projected dataset.
6. Store the tree and the rotation matrix.

A quick note about PCA: PCA is an unsupervised method. In multivariate problems, PCA is used to reduce the dimension of the data with minimal information loss or, in other words, to retain the maximum variation in the data. In PCA, we find the directions in the data with the most variation, that is, the eigenvectors corresponding to the largest eigenvalues of the covariance matrix, and project the data onto these directions. With a dataset (n x m) of n instances and m dimensions, PCA projects it onto a smaller subspace (n x d), where d << m. A point to note is that PCA is computationally very expensive.
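As a quick illustration of that note (an added example, not part of the recipe), the following snippet projects a small random dataset from m = 10 dimensions down to d = 3 with scikit-learn's PCA. Rotation forest differs in that it keeps all the components for each feature subset and uses them to rotate, rather than reduce, the data.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(7)
X = rng.rand(100, 10)          # n = 100 instances, m = 10 dimensions

pca = PCA(n_components=3)      # keep d = 3 directions of maximum variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                       # (100, 3)
print(pca.explained_variance_ratio_.sum())   # fraction of variance retained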
Programming rotation forest in Python

Now let's write Python code to implement rotation forest and test it on a classification problem. We will generate a classification dataset to demonstrate rotation forest. To our knowledge, there is no Python implementation available for rotation forest, so we will leverage scikit-learn's implementation of the decision tree classifier and use the train_test_split method for the bootstrapping. Refer to the following link to learn more about scikit-learn: http://scikit-learn.org/stable/

First, we write the necessary code to implement rotation forest and apply it on a classification problem. We start by loading all the necessary libraries. We leverage the make_classification method from the sklearn.datasets module to generate the training data, and follow it with a method to select a random subset of attributes, called get_random_subset:

from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.cross_validation import train_test_split
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier
import numpy as np

def get_data():
    """
    Make a sample classification dataset
    Returns : Independent variables x, dependent variable y
    """
    no_features = 50
    redundant_features = int(0.1*no_features)
    informative_features = int(0.6*no_features)
    repeated_features = int(0.1*no_features)
    x,y = make_classification(n_samples=500,n_features=no_features,flip_y=0.03,
            n_informative = informative_features, n_redundant = redundant_features,
            n_repeated = repeated_features,random_state=7)
    return x,y

def get_random_subset(iterable,k):
    subsets = []
    iteration = 0
    np.random.shuffle(iterable)
    subset = 0
    limit = len(iterable)/k
    while iteration < limit:
        if k <= len(iterable):
            subset = k
        else:
            subset = len(iterable)
        subsets.append(iterable[-subset:])
        del iterable[-subset:]
        iteration+=1
    return subsets

We then write the build_rotationtree_model function, where we build fully grown trees, and proceed to evaluate the forest's performance using the model_worth function:

def build_rotationtree_model(x_train,y_train,d,k):
    models = []
    r_matrices = []
    feature_subsets = []
    for i in range(d):
        x,_,_,_ = train_test_split(x_train,y_train,test_size=0.3,random_state=7)
        # Features ids
        feature_index = range(x.shape[1])
        # Get subsets of features
        random_k_subset = get_random_subset(feature_index,k)
        feature_subsets.append(random_k_subset)
        # Rotation matrix
        R_matrix = np.zeros((x.shape[1],x.shape[1]),dtype=float)
        for each_subset in random_k_subset:
            pca = PCA()
            x_subset = x[:,each_subset]
            pca.fit(x_subset)
            for ii in range(0,len(pca.components_)):
                for jj in range(0,len(pca.components_)):
                    R_matrix[each_subset[ii],each_subset[jj]] = pca.components_[ii,jj]
        x_transformed = x_train.dot(R_matrix)
        model = DecisionTreeClassifier()
        model.fit(x_transformed,y_train)
        models.append(model)
        r_matrices.append(R_matrix)
    return models,r_matrices,feature_subsets

def model_worth(models,r_matrices,x,y):
    predicted_ys = []
    for i,model in enumerate(models):
        x_mod = x.dot(r_matrices[i])
        predicted_y = model.predict(x_mod)
        predicted_ys.append(predicted_y)
    predicted_matrix = np.asmatrix(predicted_ys)
    final_prediction = []
    for i in range(len(y)):
        pred_from_all_models = np.ravel(predicted_matrix[:,i])
        non_zero_pred = np.nonzero(pred_from_all_models)[0]
        is_one = len(non_zero_pred) > len(models)/2
        final_prediction.append(is_one)
    print classification_report(y, final_prediction)

Finally, we write a main function used to invoke the functions that we have defined:

if __name__ == "__main__":
    x,y = get_data()
    # plot_data(x,y)
    # Divide the data into Train, dev and test
    x_train,x_test_all,y_train,y_test_all = train_test_split(x,y,test_size = 0.3,random_state=9)
    x_dev,x_test,y_dev,y_test = train_test_split(x_test_all,y_test_all,test_size=0.3,random_state=9)
    # Build a bag of models
    models,r_matrices,features = build_rotationtree_model(x_train,y_train,25,5)
    model_worth(models,r_matrices,x_train,y_train)
    model_worth(models,r_matrices,x_dev,y_dev)

Walking through our code

Let's start with our main function. We invoke get_data to get our predictor attributes, x, and our response attribute, y.
In get_data, we will leverage the make_classification method to generate our training data for our recipe:

def get_data():
    """
    Make a sample classification dataset
    Returns : Independent variable y, dependent variable x
    """
    no_features = 30
    redundant_features = int(0.1*no_features)
    informative_features = int(0.6*no_features)
    repeated_features = int(0.1*no_features)
    x,y = make_classification(n_samples=500, n_features=no_features, flip_y=0.03,
            n_informative=informative_features, n_redundant=redundant_features,
            n_repeated=repeated_features, random_state=7)
    return x,y

Let's look at the parameters passed to the make_classification method. The first parameter is the number of instances required; in this case, we say we need 500 instances. The second parameter is the number of attributes required per instance. We say that we need 30. The third parameter, flip_y, randomly interchanges 3% of the instances. This is done to introduce some noise into our data. The next parameter specifies how many of these 30 features should be informative enough to be used in our classification. We have specified that 60% of our features, that is, 18 out of 30, should be informative. The next parameter is about redundant features. These are generated as a linear combination of the informative features in order to introduce correlation among the features. Finally, the repeated features are duplicate features, which are drawn randomly from both the informative and redundant features. Let's split the data into training and testing sets using train_test_split. We will reserve 30% of our data for testing:

# Divide the data into Train, dev and test
x_train,x_test_all,y_train,y_test_all = train_test_split(x,y,test_size=0.3,random_state=9)

Once again, we will leverage train_test_split to split our test data into dev and test sets:

x_dev,x_test,y_dev,y_test = train_test_split(x_test_all,y_test_all,test_size=0.3,random_state=9)

With the data divided to build, evaluate, and test the model, we will proceed to build our models:

models,r_matrices,features = build_rotationtree_model(x_train,y_train,25,5)

We will invoke the build_rotationtree_model function to build our rotation forest. We will pass our training data (the predictor x_train and the response variable y_train), the total number of trees to be built (25 in this case), and finally the size of each feature subset (5 in this case). Let's jump into this function:

models = []
r_matrices = []
feature_subsets = []

We will begin by declaring three lists to store each decision tree, the rotation matrix for that tree, and the subset of features used in that iteration. We will proceed to build each tree in our ensemble. As a first order of business, we will bootstrap to retain only 70% of the data (test_size is 0.3):

x,_,_,_ = train_test_split(x_train,y_train,test_size=0.3,random_state=7)

We will leverage the train_test_split function from scikit-learn for the bootstrapping. We will then decide the feature subsets:

# Features ids
feature_index = range(x.shape[1])
# Get subsets of features
random_k_subset = get_random_subset(feature_index,k)
feature_subsets.append(random_k_subset)

The get_random_subset function takes the list of feature indices and the subset size, k, as parameters and returns a list of non-overlapping subsets of k features each. In this function, we will shuffle the feature index. 
The feature index is an array of numbers starting from zero and going up to one less than the number of features in our training set:

np.random.shuffle(iterable)

Let's say that we have ten features and our k value is five, indicating that we need non-overlapping subsets of five feature indices each, which takes two iterations. We will store the number of iterations needed in the limit variable:

limit = len(iterable)/k
while iteration < limit:
    if k <= len(iterable):
        subset = k
    else:
        subset = len(iterable)
    iteration+=1

If the required subset size, k, is less than or equal to the number of attributes left in the iterable, we take k entries; otherwise, we take whatever remains. As we shuffled the iterable, a different set of features ends up in each subset:

subsets.append(iterable[-subset:])

On selecting a subset (the last k entries of the shuffled iterable), we will remove it from the iterable, as we need non-overlapping sets:

del iterable[-subset:]

With all the subsets ready, we will declare our rotation matrix:

# Rotation matrix
R_matrix = np.zeros((x.shape[1],x.shape[1]),dtype=float)

As you can see, our rotation matrix is of size n x n, where n is the number of attributes in our dataset. You can see that we have used the shape attribute to declare this matrix filled with zeros:

for each_subset in random_k_subset:
    pca = PCA()
    x_subset = x[:,each_subset]
    pca.fit(x_subset)

For each of the subsets of data, having only k features, we will proceed to do a principal component analysis. We will fill our rotation matrix with the component values:

for ii in range(0,len(pca.components_)):
    for jj in range(0,len(pca.components_)):
        R_matrix[each_subset[ii],each_subset[jj]] = pca.components_[ii,jj]

For example, let's say that we have three attributes in our subset out of a total of six attributes. For illustration, let's say that our subsets are as follows: 2,4,6 and 1,3,5. Our rotation matrix, R, is of size 6 x 6. Assume that we want to fill the rotation matrix for the first subset of features. We will have three principal components, one each for 2, 4, and 6, of size 1 x 3. The output of PCA from scikit-learn is a matrix of size components x features. We will go through each component value in the for loop. At the first run, our feature of interest is two, and the cell (0,0) in the component matrix output from PCA gives the value of the contribution of feature two to component one. We have to find the right place in the rotation matrix for this value. We will use the indices from the component matrix, ii and jj, together with the subset list to get to the right place in the rotation matrix:

R_matrix[each_subset[ii],each_subset[jj]] = pca.components_[ii,jj]

Here, each_subset[0] and each_subset[0] will put us in cell (2,2) in the rotation matrix. As we go through the loop, the next component value in cell (0,1) in the component matrix will be placed in cell (2,4) in the rotation matrix and the last one in cell (2,6) in the rotation matrix. This is done for all the attributes in the first subset. Let's go to the second subset; here the first attribute is one. The cell (0,0) of the component matrix corresponds to the cell (1,1) in the rotation matrix. Proceeding this way, you will notice that the attribute component values are arranged in the same order as the attributes themselves. 
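To make the placement concrete, here is a small standalone sketch (with made-up 0-based subsets and random data, purely for illustration and not part of the recipe) that fills a rotation matrix for two feature subsets and prints which cells end up non-zero:

import numpy as np
from sklearn.decomposition import PCA

# Six attributes split into two illustrative, non-overlapping subsets (0-based indices)
subsets = [[1, 3, 5], [0, 2, 4]]
x = np.random.RandomState(1).randn(50, 6)

R_matrix = np.zeros((6, 6))
for each_subset in subsets:
    pca = PCA()
    pca.fit(x[:, each_subset])
    for ii in range(len(pca.components_)):
        for jj in range(len(pca.components_)):
            R_matrix[each_subset[ii], each_subset[jj]] = pca.components_[ii, jj]

# The non-zero entries sit only in the rows and columns belonging to each subset,
# giving a block structure aligned with the original attribute positions.
print(np.nonzero(R_matrix))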
With our rotation matrix ready, let's project our input onto the rotation matrix:

x_transformed = x_train.dot(R_matrix)

It's time now to fit our decision tree:

model = DecisionTreeClassifier()
model.fit(x_transformed,y_train)

Finally, we will store our models and the corresponding rotation matrices:

models.append(model)
r_matrices.append(R_matrix)

With our model built, let's proceed to see how good our model is on both the train and dev datasets using the model_worth function:

model_worth(models,r_matrices,x_train,y_train)
model_worth(models,r_matrices,x_dev,y_dev)

Let's see our model_worth function:

for i,model in enumerate(models):
    x_mod = x.dot(r_matrices[i])
    predicted_y = model.predict(x_mod)
    predicted_ys.append(predicted_y)

In this function, we perform prediction using each of the trees that we built. However, before doing the prediction, we will project our input using the rotation matrix. We will store all our prediction output in a list called predicted_ys. Let's say that we have 100 instances to predict and ten models in our ensemble; for each instance, we have ten predictions. We will store these as a matrix for convenience:

predicted_matrix = np.asmatrix(predicted_ys)

Now, we will proceed to give a final classification for each of our input records:

final_prediction = []
for i in range(len(y)):
    pred_from_all_models = np.ravel(predicted_matrix[:,i])
    non_zero_pred = np.nonzero(pred_from_all_models)[0]
    is_one = len(non_zero_pred) > len(models)/2
    final_prediction.append(is_one)

We will store our final prediction in a list called final_prediction. We will go through each of the predictions for our instance. Let's say that we are at the first instance (i=0 in our for loop); pred_from_all_models stores the output from all the trees in our model. It's an array of zeros and ones indicating the class each model assigned to this instance. We will make another array out of it, non_zero_pred, which has only those entries from the parent array that are non-zero. Finally, if the length of this non-zero array is greater than half the number of models that we have, we say that our final prediction is one for the instance of interest. What we have accomplished here is a classic majority-voting scheme. Let's look at how good our models are now by calling a classification report:

print classification_report(y, final_prediction)

The classification reports printed for the training and dev datasets (shown as screenshots in the original article) indicate how well the ensemble performs on seen and unseen data. References More information about rotation forest can be obtained from the following paper: Rotation Forest: A New Classifier Ensemble Method, by Juan J. Rodríguez, Ludmila I. Kuncheva, and Carlos J. Alonso. The paper also claims that when rotation forest was compared to bagging, AdaBoost, and random forest on 33 datasets, rotation forest outperformed all the other three algorithms. Similar to gradient boosting, the authors of the paper claim that rotation forest is an overall framework and the underlying base learner does not necessarily have to be a decision tree. Work is in progress on testing other algorithms such as Naïve Bayes, neural networks, and others. Summary In this article, we learned about rotation forest and saw that ensemble methods built on decision trees remain very popular in the data science community. 
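To put the comparison mentioned in the paper into practice, a quick baseline such as the following can be trained on the same kind of synthetic data so that its classification report can be compared with the rotation forest output. This is our own illustrative sketch, not part of the original recipe; note that newer scikit-learn versions expose train_test_split under sklearn.model_selection rather than sklearn.cross_validation, which the article's code uses:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import train_test_split
from sklearn.metrics import classification_report

# Same shape of problem as in the recipe: 30 features, 18 of them informative
x, y = make_classification(n_samples=500, n_features=30, n_informative=18,
                           n_redundant=3, n_repeated=3, flip_y=0.03, random_state=7)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=9)

# A plain random forest baseline with the same number of trees as our rotation forest
baseline = RandomForestClassifier(n_estimators=25, random_state=7)
baseline.fit(x_train, y_train)
print(classification_report(y_test, baseline.predict(x_test)))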

An Introduction to Kibana

Packt
28 Oct 2015
28 min read
In this article by Yuvraj Gupta, author of the book, Kibana Essentials, explains Kibana is a tool that is part of the ELK stack, which consists of Elasticsearch, Logstash, and Kibana. It is built and developed by Elastic. Kibana is a visualization platform that is built on top of Elasticsearch and leverages the functionalities of Elasticsearch. (For more resources related to this topic, see here.) To understand Kibana better, let's check out the following diagram: This diagram shows that Logstash is used to push data directly into Elasticsearch. This data is not limited to log data, but can include any type of data. Elasticsearch stores data that comes as input from Logstash, and Kibana uses the data stored in Elasticsearch to provide visualizations. So, Logstash provides an input stream of data to Elasticsearch, from which Kibana accesses the data and uses it to create visualizations. Kibana acts as an over-the-top layer of Elasticsearch, providing beautiful visualizations for data (structured or nonstructured) stored in it. Kibana is an open source analytics product used to search, view, and analyze data. It provides various types of visualizations to visualize data in the form of tables, charts, maps, histograms, and so on. It also provides a web-based interface that can easily handle a large amount of data. It helps create dashboards that are easy to create and helps query data in real time. Dashboards are nothing but an interface for underlying JSON documents. They are used for saving, templating, and exporting. They are simple to set up and use, which helps us play with data stored in Elasticsearch in minutes without requiring any coding. Kibana is an Apache-licensed product that aims to provide a flexible interface combined with the powerful searching capabilities of Elasticsearch. It requires a web server (included in the Kibana 4 package) and any modern web browser, that is, a browser that supports industry standards and renders the web page in the same way across all browsers, to work. It connects to Elasticsearch using the REST API. It helps to visualize data in real time with the use of dashboards to provide real-time insights. As Kibana uses the functionalities of Elasticsearch, it is easier to learn Kibana by understanding the core functionalities of Elasticsearch. In this article, we are going to take a look at the following topics: The basic concepts of Elasticsearch Installation of Java Installation of Elasticsearch Installation of Kibana Importing a JSON file into Elasticsearch Understanding Elasticsearch Elasticsearch is a search server built on top of Lucene (licensed under Apache), which is completely written in Java. It supports distributed searches in a multitenant environment. It is a scalable search engine allowing high flexibility of adding machines easily. It provides a full-text search engine combined with a RESTful web interface and JSON documents. Elasticsearch harnesses the functionalities of Lucene Java Libraries, adding up by providing proper APIs, scalability, and flexibility on top of the Lucene full-text search library. All querying done using Elasticsearch, that is, searching text, matching text, creating indexes, and so on, is implemented by Apache Lucene. Without a setup of an Elastic shield or any other proxy mechanism, any user with access to Elasticsearch API can view all the data stored in the cluster. 
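As a quick illustration of this point (assuming a default, unsecured installation listening on the standard port), anyone who can reach the node can query it directly; the exact output will depend on your cluster:

# Basic node and cluster information
curl -XGET 'http://localhost:9200/'

# Returns matching documents from every index in the cluster
curl -XGET 'http://localhost:9200/_search?pretty'

This is why production clusters are typically placed behind Shield or a reverse proxy, as mentioned above.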
The basic concepts of Elasticsearch Let's explore some of the basic concepts of Elasticsearch: Field: This is the smallest single unit of data stored in Elasticsearch. It is similar to a column in a traditional relational database. Every document contains key-value pairs, which are referred to as fields. Values in a field can contain a single value, such as integer [27], string ["Kibana"], or multiple values, such as array [1, 2, 3, 4, 5]. The field type is responsible for specifying which type of data can be stored in a particular field, for example, integer, string, date, and so on. Document: This is the simplest unit of information stored in Elasticsearch. It is a collection of fields. It is considered similar to a row of a table in a traditional relational database. A document can contain any type of entry, such as a document for a single restaurant, another document for a single cuisine, and yet another for a single order. Documents are in JavaScript Object Notation (JSON), which is a language-independent data interchange format. JSON contains key-value pairs. Every document that is stored in Elasticsearch is indexed. Every document contains a type and an ID. An example of a document that has JSON values is as follows: { "name": "Yuvraj", "age": 22, "birthdate": "2015-07-27", "bank_balance": 10500.50, "interests": ["playing games","movies","travelling"], "movie": {"name":"Titanic","genre":"Romance","year" : 1997} } In the preceding example, we can see that the document supports JSON, having key-value pairs, which are explained as follows: The name field is of the string type The age field is of the numeric type The birthdate field is of the date type The bank_balance field is of the float type The interests field contains an array The movie field contains an object (dictionary) Type: This is similar to a table in a traditional relational database. It contains a list of fields, which is defined for every document. A type is a logical segregation of indexes, whose interpretation/semantics entirely depends on you. For example, you have data about the world and you put all your data into an index. In this index, you can define a type for continent-wise data, another type for country-wise data, and a third type for region-wise data. Types are used with a mapping API; it specifies the type of its field. An example of type mapping is as follows: { "user": { "properties": { "name": { "type": "string" }, "age": { "type": "integer" }, "birthdate": { "type": "date" }, "bank_balance": { "type": "float" }, "interests": { "type": "string" }, "movie": { "properties": { "name": { "type": "string" }, "genre": { "type": "string" }, "year": { "type": "integer" } } } } } } Now, let's take a look at the core data types specified in Elasticsearch, as follows: Type Definition string This contains text, for example, "Kibana" integer This contains a 32-bit integer, for example, 7 long This contains a 64-bit integer float IEEE float, for example, 2.7 double This is a double-precision float boolean This can be true or false date This is the UTC date/time, for example, "2015-06-30T13:10:10" geo_point This is the latitude or longitude Index: This is a collection of documents (one or more than one). It is similar to a database in the analogy with traditional relational databases. For example, you can have an index for user information, transaction information, and product type. An index has a mapping; this mapping is used to define multiple types. In other words, an index can contain single or multiple types. 
An index is defined by a name, which is always used whenever referring to an index to perform search, update, and delete operations for documents. You can define any number of indexes you require. Indexes also act as logical namespaces that map documents to primary shards, which contain zero or more replica shards for replicating data. With respect to traditional databases, the basic analogy is similar to the following: MySQL => Databases => Tables => Columns/Rows Elasticsearch => Indexes => Types => Documents with Fields You can store a single document or multiple documents within a type or index. As a document is within an index, it must also be assigned to a type within an index. Moreover, the maximum number of documents that you can store in a single index is 2,147,483,519 (2 billion 147 million), which is equivalent to Integer.Max_Value. ID: This is an identifier for a document. It is used to identify each document. If it is not defined, it is autogenerated for every document.The combination of index, type, and ID must be unique for each document. Mapping: Mappings are similar to schemas in a traditional relational database. Every document in an index has a type. A mapping defines the fields, the data type for each field, and how the field should be handled by Elasticsearch. By default, a mapping is automatically generated whenever a document is indexed. If the default settings are overridden, then the mapping's definition has to be provided explicitly. Node: This is a running instance of Elasticsearch. Each node is part of a cluster. On a standalone machine, each Elasticsearch server instance corresponds to a node. Multiple nodes can be started on a single standalone machine or a single cluster. The node is responsible for storing data and helps in the indexing/searching capabilities of a cluster. By default, whenever a node is started, it is identified and assigned a random Marvel Comics character name. You can change the configuration file to name nodes as per your requirement. A node also needs to be configured in order to join a cluster, which is identifiable by the cluster name. By default, all nodes join the Elasticsearch cluster; that is, if any number of nodes are started up on a network/machine, they will automatically join the Elasticsearch cluster. Cluster: This is a collection of nodes and has one or multiple nodes; they share a single cluster name. Each cluster automatically chooses a master node, which is replaced if it fails; that is, if the master node fails, another random node will be chosen as the new master node, thus providing high availability. The cluster is responsible for holding all of the data stored and provides a unified view for search capabilities across all nodes. By default, the cluster name is Elasticsearch, and it is the identifiable parameter for all nodes in a cluster. All nodes, by default, join the Elasticsearch cluster. While using a cluster in the production phase, it is advisable to change the cluster name for ease of identification, but the default name can be used for any other purpose, such as development or testing.The Elasticsearch cluster contains single or multiple indexes, which contain single or multiple types. All types contain single or multiple documents, and every document contains single or multiple fields. Sharding: This is an important concept of Elasticsearch while understanding how Elasticsearch allows scaling of nodes, when having a large amount of data termed as big data. 
An index can store any amount of data, but if it exceeds its disk limit, then searching would become slow and be affected. For example, the disk limit is 1 TB, and an index contains a large number of documents, which may not fit completely within 1 TB in a single node. To counter such problems, Elasticsearch provides shards. These break the index into multiple pieces. Each shard acts as an independent index that is hosted on a node within a cluster. Elasticsearch is responsible for distributing shards among nodes. There are two purposes of sharding: allowing horizontal scaling of the content volume, and improving performance by providing parallel operations across various shards that are distributed on nodes (single or multiple, depending on the number of nodes running).Elasticsearch helps move shards among multiple nodes in the event of an addition of new nodes or a node failure. There are two types of shards, as follows: Primary shard: Every document is stored within a primary index. By default, every index has five primary shards. This parameter is configurable and can be changed to define more or fewer shards as per the requirement. A primary shard has to be defined before the creation of an index. If no parameters are defined, then five primary shards will automatically be created.Whenever a document is indexed, it is usually done on a primary shard initially, followed by replicas. The number of primary shards defined in an index cannot be altered once the index is created. Replica shard: Replica shards are an important feature of Elasticsearch. They help provide high availability across nodes in the cluster. By default, every primary shard has one replica shard. However, every primary shard can have zero or more replica shards as required. In an environment where failure directly affects the enterprise, it is highly recommended to use a system that provides a failover mechanism to achieve high availability. To counter this problem, Elasticsearch provides a mechanism in which it creates single or multiple copies of indexes, and these are termed as replica shards or replicas. A replica shard is a full copy of the primary shard. Replica shards can be dynamically altered. Now, let's see the purposes of creating a replica. It provides high availability in the event of failure of a node or a primary shard. If there is a failure of a primary shard, replica shards are automatically promoted to primary shards. Increase performance by providing parallel operations on replica shards to handle search requests.A replica shard is never kept on the same node as that of the primary shard from which it was copied. Inverted index: This is also a very important concept in Elasticsearch. It is used to provide fast full-text search. Instead of searching text, it searches for an index. It creates an index that lists unique words occurring in a document, along with the document list in which each word occurs. For example, suppose we have three documents. 
They have a text field, and it contains the following: I am learning Kibana Kibana is an amazing product Kibana is easy to use To create an inverted index, the text field is broken into words (also known as terms), a list of unique words is created, and also a listing is done of the document in which the term occurs, as shown in this table: Term Doc 1 Doc 2 Doc 3 I X     Am X     Learning X     Kibana X X X Is   X X An   X   Amazing   X   Product   X   Easy     X To     X Use     X Now, if we search for is Kibana, Elasticsearch will use an inverted index to display the results: Term Doc 1 Doc 2 Doc 3 Is   X X Kibana X X X With inverted indexes, Elasticsearch uses the functionality of Lucene to provide fast full-text search results. An inverted index uses an index based on keywords (terms) instead of a document-based index. REST API: This stands for Representational State Transfer. It is a stateless client-server protocol that uses HTTP requests to store, view, and delete data. It supports CRUD operations (short for Create, Read, Update, and Delete) using HTTP. It is used to communicate with Elasticsearch and is implemented by all languages. It communicates with Elasticsearch over port 9200 (by default), which is accessible from any web browser. Also, Elasticsearch can be directly communicated with via the command line using the curl command. cURL is a command-line tool used to send, view, or delete data using URL syntax, as followed by the HTTP structure. A cURL request is similar to an HTTP request, which is as follows: curl -X <VERB> '<PROTOCOL>://<HOSTNAME>:<PORT>/<PATH>?<QUERY_STRING>' -d '<BODY>' The terms marked within the <> tags are variables, which are defined as follows: VERB: This is used to provide an appropriate HTTP method, such as GET (to get data), POST, PUT (to store data), or DELETE (to delete data). PROTOCOL: This is used to define whether the HTTP or HTTPS protocol is used to send requests. HOSTNAME: This is used to define the hostname of a node present in the Elasticsearch cluster. By default, the hostname of Elasticsearch is localhost. PORT: This is used to define the port on which Elasticsearch is running. By default, Elasticsearch runs on port 9200. PATH: This is used to define the index, type, and ID where the documents will be stored, searched, or deleted. It is specified as index/type/ID. QUERY_STRING: This is used to define any additional query parameter for searching data. BODY: This is used to define a JSON-encoded request within the body. In order to put data into Elasticsearch, the following curl command is used: curl -XPUT 'http://localhost:9200/testing/test/1' -d '{"name": "Kibana" }' Here, testing is the name of the index, test is the name of the type within the index, and 1 indicates the ID number. To search for the preceding stored data, the following curl command is used: curl -XGET 'http://localhost:9200/testing/_search? The preceding commands are provided just to give you an overview of the format of the curl command. Prerequisites for installing Kibana 4.1.1 The following pieces of software need to be installed before installing Kibana 4.1.1: Java 1.8u20+ Elasticsearch v1.4.4+ A modern web browser—IE 10+, Firefox, Chrome, Safari, and so on The installation process will be covered separately for Windows and Ubuntu so that both types of users are able to understand the process of installation easily. Installation of Java In this section, JDK needs to be installed so as to access Elasticsearch. 
Oracle Java 8 (update 20 onwards) will be installed as it is the recommended version for Elasticsearch from version 1.4.4 onwards. Installation of Java on Ubuntu 14.04 Install Java 8 using the terminal and the apt package in the following manner: Add the Oracle Java Personal Package Archive (PPA) to the apt repository list: sudo add-apt-repository -y ppa:webupd8team/java In this case, we use a third-party repository; however, the WebUpd8 team is trusted to install Java. It does not include any Java binaries. Instead, the PPA directly downloads from Oracle and installs it. As shown in the preceding screenshot, you will initially be prompted for the password for running the sudo command (only when you have not logged in as root), and on successful addition to the repository, you will receive an OK message, which means that the repository has been imported. Update the apt package database to include all the latest files under the packages: sudo apt-get update Install the latest version of Oracle Java 8: sudo apt-get -y install oracle-java8-installer Also, during the installation, you will be prompted to accept the license agreement, which pops up as follows: To check whether Java has been successfully installed, type the following command in the terminal:java –version This signifies that Java has been installed successfully. Installation of Java on Windows We can install Java on windows by going through the following steps: Download the latest version of the Java JDK from the Sun Microsystems site at http://www.oracle.com/technetwork/java/javase/downloads/index.html:                                                                                     As shown in the preceding screenshot, click on the DOWNLOAD button of JDK to download. You will be redirected to the download page. There, you have to first click on the Accept License Agreement radio button, followed by the Windows version to download the .exe file, as shown here: Double-click on the file to be installed and it will open as an installer. Click on Next, accept the license by reading it, and keep clicking on Next until it shows that JDK has been installed successfully. Now, to run Java on Windows, you need to set the path of JAVA in the environment variable settings of Windows. Firstly, open the properties of My Computer. Select the Advanced system settings and then click on the Advanced tab, wherein you have to click on the environment variables option, as shown in this screenshot: After opening Environment Variables, click on New (under the System variables) and give the variable name as JAVA_HOME and variable value as C:Program FilesJavajdk1.8.0_45 (do check in your system where jdk has been installed and provide the path corresponding to the version installed as mentioned in system directory), as shown in the following screenshot: Then, double-click on the Path variable (under the System variables) and move towards the end of textbox. Insert a semicolon if it is not already inserted, and add the location of the bin folder of JDK, like this: %JAVA_HOME%bin. Next, click on OK in all the windows opened. Do not delete anything within the path variable textbox. To check whether Java is installed or not, type the following command in Command Prompt: java –version This signifies that Java has been installed successfully. Installation of Elasticsearch In this section, Elasticsearch, which is required to access Kibana, will be installed. 
Elasticsearch v1.5.2 will be installed, and this section covers the installation on Ubuntu and Windows separately. Installation of Elasticsearch on Ubuntu 14.04 To install Elasticsearch on Ubuntu, perform the following steps: Download Elasticsearch v 1.5.2 as a .tar file using the following command on the terminal:  curl -L -O https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-1.5.2.tar.gz Curl is a package that may not be installed on Ubuntu by the user. To use curl, you need to install the curl package, which can be done using the following command: sudo apt-get -y install curl Extract the downloaded .tar file using this command: tar -xvzf elasticsearch-1.5.2.tar.gzThis will extract the files and folder into the current working directory. Navigate to the bin directory within the elasticsearch-1.5.2 directory: cd elasticsearch-1.5.2/bin Now run Elasticsearch to start the node and cluster, using the following command:./elasticsearch The preceding screenshot shows that the Elasticsearch node has been started, and it has been given a random Marvel Comics character name. If this terminal is closed, Elasticsearch will stop running as this node will shut down. However, if you have multiple Elasticsearch nodes running, then shutting down a node will not result in shutting down Elasticsearch. To verify the Elasticsearch installation, open http://localhost:9200 in your browser. Installation of Elasticsearch on Windows The installation on Windows can be done by following similar steps as in the case of Ubuntu. To use curl commands on Windows, we will be installing GIT. GIT will also be used to import a sample JSON file into Elasticsearch using elasticdump, as described in the Importing a JSON file into Elasticsearch section. Installation of GIT To run curl commands on Windows, first download and install GIT, then perform the following steps: Download the GIT ZIP package from https://git-scm.com/download/win. Double-click on the downloaded file, which will walk you through the installation process. Keep clicking on Next by not changing the default options until the Finish button is clicked on. To validate the GIT installation, right-click on any folder in which you should be able to see the options of GIT, such as GIT Bash, as shown in the following screenshot: The following are the steps required to install Elasticsearch on Windows: Open GIT Bash and enter the following command in the terminal:  curl –L –O https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-1.5.2.zip Extract the downloaded ZIP package by either unzipping it using WinRar, 7Zip, and so on (if you don't have any of these, download one of them) or using the following command in GIT Bash: unzip elasticsearch-1.5.2.zip This will extract the files and folder into the directory. Then click on the extracted folder and navigate through it to reach the bin folder. Click on the elasticsearch.bat file to run Elasticsearch. The preceding screenshot shows that the Elasticsearch node has been started, and it is given a random Marvel Comics character's name. Again, if this window is closed, Elasticsearch will stop running as this node will shut down. However, if you have multiple Elasticsearch nodes running, then shutting down a node will not result in shutting down Elasticsearch. To verify the Elasticsearch installation, open http://localhost:9200 in your browser. Installation of Kibana In this section, Kibana will be installed. 
We will install Kibana v4.1.1, and this section covers installations on Ubuntu and Windows separately. Installation of Kibana on Ubuntu 14.04 To install Kibana on Ubuntu, follow these steps: Download Kibana version 4.1.1 as a .tar file using the following command in the terminal:  curl -L -O https://download.elasticsearch.org/kibana/kibana/kibana-4.1.1-linux-x64.tar.gz Extract the downloaded .tar file using this command: tar -xvzf kibana-4.1.1-linux-x64.tar.gz The preceding command will extract the files and folder into the current working directory. Navigate to the bin directory within the kibana-4.1.1-linux-x64 directory: cd kibana-4.1.1-linux-x64/bin Now run Kibana to start the node and cluster using the following command: Make sure that Elasticsearch is running. If it is not running and you try to start Kibana, the following error will be displayed after you run the preceding command: To verify the Kibana installation, open http://localhost:5601 in your browser. Installation of Kibana on Windows To install Kibana on Windows, perform the following steps: Open GIT Bash and enter the following command in the terminal:  curl -L -O https://download.elasticsearch.org/kibana/kibana/kibana-4.1.1-windows.zip Extract the downloaded ZIP package by either unzipping it using WinRar or 7Zip (download it if you don't have it), or using the following command in GIT Bash: unzip kibana-4.1.1-windows.zip This will extract the files and folder into the directory. Then click on the extracted folder and navigate through it to get to the bin folder. Click on the kibana.bat file to run Kibana. Make sure that Elasticsearch is running. If it is not running and you try to start Kibana, the following error will be displayed after you click on the kibana.bat file: Again, to verify the Kibana installation, open http://localhost:5601 in your browser. Additional information You can change the Elasticsearch configuration for your production environment, wherein you have to change parameters such as the cluster name, node name, network address, and so on. This can be done using the information mentioned in the upcoming sections.. Changing the Elasticsearch configuration To change the Elasticsearch configuration, perform the following steps: Run the following command in the terminal to open the configuration file: sudo vi ~/elasticsearch-1.5.2/config/elasticsearch.yml Windows users can open the elasticsearch.yml file from the config folder. This will open the configuration file as follows: The cluster name can be changed, as follows: #cluster.name: elasticsearch to cluster.name: "your_cluster_name". In the preceding figure, the cluster name has been changed to test. Then, we save the file. To verify that the cluster name has been changed, run Elasticsearch as mentioned in the earlier section. Then open http://localhost:9200 in the browser to verify, as shown here: In the preceding screenshot, you can notice that cluster_name has been changed to test, as specified earlier. 
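Beyond the cluster name, a few other commonly adjusted settings live in the same elasticsearch.yml file. The snippet below is a hedged sketch for the 1.x series used in this article; the values shown are placeholders and should be adapted to your environment:

cluster.name: test
node.name: "node-1"
network.host: 192.168.0.10
http.port: 9200

After saving the file, restart Elasticsearch for the changes to take effect.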
Changing the Kibana configuration To change the Kibana configuration, follow these steps: Run the following command in the terminal to open the configuration file: sudo vi ~/kibana-4.1.1-linux-x64/config/kibana.yml Windows users can open the kibana.yml file from the config folder In this file, you can change various parameters such as the port on which Kibana works, the host address on which Kibana works, the URL of Elasticsearch that you wish to connect to, and so on For example, the port on which Kibana works can be changed by changing the port address. As shown in the following screenshot, port: 5601 can be changed to any other port, such as port: 5604. Then we save the file. To check whether Kibana is running on port 5604, run Kibana as mentioned earlier. Then open http://localhost:5604 in the browser to verify, as follows: Importing a JSON file into Elasticsearch To import a JSON file into Elasticsearch, we will use the elasticdump package. It is a set of import and export tools used for Elasticsearch. It makes it easier to copy, move, and save indexes. To install elasticdump, we will require npm and Node.js as prerequisites. Installation of npm In this section, npm along with Node.js will be installed. This section covers the installation of npm and Node.js on Ubuntu and Windows separately. Installation of npm on Ubuntu 14.04 To install npm on Ubuntu, perform the following steps: Add the official Node.js PPA: sudo curl --silent --location https://deb.nodesource.com/setup_0.12 | sudo bash - As shown in the preceding screenshot, the command will add the official Node.js repository to the system and update the apt package database to include all the latest files under the packages. At the end of the execution of this command, we will be prompted to install Node.js and npm, as shown in the following screenshot: Install Node.js by entering this command in the terminal: sudo apt-get install --yes nodejs This will automatically install Node.js and npm as npm is bundled within Node.js. To check whether Node.js has been installed successfully, type the following command in the terminal: node –v Upon successful installation, it will display the version of Node.js. Now, to check whether npm has been installed successfully, type the following command in the terminal: npm –v Upon successful installation, it will show the version of npm. Installation of npm on Windows To install npm on Windows, follow these steps: Download the Windows Installer (.msi) file by going to https://nodejs.org/en/download/. Double-click on the downloaded file and keep clicking on Next to install the software. To validate the successful installation of Node.js, right-click and select GIT Bash. In GIT Bash, enter this: node –v Upon successful installation, you will be shown the version of Node.js. To validate the successful installation of npm, right-click and select GIT Bash. In GIT Bash, enter the following line: npm –v Upon successful installation, it will show the version of npm. Installing elasticdump In this section, elasticdump will be installed. It will be used to import a JSON file into Elasticsearch. It requires npm and Node.js installed. This section covers the installation on Ubuntu and Windows separately. 
Installing elasticdump on Ubuntu 14.04 Perform these steps to install elasticdump on Ubuntu: Install elasticdump by typing the following command in the terminal: sudo npm install elasticdump -g Then run elasticdump by typing this command in the terminal: elasticdump Import a sample data (JSON) file into Elasticsearch, which can be downloaded from https://github.com/guptayuvraj/Kibana_Essentials and is named tweet.json. It will be imported into Elasticsearch using the following command in the terminal: elasticdump --bulk=true --input="/home/yuvraj/Desktop/tweet.json" --output=http://localhost:9200/ Here, input provides the location of the file, as shown in the following screenshot: As you can see, data is being imported to Elasticsearch from the tweet.json file, and the dump complete message is displayed when all the records are imported to Elasticsearch successfully. Elasticsearch should be running while importing the sample file. Installing elasticdump on Windows To install elasticdump on Windows, perform the following steps: Install elasticdump by typing the following command in GIT Bash: npm install elasticdump -g                                                                                                           Then run elasticdump by typing this command in GIT Bash: elasticdump Import the sample data (JSON) file into Elasticsearch, which can be downloaded from https://github.com/guptayuvraj/Kibana_Essentials and is named tweet.json. It will be imported to Elasticsearch using the following command in GIT Bash: elasticdump --bulk=true --input="C:UsersyguptaDesktoptweet.json" --output=http://localhost:9200/ Here, input provides the location of the file. The preceding screenshot shows data being imported to Elasticsearch from the tweet.json file, and the dump complete message is displayed when all the records are imported to Elasticsearch successfully. Elasticsearch should be running while importing the sample file. To verify that the data has been imported to Elasticsearch, open http://localhost:5601 in your browser, and this is what you should see: When Kibana is opened, you have to configure an index pattern. So, if data has been imported, you can enter the index name, which is mentioned in the tweet.json file as index: tweet. After the page loads, you can see to the left under Index Patterns the name of the index that has been imported (tweet). Now mention the index name as tweet. It will then automatically detect the timestamped field and will provide you with an option to select the field. If there are multiple fields, then you can select them by clicking on Time-field name, which will provide a drop-down list of all fields available, as shown here: Finally, click on Create to create the index in Kibana. After you have clicked on Create, it will display the various fields present in this index. If you do not get the options of Time-field name and Create after entering the index name as tweet, it means that the data has not been imported into Elasticsearch. Summary In this article, you learned about Kibana, along with the basic concepts of Elasticsearch. These help in the easy understanding of Kibana. We also looked at the prerequisites for installing Kibana, followed by a detailed explanation of how to install each component individually in Ubuntu and Windows. Resources for Article: Further resources on this subject: Understanding Ranges [article] Working On Your Bot [article] Welcome to the Land of BludBorne [article]

Putting Your Database at the Heart of Azure Solutions

Packt
28 Oct 2015
19 min read
In this article by Riccardo Becker, author of the book Learning Azure DocumentDB, we will see how to build a real scenario around an Internet of Things scenario. This scenario will build a basic Internet of Things platform that can help to accelerate building your own. In this article, we will cover the following: Have a look at a fictitious scenario Learn how to combine Azure components with DocumentDB Demonstrate how to migrate data to DocumentDB (For more resources related to this topic, see here.) Introducing an Internet of Things scenario Before we start exploring different capabilities to support a real-life scenario, we will briefly explain the scenario we will use throughout this article. IoT, Inc. IoT, Inc. is a fictitious start-up company that is planning to build solutions in the Internet of Things domain. The first solution they will build is a registration hub, where IoT devices can be registered. These devices can be diverse, ranging from home automation devices up to devices that control traffic lights and street lights. The main use case for this solution is offering the capability for devices to register themselves against a hub. The hub will be built with DocumentDB as its core component and some Web API to expose this functionality. Before devices can register themselves, they need to be whitelisted in order to prevent malicious devices to start registering. In the following screenshot, we see the high-level design of the registration requirement: The first version of the solution contains the following components: A Web API containing methods to whitelist, register, unregister, and suspend devices DocumentDB, containing all the device information including information regarding other Microsoft Azure resources Event Hub, a Microsoft Azure asset that enables scalable publish-subscribe mechanism to ingress and egress millions of events per second Power BI, Microsoft’s online offering to expose reporting capabilities and the ability to share reports Obviously, we will focus on the core of the solution which is DocumentDB but it is nice to touch some of the Azure components, as well to see how well they co-operate and how easy it is to set up a demonstration for IoT scenarios. The devices on the left-hand side are chosen randomly and will be mimicked by an emulator written in C#. The Web API will expose the functionality required to let devices register themselves at the solution and start sending data afterwards (which will be ingested to the Event Hub and reported using Power BI). Technical requirements To be able to service potentially millions of devices, it is necessary that registration request from a device is being stored in a separate collection based on the country where the device is located or manufactured. Every device is being modeled in the same way, whereas additional metadata can be provided upon registration or afterwards when updating. To achieve country-based partitioning, we will create a custom PartitionResolver to achieve this goal. To extend the basic security model, we reduce the amount of sensitive information in our configuration files. Enhance searching capabilities because we want to service multiple types of devices each with their own metadata and device-specific information. Querying on all the information is desired to support full-text search and enable users to quickly search and find their devices. Designing the model Every device is being modeled similar to be able to service multiple types of devices. 
The device model contains at least the deviceid and a location. Furthermore, the device model contains a dictionary where additional device properties can be stored. The next code snippet shows the device model: [JsonProperty("id")]         public string DeviceId { get; set; }         [JsonProperty("location")]         public Point Location { get; set; }         //practically store any metadata information for this device         [JsonProperty("metadata")]         public IDictionary<string, object> MetaData { get; set; } The Location property is of type Microsoft.Azure.Documents.Spatial.Point because we want to run spatial queries later on in this section, for example, getting all the devices within 10 kilometers of a building. Building a custom partition resolver To meet the first technical requirement (partition data based on the country), we need to build a custom partition resolver. To be able to build one, we need to implement the IPartitionResolver interface and add some logic. The resolver will take the Location property of the device model and retrieves the country that corresponds with the latitude and longitude provided upon registration. In the following code snippet, you see the full implementation of the GeographyPartitionResolver class: public class GeographyPartitionResolver : IPartitionResolver     {         private readonly DocumentClient _client;         private readonly BingMapsHelper _helper;         private readonly Database _database;           public GeographyPartitionResolver(DocumentClient client, Database database)         {             _client = client;             _database = database;             _helper = new BingMapsHelper();         }         public object GetPartitionKey(object document)         {             //get the country for this document             //document should be of type DeviceModel             if (document.GetType() == typeof(DeviceModel))             {                 //get the Location and translate to country                 var country = _helper.GetCountryByLatitudeLongitude(                     (document as DeviceModel).Location.Position.Latitude,                     (document as DeviceModel).Location.Position.Longitude);                 return country;             }             return String.Empty;         }           public string ResolveForCreate(object partitionKey)         {             //get the country for this partitionkey             //check if there is a collection for the country found             var countryCollection = _client.CreateDocumentCollectionQuery(database.SelfLink).            ToList().Where(cl => cl.Id.Equals(partitionKey.ToString())).FirstOrDefault();             if (null == countryCollection)             {                 countryCollection = new DocumentCollection { Id = partitionKey.ToString() };                 countryCollection =                     _client.CreateDocumentCollectionAsync(_database.SelfLink, countryCollection).Result;             }             return countryCollection.SelfLink;         }           /// <summary>         /// Returns a list of collectionlinks for the designated partitionkey (one per country)         /// </summary>         /// <param name="partitionKey"></param>         /// <returns></returns>         public IEnumerable<string> ResolveForRead(object partitionKey)         {             var countryCollection = _client.CreateDocumentCollectionQuery(_database.SelfLink).             
ToList().Where(cl => cl.Id.Equals(partitionKey.ToString())).FirstOrDefault();               return new List<string>             {                 countryCollection.SelfLink             };         }     } In order to have the DocumentDB client use this custom PartitionResolver, we need to assign it. The code is as follows: GeographyPartitionResolver resolver = new GeographyPartitionResolver(docDbClient, _database);   docDbClient.PartitionResolvers[_database.SelfLink] = resolver; //Adding a typical device and have the resolver sort out what //country is involved and whether or not the collection already //exists (and create a collection for the country if needed), use //the next code snippet. var deviceInAmsterdam = new DeviceModel             {                 DeviceId = Guid.NewGuid().ToString(),                 Location = new Point(4.8951679, 52.3702157)             };   Document modelAmsDocument = docDbClient.CreateDocumentAsync(_database.SelfLink,                 deviceInAmsterdam).Result;             //get all the devices in Amsterdam            var doc = docDbClient.CreateDocumentQuery<DeviceModel>(                 _database.SelfLink, null, resolver.GetPartitionKey(deviceInAmsterdam)); Now that we have created a country-based PartitionResolver, we can start working on the Web API that exposes the registration method. Building the Web API A Web API is an online service that can be used by any clients running any framework that supports the HTTP programming stack. Currently, REST is a way of interacting with APIs so that we will build a REST API. Building a good API should aim for platform independence. A well-designed API should also be able to extend and evolve without affecting existing clients. First, we need to whitelist the devices that should be able to register themselves against our device registry. The whitelist should at least contain a device ID, a unique identifier for a device that is used to match during the whitelisting process. A good candidate for a device ID is the mac address of the device or some random GUID. Registering a device The registration Web API contains a POST method that does the actual registration. First, it creates access to an Event Hub (not explained here) and stores the credentials needed inside the DocumentDB document. The document is then created inside the designated collection (based on the location). To learn more about Event Hubs, please visit https://azure.microsoft.com/en-us/services/event-hubs/.  
[Route("api/registration")]         [HttpPost]         public async Task<IHttpActionResult> Post([FromBody]DeviceModel value)         {             //add the device to the designated documentDB collection (based on country)             try             { var serviceUri = ServiceBusEnvironment.CreateServiceUri("sb", serviceBusNamespace,                     String.Format("{0}/publishers/{1}", "telemetry", value.DeviceId))                     .ToString()                     .Trim('/');                 var sasToken = SharedAccessSignatureTokenProvider.GetSharedAccessSignature(EventHubKeyName,                     EventHubKey, serviceUri, TimeSpan.FromDays(365 * 100)); // hundred years will do                 //this token can be used by the device to send telemetry                 //this token and the eventhub name will be saved with the metadata of the document to be saved to DocumentDB                 value.MetaData.Add("Namespace", serviceBusNamespace);                 value.MetaData.Add("EventHubName", "telemetry");                 value.MetaData.Add("EventHubToken", sasToken);                 var document = await docDbClient.CreateDocumentAsync(_database.SelfLink, value);                 return Created(document.ContentLocation, value);            }             catch (Exception ex)             {                 return InternalServerError(ex);             }         } After this registration call, the right credentials on the Event Hub have been created for this specific device. The device is now able to ingress data to the Event Hub and have consumers like Power BI consume the data and present it. Event Hubs is a highly scalable publish-subscribe event ingestor. It can collect millions of events per second so that you can process and analyze the massive amounts of data produced by your connected devices and applications. Once collected into Event Hubs, you can transform and store the data by using any real-time analytics provider or with batching/storage adapters. At the time of writing, Microsoft announced the release of Azure IoT Suite and IoT Hubs. These solutions offer internet of things capabilities as a service and are well-suited to build our scenario as well. Increasing searching We have seen how to query our documents and retrieve the information we need. For this approach, we need to understand the DocumentDB SQL language. Microsoft has an online offering that enables full-text search called Azure Search service. This feature enables us to perform full-text searches and it also includes search behaviours similar to search engines. We could also benefit from so called type-ahead query suggestions based on the input of a user. Imagine a search box on our IoT Inc. portal that offers free text searching while the user types and search for devices that include any of the search terms on the fly. Azure Search runs on Azure; therefore, it is scalable and can easily be upgraded to offer more search and storage capacity. Azure Search stores all your data inside an index, offering full-text search capabilities on your data. Setting up Azure Search Setting up Azure Search is pretty straightforward and can be done by using the REST API it offers or on the Azure portal. We will set up the Azure Search service through the portal and later on, we will utilize the REST API to start configuring our search service. We set up the Azure Search service through the Azure portal (http://portal.azure.com). Find the Search service and fill out some information. 
In the following screenshot, we can see how we have created the free tier for Azure Search: You can see that we use the Free tier for this scenario and that there are no datasources configured yet. We will do that know by using the REST API. We will use the REST API, since it offers more insight on how the whole concept works. We use Fiddler to create a new datasource inside our search environment. The following screenshot shows how to use Fiddler to create a datasource and add a DocumentDB collection: In the Composer window of Fiddler, you can see we need to POST a payload to the Search service we created earlier. The Api-Key is mandatory and also set the content type to be JSON. Inside the body of the request, the connection information to our DocumentDB environment is need and the collection we want to add (in this case, Netherlands). Now that we have added the collection, it is time to create an Azure Search index. Again, we use Fiddler for this purpose. Since we use the free tier of Azure Search, we can only add five indexes at most. For this scenario, we add an index on ID (device ID), location, and metadata. At the time of writing, Azure Search does not support complex types. Note that the metadata node is represented as a collection of strings. We could check in the portal to see if the creation of the index was successful. Go to the Search blade and select the Search service we have just created. You can check the indexes part to see whether the index was actually created. The next step is creating an indexer. An indexer connects the index with the provided data source. Creating this indexer takes some time. You can check in the portal if the indexing process was successful. We actually find that documents are part of the index now. If your indexer needs to process thousands of documents, it might take some time for the indexing process to finish. You can check the progress of the indexer using the REST API again. https://iotinc.search.windows.net/indexers/deviceindexer/status?api-version=2015-02-28 Using this REST call returns the result of the indexing process and indicates if it is still running and also shows if there are any errors. Errors could be caused by documents that do not have the id property available. The final step involves testing to check whether the indexing works. We will search for a device ID, as shown in the next screenshot: In the Inspector tab, we can check for the results. It actually returns the correct document also containing the location field. The metadata is missing because complex JSON is not supported (yet) at the time of writing. Indexing complex JSON types is not supported yet. It is possible to add SQL queries to the data source. We could explicitly add a SELECT statement to surface the properties of the complex JSON we have like metadata or the Point property. Try adding additional queries to your data source to enable querying complex JSON types. Now that we have created an Azure Search service that indexes our DocumentDB collection(s), we can build a nice query-as-you-type field on our portal. Try this yourself. Enhancing security Microsoft Azure offers a capability to move your secrets away from your application towards Azure Key Vault. Azure Key Vault helps to protect cryptographic keys, secrets, and other information you want to store in a safe place outside your application boundaries (connectionstring are also good candidates). Key Vault can help us to protect the DocumentDB URI and its key. 
DocumentDB has no (in-place) encryption feature at the time of writing, although a lot of people already asked for it to be on the roadmap. Creating and configuring Key Vault Before we can use Key Vault, we need to create and configure it first. The easiest way to achieve this is by using PowerShell cmdlets. Please visit https://msdn.microsoft.com/en-us/mt173057.aspx to read more about PowerShell. The following PowerShell cmdlets demonstrate how to set up and configure a Key Vault: Command Description Get-AzureSubscription This command will prompt you to log in using your Microsoft Account. It returns a list of all Azure subscriptions that are available to you. Select-AzureSubscription -SubscriptionName "Windows Azure MSDN Premium" This tells PowerShell to use this subscription as being subject to our next steps. Switch-AzureMode AzureResourceManager New-AzureResourceGroup –Name 'IoTIncResourceGroup' –Location 'West Europe' This creates a new Azure Resource Group with a name and a location. New-AzureKeyVault -VaultName 'IoTIncKeyVault' -ResourceGroupName 'IoTIncResourceGroup' -Location 'West Europe' This creates a new Key Vault inside the resource group and provide a name and location. $secretvalue = ConvertTo-SecureString '<DOCUMENTDB KEY>' -AsPlainText –Force This creates a security string for my DocumentDB key. $secret = Set-AzureKeyVaultSecret -VaultName 'IoTIncKeyVault' -Name 'DocumentDBKey' -SecretValue $secretvalue This creates a key named DocumentDBKey into the vault and assigns it the secret value we have just received. Set-AzureKeyVaultAccessPolicy -VaultName 'IoTIncKeyVault' -ServicePrincipalName <SPN> -PermissionsToKeys decrypt,sign This configures the application with the Service Principal Name <SPN> to get the appropriate rights to decrypt and sign Set-AzureKeyVaultAccessPolicy -VaultName 'IoTIncKeyVault' -ServicePrincipalName <SPN> -PermissionsToSecrets Get This configures the application with SPN to also be able to get a key. Key Vault must be used together with Azure Active Directory to work. The SPN we need in the steps for powershell is actually is a client ID of an application I have set up in my Azure Active Directory. Please visit https://azure.microsoft.com/nl-nl/documentation/articles/active-directory-integrating-applications/ to see how you can create an application. Make sure to copy the client ID (which is retrievable afterwards) and the key (which is not retrievable afterwards). We use these two pieces of information to take the next step. Using Key Vault from ASP.NET In order to use the Key Vault we have created in the previous section, we need to install some NuGet packages into our solution and/or projects: Install-Package Microsoft.IdentityModel.Clients.ActiveDirectory -Version 2.16.204221202   Install-Package Microsoft.Azure.KeyVault These two packages enable us to use AD and Key Vault from our ASP.NET application. The next step is to add some configuration information to our web.config file: <add key="ClientId" value="<CLIENTID OF THE APP CREATED IN AD" />     <add key="ClientSecret" value="<THE SECRET FROM AZURE AD PORTAL>" />       <!-- SecretUri is the URI for the secret in Azure Key Vault -->     <add key="SecretUri" value="https://iotinckeyvault.vault.azure.net:443/secrets/DocumentDBKey" /> If you deploy the ASP.NET application to Azure, you could even configure these settings from the Azure portal itself, completely removing this from the web.config file. This technique adds an additional ring of security around your application. 
The following code snippet shows how to use AD and Key Vault inside the registration functionality of our scenario: //no more keys in code or .config files. Just a appid, secret and the unique URL to our key (SecretUri). When deploying to Azure we could             //even skip this by setting appid and clientsecret in the Azure Portal.             var kv = new KeyVaultClient(new KeyVaultClient.AuthenticationCallback(Utils.GetToken));             var sec = kv.GetSecretAsync(WebConfigurationManager.AppSettings["SecretUri"]).Result.Value; The Utils.GetToken method is shown next. This method retrieves an access token from AD by supplying the ClientId and the secret. Since we configured Key Vault to allow this application to get the keys, the call to GetSecretAsync() will succeed. The code is as follows: public async static Task<string> GetToken(string authority, string resource, string scope)         {             var authContext = new AuthenticationContext(authority);             ClientCredential clientCred = new ClientCredential(WebConfigurationManager.AppSettings["ClientId"],                         WebConfigurationManager.AppSettings["ClientSecret"]);             AuthenticationResult result = await authContext.AcquireTokenAsync(resource, clientCred);               if (result == null)                 throw new InvalidOperationException("Failed to obtain the JWT token");             return result.AccessToken;         } Instead of storing the key to DocumentDB somewhere in code or in the web.config file, it is now moved away to Key Vault. We could do the same with the URI to our DocumentDB and with other sensitive information as well (for example, storage account keys or connection strings). Encrypting sensitive data The documents we created in the previous section contains sensitive data like namespaces, Event Hub names, and tokens. We could also use Key Vault to encrypt those specific values to enhance our security. In case someone gets hold of a document containing the device information, he is still unable to mimic this device since the keys are encrypted. Try to use Key Vault to encrypt the sensitive information that is stored in DocumentDB before it is saved in there. Migrating data This section discusses how to use a tool to migrate data from an existing data source to DocumentDB. For this scenario, we assume that we already have a large datastore containing existing devices and their registration information (Event Hub connection information). In this section, we will see how to migrate an existing data store to our new DocumentDB environment. We use the DocumentDB Data Migration Tool for this. You can download this tool from the Microsoft Download Center (http://www.microsoft.com/en-us/download/details.aspx?id=46436) or from GitHub if you want to check the code. The tool is intuitive and enables us to migrate from several datasources: JSON files MongoDB SQL Server CSV files Azure Table storage Amazon DynamoDB HBase DocumentDB collections To demonstrate the use, we migrate our existing Netherlands collection to our United Kingdom collection. Start the tool and enter the right connection string to our DocumentDB database. We do this for both our source and target information in the tool. The connection strings you need to provide should look like this: AccountEndpoint=https://<YOURDOCDBURL>;AccountKey=<ACCOUNTKEY>;Database=<NAMEOFDATABASE>. You can click on the Verify button to make sure these are correct. 
In the Source Information field, we provide the Netherlands as being the source to pull data from. In the Target Information field, we specify the United Kingdom as the target. In the following screenshot, you can see how these settings are provided in the migration tool for the source information: The following screenshot shows the settings for the target information: It is also possible to migrate data to a collection that is not created yet. The migration tool can do this if you enter a collection name that is not available inside your database. You also need to select the pricing tier. Optionally, setting the partition key could help to distribute your documents based on this key across all collections you add in this screen. This information is sufficient to run our example. Go to the Summary tab and verify the information you entered. Press Import to start the migration process. We can verify a successful import on the Import results pane. This example is a simple migration scenario but the tool is also capable of using complex queries to only migrate those documents that need to moved or migrated. Try migrating data from an Azure Table storage table to DocumentDB by using this tool. Summary In this article, we saw how to integrate DocumentDB with other Microsoft Azure features. We discussed how to setup the Azure Search service and how create an index to our collection. We also covered how to use the Azure Search feature to enable full-text search on our documents which could enable users to query while typing. Next, we saw how to add additional security to our scenario by using Key Vault. We also discussed how to create and configure Key Vault by using PowerShell cmdlets, and we saw how to enable our ASP.NET scenario application to make use of the Key Vault .NET SDK. Then, we discussed how to retrieve the sensitive information from Key Vault instead of configuration files. Finally, we saw how to migrate an existing data source to our collection by using the DocumentDB Data Migration Tool. Resources for Article: Further resources on this subject: Microsoft Azure – Developing Web API For Mobile Apps [article] Introduction To Microsoft Azure Cloud Services [article] Security In Microsoft Azure [article]

Making 3D Visualizations

Packt
26 Oct 2015
5 min read
Python has become the preferred language of data scientists for data analysis, visualization, and machine learning. It features numerical and mathematical toolkits such as NumPy, SciPy, scikit-learn, matplotlib, and pandas, as well as an R-like environment with IPython, all used for data analysis, visualization, and machine learning. In this article by Dimitry Foures and Giuseppe Vettigli, authors of the book Python Data Visualization Cookbook, Second Edition, we will see how visualization in 3D is sometimes effective and sometimes inevitable. In this article, you will learn how 3D bars are created.

(For more resources related to this topic, see here.)

Creating 3D bars

Although matplotlib is mainly focused on 2D plotting, there are different extensions that enable us to plot over geographical maps, integrate more with Excel, and plot in 3D. These extensions are called toolkits in the matplotlib world. A toolkit is a collection of specific functions that focuses on one topic, such as plotting in 3D. Popular toolkits are Basemap, GTK Tools, Excel Tools, Natgrid, AxesGrid, and mplot3d. We will explore mplot3d further in this recipe. The mpl_toolkits.mplot3d toolkit provides some basic 3D plotting. The supported plots are scatter, surf, line, and mesh. Although this is not the best 3D plotting library, it comes with matplotlib, and we are already familiar with its interface.

Getting ready

Basically, we still need to create a figure and add the desired axes to it. The difference is that we specify a 3D projection for the figure, and the axes we add are Axes3D. Now, we can use almost the same functions for plotting; of course, the difference is in the arguments passed, because we now have three axes for which we need to provide data. For example, the mpl_toolkits.mplot3d.Axes3D.plot function specifies the xs, ys, zs, and zdir arguments. All others are transferred directly to matplotlib.axes.Axes.plot. Let's explain these specific arguments:

xs, ys: These are the coordinates for the X and Y axes.
zs: These are the value(s) for the Z axis. There can be one for all points, or one for each point.
zdir: This chooses which dimension will be the z-axis (usually this is zs, but it can be xs or ys).

The mpl_toolkits.mplot3d.art3d module contains 3D artist code and functions to convert 2D artists into 3D versions that can be added to an Axes3D; its rotate_axes function reorders the coordinates so that the axes are rotated with zdir along. The default value is z. Prepending the axis with a '-' does the inverse transform, so zdir can be x, -x, y, -y, z, or -z.

How to do it...

This is the code to demonstrate the plotting concept explained in the preceding section:

import random

import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from mpl_toolkits.mplot3d import Axes3D

mpl.rcParams['font.size'] = 10

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

for z in [2011, 2012, 2013, 2014]:
    xs = xrange(1, 13)
    ys = 1000 * np.random.rand(12)
    color = plt.cm.Set2(random.choice(xrange(plt.cm.Set2.N)))
    ax.bar(xs, ys, zs=z, zdir='y', color=color, alpha=0.8)
    ax.xaxis.set_major_locator(mpl.ticker.FixedLocator(xs))
    ax.yaxis.set_major_locator(mpl.ticker.FixedLocator(ys))

ax.set_xlabel('Month')
ax.set_ylabel('Year')
ax.set_zlabel('Sales Net [usd]')

plt.show()

This code produces the following figure:

How it works...

We had to do the same preparation work as in the 2D world; the difference is that we specified a 3D projection for the figure and added an Axes3D instance to plot against.
Then, we generated random data for a supposed four years of sales (2011–2014). For each year, a single Z value (zs=z) is shared by all the bars in that series. The color is picked randomly from the Set2 color map, and each Z-ordered collection of xs, ys pairs is rendered as a bar series in that color.

There's more...

Other plot types from 2D matplotlib are also available here. For example, scatter() has a similar interface to plot(), but with an added size for the point marker. We are also familiar with contour, contourf, and bar. Types that are available only in 3D are wireframe, surface, and tri-surface plots. For example, the following code plots a tri-surface plot of the popular Pringle function or, more mathematically, a hyperbolic paraboloid:

from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
import matplotlib.pyplot as plt
import numpy as np

n_angles = 36
n_radii = 8

# An array of radii
# Does not include radius r=0, this is to eliminate duplicate points
radii = np.linspace(0.125, 1.0, n_radii)

# An array of angles
angles = np.linspace(0, 2*np.pi, n_angles, endpoint=False)

# Repeat all angles for each radius
angles = np.repeat(angles[..., np.newaxis], n_radii, axis=1)

# Convert polar (radii, angles) coords to cartesian (x, y) coords
# (0, 0) is added here. There are no duplicate points in the (x, y) plane
x = np.append(0, (radii*np.cos(angles)).flatten())
y = np.append(0, (radii*np.sin(angles)).flatten())

# Pringle surface
z = np.sin(-x*y)

fig = plt.figure()
ax = fig.gca(projection='3d')

ax.plot_trisurf(x, y, z, cmap=cm.jet, linewidth=0.2)

plt.show()

The code will give the following output:

Summary

Python Data Visualization Cookbook, Second Edition, is for developers who already know about Python programming in general. If you have heard about data visualization but don't know where to start, the book will guide you from the start and help you understand data, data formats, and data visualization, and how to use Python to visualize data. Many more visualization techniques are illustrated in the book's step-by-step, recipe-based approach to data visualization. The topics are explained sequentially as cookbook recipes consisting of a code snippet and the resulting visualization.

Resources for Article:

Further resources on this subject:

Basics of Jupyter Notebook and Python [article]
Asynchronous Programming with Python [article]
Introduction to Data Analysis and Libraries [article]

Configuring Brokers

Packt
21 Oct 2015
18 min read
In this article by Saurabh Minni, author of Apache Kafka Cookbook, we will cover the following topics:

Configuring basic settings
Configuring threads and performance
Configuring log settings
Configuring replica settings
Configuring the ZooKeeper settings
Configuring other miscellaneous parameters

(For more resources related to this topic, see here.)

This article explains the configuration of a Kafka broker. Before we get started with Kafka, it is critical to configure it to suit us best. The best part about Kafka is that it is highly configurable, and although most of the time the default settings will serve you well, when dealing with scale and performance you might want a configuration that suits your application best.

Configuring basic settings

Let's configure the basic settings for your Apache Kafka broker.

Getting ready

I assume you have already installed Kafka. Make a copy of the server.properties file from the config folder. Now, let's get cracking with your favorite editor.

How to do it...

Open your server.properties file:

The first configuration that you need to change is broker.id:
broker.id=0
Next, give a host name to your machine:
host.name=localhost
You also need to set the port number to listen on:
port=9092
Lastly, set the directory for data persistence:
log.dirs=/disk1/kafka-logs

How it works…

With these basic configuration parameters in place, your Kafka broker is ready to be set up. All you need to do is pass this new configuration file as a parameter when you start the broker. The important configurations used in the configuration file are explained here:

broker.id: This should be a non-negative integer ID. It should be unique within a cluster, as it is used for all intents and purposes as the name of the broker. It also allows the broker to be moved to a different host and/or port without additional changes on the consumer side. Its default value is 0.
host.name: The default value for this is null. If it's not specified, Kafka will bind to all interfaces on the system. If it's specified, it will bind only to that particular address. If you want clients to connect only to a particular interface, it is a good idea to specify the host name.
port: This defines the port number that the Kafka broker will listen on to accept client connections.
log.dirs: This tells the broker the directory where it should store files for the persistence of messages. You can specify multiple directories here, separated by commas. The default value for this is /tmp/kafka-logs.

There's more…

Kafka also lets you specify two more parameters, which are very interesting:

advertised.host.name: This is the hostname that is given out to producers, consumers, and other brokers to connect to. Usually, this is the same as host.name and you need not specify it.
advertised.port: This specifies the port that producers, consumers, and other brokers need to connect to. If not specified, it uses the one mentioned in the port configuration parameter.

The real use case for the preceding parameters is when you make use of bridged connections, where your internal host.name and port number might be different from the ones that external parties need to connect to.

Configuring threads and performance

When using Kafka, these are settings you usually need not modify. However, when you want to extract every last bit of performance from your machines, they come in handy.
Getting ready You are all set with your broker properties file and are set to edit it in your favorite editor. How to do it... Open your server.properties file. Change message.max.bytes: message.max.bytes=1000000 Set the number of network threads: num.network.threads=3 Set the number of IO threads: num.io.threads=8 Set the number of threads that perform background processing: background.threads=10 Set the maximum number of requests to be queued up: queued.max.requests=500 Set the send socket buffer size: socket.send.buffer.bytes=102400 Set the receive socket buffer size: socket.receive.buffer.bytes=102400 Set the maximum request size: socket.request.max.bytes=104857600 Set the number of partitions: num.partitions=1 How it works… These network and performance configurations are set to an optimal level for your application. You might need to experiment a little to come up with an optimal configuration. Here are some explanations for these confugurations: message.max.bytes: This sets the maximum size of the message that a server can receive. This should be set in order to prevent any producer from inadvertently sending extra large messages and swamping consumers. The default size should be set to 1000000. num.network.threads: This sets the number of threads running to handle a network request. If you have too many requests coming in, you need to change this value. Else, you are good to go in most use cases. The default value for this should be set to 3. num.io.threads: This sets the number of threads that are spawned for IO operations. This should be set to at least the number of disks that are present. The default value for this should be set to 8. background.threads: This sets the number of threads that run various background jobs. These include deleting old log files. The default value is 10 and you might not need to change this. queued.max.requests: This sets the size of the queue that holds pending messages, while others are processed by IO threads. If the queue is full, the network threads will stop accepting any more messages. If you have erratic loads in your application, you need to set it to some value at which this does not throttle. socket.send.buffer.bytes: This sets the SO_SNDBUFF buffer size, which is used for socket connections. socket.receive.buffer.bytes: This sets the SO_RCVBUFF buffer size, which is used for socket connections. socket.request.max.bytes: This sets the maximum request size for a server to receive. This should be smaller than the Java heap size that you have set. num.partitions: This sets the number of default partitions for any topic you create without explicitly mentioning any partition size. There's more You might also need to configure your Java installation for maximum performance. This includes settings for heap, socket size, and so on. Configuring log settings Log settings are perhaps the most important configurations that you need to change based on your system requirements. Getting ready Just open the server.properties file in your favorite editor. How to do it... Open your server.properties file. 
Here are the default values for it: Change the log.segment.bytes value: log.segment.bytes=1073741824 Set the log.roll.{ms,hours} value: log.roll.{ms,hours}=168 hours Set the log.cleanup.policy value: log.cleanup.policy=delete Set the log.retention.{ms,minutes,hours} value: log.retention.{ms,minutes,hours}=168 hours Set the log.retention.bytes value: log.retention.bytes=-1 Set the log.retention.check.interval.ms value: log.retention.check.interval.ms= 30000 Set the log.cleaner.enable value: log.cleaner.enable=false Set the log.cleaner.threads value: log.cleaner.threads=1 Set the log.cleaner.backoff.ms value: log.cleaner.backoff.ms=15000 Set the log.index.size.max.bytes value: log.index.size.max.bytes=10485760 Set the log.index.interval.bytes value: log.index.interval.bytes=4096 Set the log.flush.interval.messages value: log.flush.interval.messages=Long.MaxValue Set the log.flush.interval.ms value: log.flush.interval.ms=Long.MaxValue How it works… Here is the explanation of log settings: log.segment.bytes: This defines the maximum segment size in bytes. Once a segment reaches a particular size, a new segment file is created. A topic is stored as a bunch of segment files in a directory. This can also be set on a per topic basis. Its default value is 1 GB. log.roll.{ms,hours}: This sets the time period after which a new segment file is created even if it has not reached the required size limit. This setting can also be set on a per topic basis. Its default value is 7 days. log.cleanup.policy: The value for this can be either deleted or compacted. With the delete option set, log segments are deleted periodically when it reaches its time threshold or size limit. If a compact option is set, log compaction will be used to clean up obsolete records. This setting can be set on a per topic basis. log.retention.{ms,minutes,hours}: This sets the amount of time that logs segments are retained. This can be set on a per topic basis. The default value for this is 7 days. log.retention.bytes: This sets the maximum number of byte logs per partition that are retained before they are deleted. This value can be set for a per topic basis. When either of the log time or size limits are reached, segments are deleted. log.retention.check.interval.ms: This sets the time interval at which logs are checked for deletion to meet retention policies. The default value for this is 5 minutes. log.cleaner.enable: For log compaction to be enabled, this has to be set as true. log.cleaner.threads: This sets the number of threads that work to clean logs for compaction. log.cleaner.backoff.ms: This defines the interval at which logs check if any other logs need cleaning. log.index.size.max.bytes: This settings sets the maximum size allowed for the offset index of each log segment. This can be set for per topic basis as well. log.index.interval.bytes: This defines the byte interval at which a new entry is added to the offset index. For each fetch request, the broker performs a linear scan for a particular number of bytes to find the correct position in the log to begin and end a fetch. Setting this as a larger value will mean larger index files (and a bit more memory usage) but less scanning. log.flush.interval.messages: This is the number of messages that are kept in memory till they're flushed to the disk. Though this does not guarantee durability, it gives finer control. log.flush.interval.ms: This sets the time interval at which the messages are flushed to the disk. 
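Put together, the log-related keys discussed above would sit in server.properties along the following lines. This is only an illustrative sketch: the values are assumptions for a broker that rolls segments at 1 GB or 7 days and keeps roughly a week of data, not recommendations, so tune them to your own retention and durability requirements.

# Roll a new segment at 1 GB or after 7 days, whichever comes first
log.segment.bytes=1073741824
log.roll.hours=168

# Delete (rather than compact) old segments; keep about a week, or 50 GB per partition
log.cleanup.policy=delete
log.retention.hours=168
log.retention.bytes=53687091200
log.retention.check.interval.ms=300000

# Leave flushing to the operating system's page cache
log.flush.interval.messages=9223372036854775807
log.flush.interval.ms=9223372036854775807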
There's more Some other settings are listed at http://kafka.apache.org/documentation.html#brokerconfigs. See also More on log compassion is available at http://kafka.apache.org/documentation.html#compaction. Configuring replica settings You will also want set up a replica for reliability purposes. Let's see some of the important settings that you need to handle for replication to work best for you. Getting ready Open the server.properties file in your favorite editor. How to do it... Open your server.properties file. Here are default values for the settings: Set the default.replication.factor value: default.replication.factor=1 Set the replica.lag.time.max.ms value: replica.lag.time.max.ms=10000 Set the replica.lag.max.messages value: replica.lag.max.messages=4000 Set the replica.fetch.max.bytes value: replica.fetch.max.bytes=1048576 Set the replica.fetch.wait.max.ms value: replica.fetch.wait.max.ms=500 Set the num.replica.fetchers value: num.replica.fetchers=1 Set the replica.high.watermark.checkpoint.interval.ms value: replica.high.watermark.checkpoint.interval.ms=5000 Set the fetch.purgatory.purge.interval.requests value: fetch.purgatory.purge.interval.requests=1000 Set the producer.purgatory.purge.interval.requests value: producer.purgatory.purge.interval.requests=1000 Set the replica.socket.timeout.ms value: replica.socket.timeout.ms=30000 Set the replica.socket.receive.buffer.bytes value: replica.socket.receive.buffer.bytes=65536 How it works… Here is the explanation of the preceding settings: default.replication.factor: This sets the default replication factor for automatically created topics. replica.lag.time.max.ms: This is time period within which if a leader does not receive any fetch request, its moved out of in-sync replicas and is treated as dead. replica.lag.max.messages: This is maximum number of messages a follower can be behind the leader by before it is considered dead and not in-sync. replica.fetch.max.bytes: This sets the maximum number of bytes of data that a follower will fetch in a request from its leader. replica.fetch.wait.max.ms: This sets the maximum amount of time for the leader to respond to a replica's fetch request. num.replica.fetchers: This specifies the number of threads used to replicate messages from the leader. Increasing the number of threads increases the IO rate to a degree. replica.high.watermark.checkpoint.interval.ms: This specifies the frequency with which each replica saves its high watermark to disk for recovery. fetch.purgatory.purge.interval.requests: This sets the fetch request purgatory's purge interval. This purgatory is the place where the fetch requests are kept on hold till they can be serviced. producer.purgatory.purge.interval.requests: This sets the producer request purgatory's purge interval. This purgatory is the place where the producer requests are kept on hold till they have been serviced. There's more Some other settings are listed at http://kafka.apache.org/documentation.html#brokerconfigs. Configuring the ZooKeeper settings ZooKeeper is used in Kafka for cluster management and to maintain the details of topics. Getting ready Just open the server.properties file in your favorite editor. How to do it… Open your server.properties file. 
Here are the default values for the settings: Set the zookeeper.connect property: zookeeper.connect=127.0.0.1:2181,192.168.0.32:2181 Set the zookeeper.session.timeout.ms property: zookeeper.session.timeout.ms=6000 Set the zookeeper.connection.timeout.ms property: zookeeper.connection.timeout.ms=6000 Set the zookeeper.sync.time.ms property: zookeeper.sync.time.ms=2000 How it works… Here is the explanation of these settings: zookeeper.connect: This is where you specify the ZooKeeper connection string in the form of hostname:port. You can use comma-separated values to specify multiple ZooKeeper nodes. This ensures reliability and continuity of Kafka clusters even in the event of a ZooKeeper node being down. ZooKeeper allows you to use the chroot path to make a particular Kafka data available only under a particular path. This enables you to have the same ZooKeeper clusters support multiple Kafka clusters. Here is the method to specify connection a string in this case: host1:port1,host2:port2,host3:port3/chroot/path The preceding statement puts all the cluster data in the /chroot/path path. This path must be created prior to starting Kafka clusters and users must use the same string. zookeeper.session.timeout.ms: This specifies the time within which if the heartbeat from a server is not received, then it is considered dead. The value for this must be carefully selected because if this heartbeat has too long an interval, it will not be able to detect a dead server in time and also lead to issues. Also, if the time period is too small, a live server might be considered dead. zookeeper.connection.timeout.ms: This specifies the maximum connection time that a client waits to accept a connection. zookeeper.sync.time.ms property: This specifies the time period by which a ZooKeeper follower can be behind its leader The ZooKeeper management details from the Kafka perspective are highlighted at http://kafka.apache.org/documentation.html#zk. You can find ZooKeeper at https://zookeeper.apache.org/ See also Configuring other miscellaneous parameters Besides the configurations mentioned previously, there are some other configurations that also need to be set. Getting ready Open the server.properties file in your favorite editor. We will look at the default values of the properties in the following section. How to do it... 
Set the auto.create.topics.enable property: auto.create.topics.enable=true Set the controlled.shutdown.enable property: controlled.shutdown.enable=true Set the controlled.shutdown.max.retries property: controlled.shutdown.max.retries=3 Set the controlled.shutdown.retry.backoff.ms property: controlled.shutdown.retry.backoff.ms=5000 Set the auto.leader.rebalance.enable property: auto.leader.rebalance.enable=true Set the leader.imbalance.per.broker.percentage property: leader.imbalance.per.broker.percentage=10 Set the leader.imbalance.check.interval.seconds property: leader.imbalance.check.interval.seconds=300 Set the offset.metadata.max.bytes property: offset.metadata.max.bytes=4096 Set the max.connections.per.ip property: max.connections.per.ip=Int.MaxValue Set the connections.max.idle.ms property: connections.max.idle.ms=600000 Set the unclean.leader.election.enable property: unclean.leader.election.enable=true Set the offsets.topic.num.partitions property: offsets.topic.num.partitions=50 Set the offsets.topic.retention.minutes property: offsets.topic.retention.minutes=1440 Set the offsets.retention.check.interval.ms property: offsets.retention.check.interval.ms=600000 Set the offsets.topic.replication.factor property: offsets.topic.replication.factor=3 Set the offsets.topic.segment.bytes property: offsets.topic.segment.bytes=104857600 Set the offsets.load.buffer.size property: offsets.load.buffer.size=5242880 Set the offsets.commit.required.acks property: offsets.commit.required.acks=-1 Set the offsets.commit.timeout.ms property: offsets.commit.timeout.ms=5000 How it works… An explanation of the settings is as follows. auto.create.topics.enable: Setting this value to true will make sure that if you fetch metadata or produce messages for a nonexistent topic, it will automatically be created. Ideally, in a production environment, you should set this value to false. controlled.shutdown.enable: This is set to true to make sure that when shutdown is called on the broker, if it's the leader of any topic, then it gracefully moves all leaders to a different broker before it shuts down. This increases the availability of the system overall. controlled.shutdown.max.retries: This sets the maximum number of retries that the broker makes to perform a controlled shutdown before performing an unclean one. controlled.shutdown.retry.backoff.ms: This sets the backoff time between controlled shutdown retries. auto.leader.rebalance.enable: If this is set to true, the broker will automatically try to balance the leadership of partitions among other brokers by periodically giving leadership to the preferred replica of each partition if it's available. leader.imbalance.per.broker.percentage: This sets the percentage of leader imbalance that's allowed per broker. The cluster will rebalance the leadership if this ratio goes above the set value. leader.imbalance.check.interval.seconds: This defines the time period for checking leader imbalance. offset.metadata.max.bytes: This defines the maximum amount of metadata allowed to the client to be stored with their offset. max.connections.per.ip: This sets the maximum number of connections that the broker accepts from a given IP address. connections.max.idle.ms: This sets the maximum time till which the broker will be idle before it closes a socket connection unclean.leader.election.enable: This is set to true to allow replicas that are not in-sync replicas (ISR) in order to be allowed to become the leader. This can lead to data loss. 
This is the last option for the cluster, though. offsets.topic.num.partitions: This sets the number of partitions for the offset commit topic. This cannot be changed post deployment, so its suggested that the number be set to a higher limit. The default value for this is 50. offsets.topic.retention.minutes: This sets offsets that are older than present time be marked for deletion. Actual deletion occurs when a log cleaner run the compaction of an offset topic. offsets.retention.check.interval.ms: This sets the time interval for the checking of stale offsets. offsets.topic.replication.factor: This sets the replication factor for the offset commit topic. The higher the value, the higher the availability. If at the time of creation of an offset topic, the number of brokers is lower than the replication factor, the number of replicas created will be equal to the brokers. offsets.topic.segment.bytes: This sets the segment size for offset topics. This, if kept low, leads to faster log compaction and loads. offsets.load.buffer.size: This sets the buffer size that's to be used for reading offset segments into offset manager's cache. offsets.commit.required.acks: This sets the number of acknowledgements that are required before an offset commit can be accepted. offsets.commit.timeout.ms: This sets the time after which an offset commit will be performed in case the required number of replicas have not received the offset commit. See also There are more broker configurations that are available. Read more about them at http://kafka.apache.org/documentation.html#brokerconfigs. Summary In this article, we discussed setting basic configurations for the Kafka broker, configuring and managing threads, performance, logs, and replicas. We also discussed ZooKeeper settings that are used for cluster management and some miscellaneous parameter settings. Resources for Article: Further resources on this subject: Writing Consumers [article] Introducing Kafka [article] Testing With Groovy [article]

QlikView Tips and Tricks

Packt
20 Oct 2015
6 min read
In this article by Andrew Dove and Roger Stone, author of the book QlikView Unlocked, we will cover the following key topics: A few coding tips The surprising data sources Include files Change logs (For more resources related to this topic, see here.) A few coding tips There are many ways to improve things in QlikView. Some are techniques and others are simply useful things to know or do. Here are a few of our favourite ones. Keep the coding style constant There's actually more to this than just being a tidy developer. So, always code your function names in the same way—it doesn't matter which style you use (unless you have installation standards that require a particular style). For example, you could use MonthStart(), monthstart(), or MONTHSTART(). They're all equally valid, but for consistency, choose one and stick to it. Use MUST_INCLUDE rather than INCLUDE This feature wasn't documented at all until quite a late service release of v11.2; however, it's very useful. If you use INCLUDE and the file you're trying to include can't be found, QlikView will silently ignore it. The consequences of this are unpredictable, ranging from strange behaviour to an outright script failure. If you use MUST_INCLUDE, QlikView will complain that the included file is missing, and you can fix the problem before it causes other issues. Actually, it seems strange that INCLUDE doesn't do this, but Qlik must have its reasons. Nevertheless, always use MUST_INCLUDE to save yourself some time and effort. Put version numbers in your code QlikView doesn't have a versioning system as such, and we have yet to see one that works effectively with QlikView. So, this requires some effort on the part of the developer. Devise a versioning system and always place the version number in a variable that is displayed somewhere in the application. Updating this number every time you make a change doesn't matter, but ensure that it's updated for every release to the user and ties in with your own release logs. Do stringing in the script and not in screen objects We would have put this in anyway, but its place in the article was assured by a recent experience on a user site. They wanted four lines of address and a postcode strung together in a single field, with each part separated by a comma and a space. However, any field could contain nulls; so, to avoid addresses such as ',,,,' or ', Somewhere ,,,', there had be a check for null in every field as the fields were strung together. The table only contained about 350 rows, but it took 56 seconds to refresh on screen when the work was done in an expression in a straight table. Moving the expression to the script and presenting just the resulting single field on screen took only 0.14 seconds. (That's right; it's about a seventh of a second). Plus, it didn't adversely affect script performance. We can't think of a better example of improving screen performance. The surprising data sources QlikView will read database tables, spreadsheets, XML files, and text files, but did you know that it can also take data from a web page? If you need some standard data from the Internet, there's no need to create your own version. Just grab it from a web page! How about ISO Country Codes? Here's an example. Open the script and click on Web files… below Data from Filesto the right of the bottom section of the screen. This will open the File Wizard: Source dialogue, as in the following screenshot. 
Enter the URL where the table of data resides: Then, click on Next and in this case, select @2 under Tables, as shown in the following screenshot: Click on Finish and your script will look something similar to this: LOAD F1, Country, A2, A3, Number FROM [http://www.airlineupdate.com/content_public/codes/misc_codes/icao _nat.htm] (html, codepage is 1252, embedded labels, table is @2); Now, you've got a great lookup table in about 30 seconds; it will take another few seconds to clean it up for your own purposes. One small caveat though—web pages can change address, content, and structure, so it's worth putting in some validation around this if you think there could be any volatility. Include files We have already said that you should use MUST_INCLUDE rather than INCLUDE, but we're always surprised that many developers never use include files at all. If the same code needs to be used in more than one place, it really should be in an include file. Suppose that you have several documents that use C:QlikFilesFinanceBudgets.xlsx and that the folder name is hard coded in all of them. As soon as the file is moved to another location, you will have several modifications to make, and it's easy to miss changing a document because you may not even realise it uses the file. The solution is simple, very effective, and guaranteed to save you many reload failures. Instead of coding the full folder name, create something similar to this: LET vBudgetFolder='C:QlikFilesFinance'; Put the line into an include file, for instance, FolderNames.inc. Then, code this into each script as follows: $(MUST_INCLUDE=FolderNames.inc) Finally, when you want to refer to your Budgets.xlsx spreadsheet, code this: $(vBudgetFolder)Budgets.xlsx Now, if the folder path has to change, you only need to change one line of code in the include file, and everything will work fine as long as you implement include files in all your documents. Note that this works just as well for folders containing QVD files and so on. You can also use this technique to include LOAD from QVDs or spreadsheets because you should always aim to have just one version of the truth. Change logs Unfortunately, one of the things QlikView is not great at is version control. It can be really hard to see what has been done between versions of a document, and using the -prj folder feature can be extremely tedious and not necessarily helpful. So, this means that you, as the developer, need to maintain some discipline over version control. To do this, ensure that you have an area of comments that looks something similar to this right at the top of your script: // Demo.qvw // // Roger Stone - One QV Ltd - 04-Jul-2015 // // PURPOSE // Sample code for QlikView Unlocked - Chapter 6 // // CHANGE LOG // Initial version 0.1 // - Pull in ISO table from Internet and local Excel data // // Version 0.2 // Remove unused fields and rename incoming ISO table fields to // match local spreadsheet // Ensure that you update this every time you make a change. You could make this even more helpful by explaining why the change was made and not just what change was made. You should also comment the expressions in charts when they are changed. Summary In this article, we covered few coding tips, the surprising data sources, include files, and change logs. Resources for Article: Further resources on this subject: Qlik Sense's Vision [Article] Securing QlikView Documents [Article] Common QlikView script errors [Article]

Understanding Text Search and Hierarchies in SAP HANA

Packt
20 Oct 2015
9 min read
In this article by Vinay Singh, author of the book Real Time Analytics with SAP HANA, this article covers Full Text Search and hierarchies in SAP HANA, and how to create and use them in our data models. After completing this article, you should be able to: Create and use Full Text Search Create hierarchies—level and parent child hierarchies (For more resources related to this topic, see here.) Creating and using Full Text Search Before we proceed with the creation and use of Full Text Search, let's quickly go through the basic terms associated with it. They are as follows: Text Analysis: This is the process of analyzing unstructured text, extracting relevant information, and then transforming this information into structure information that can be leveraged in different ways. The scripts provide additional possibilities to analyze strings or large text columns by providing analysis rules for many industries in many languages in SAP HANA. Full Text Search: This capability of HANA helps to speed up search capabilities within large amounts of text data significantly. The primary function of Full Text Search is to optimize linguistic searches. Fuzzy Search: This functionality enables to find strings that match a pattern approximately (rather than exactly). It's a fault-tolerant search, meaning that a query returns records even if the search term contains additional or missing characters, or even spelling mistakes. It is an alternative to a non-fault tolerant SQL statement. The score() function: When using contains() in the where clause of a select statement, the score() function can be used to retrieve the score. This is a numeric value between 0.0 and 1.0. The score defines the similarity between the user input and the records returned by the search. A score of 0.0 means that there is no similarity. The higher the score, the more similar a record is to the search input. Some of the applied applications of fuzzy search could be: Fault-tolerant check for duplicate records. Its helps to prevent duplication entry in Systems by searching similar entries. Fault-tolerant search in text columns—for example, search documents on diode and find all documents that contain the term "triode". Fault-tolerant search in structure database content search for rhyming words, for example coffee Krispy biscuit and find toffee crisp biscuits (the standard example given by SAP). Let's see what are the use cases for text search: Combining structure and unstructured data Medicine and healthcare Patents Brand monitoring and the buying pattern of consumer Real-time analytics on a large volume of data Data from social media Finance data Sales optimization Monitoring and production planning The results of text analysis are stored in a table and therefore, can be leveraged in all the HANA- supported scenarios: Standard Analytics: Create analytical views and calculation views on top. For example, companies mentioned in news articles over time. Data mining, predictive: Using R, Predictive Analysis Library (PAL) functions. For example, clustering, time series analysis, and so on. Search-based applications: Create a search model and build a search UI with the HANA Info Access (InA) toolkit for HTML5. Text analysis results can be used to navigate and filter search results. For example, People finder, search UI for internal documents. The capabilities of HANA Full Text Search and text analysis are as follows: Native full text search Database text analysis The graphical modeling of search models Info Access toolkit for HTML5 UIs. 
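As a quick illustration of the contains() and score() pattern defined above — the table and column used here (CUSTOMERS, CUST_NAME) and the misspelled search term are hypothetical placeholders, not objects used later in this article — a fuzzy search can rank matches by similarity like this:

-- CUSTOMERS/CUST_NAME stand in for any table with a fuzzy search index on a text column
SELECT SCORE() AS similarity,
       CUST_NAME
  FROM CUSTOMERS
 WHERE CONTAINS(CUST_NAME, 'Jon Smiht', FUZZY(0.8))
 ORDER BY similarity DESC;

Exact matches score 1.0, while misspelled but similar entries still appear with a lower score, which is what makes the fault-tolerant duplicate check described earlier possible.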
The benefits of full text search: Extract unstructured content with no additional cost Combine structure and unstructured information for unified information access Less data duplication and transfer Harness the benefit of InA (Info Access toolkit ) for an HTML5 application The following are the supported data types by fuzzy search: Short text Text VARCHAR NVARCHAR Date Data with full text index. Enabling search option Before we can use the search option in any attribute or analytical view, we will need to enable this functionality in the SAP HANA Studio Preferences as shown in the following screenshot: We are well prepared to move ahead with the creation and use of Full Text search. Let's do this step by step as follows: Create the table that we will use to perform the Full Text Search statements: Create Schema <DEMO>; // I am creating , it would be already present from our previous exercises. SET SCHEMA DEMO; // Set the schema name Create a Column Table including FUZZY SEARCH indexed columns. DROP TABLE DEMO.searchtbl_FUZZY; CREATE COLUMN TABLE DEMO.searchtbl_FUZZY ( CUST_NAME TEXT FUZZY SEARCH INDEX ON, CUST_COUNTY TEXT FUZZY SEARCH INDEX ON, CUST_DEPT TEXT FUZZY SEARCH INDEX ON, ); Prepare the fuzzy search logic (SQL logic): Search for customers in the countries that contain the 'MAIN' word: SELECT score() AS score, * FROM searchtbl_FUZZY WHERE CONTAINS(cust_county, 'MAIN'); Search for customers in the countries that contain the 'MAIN' word but with Fuzzy parameter 0.4 SELECT score() AS score, * FROM searchtbl_FUZZY WHERE CONTAINS(cust_county, 'West', FUZZY(0.3)); Perform a fuzzy search for a customer working in a department that includes the department word : SELECT highlighted(cust_dept), score() AS score, * FROM searchtbl_FUZZY WHERE CONTAINS(cust_dept, 'Department', FUZZY(0.5)); Fuzzy search for all the columns by looking for the customer word: SELECT score() AS score, * FROM searchtbl_FUZZY WHERE CONTAINS(*, 'Customer', FUZZY(0.5)); Creating hierarchies Hierarchies are created to maintain data in a structured format, such as maintaining customer or employee data based on their roles and splitting the data based on geographies. Hierarchical data is very useful for organizational purposes during decision making. Two types of hierarchies can be created in SAP HANA: The level hierarchy Parent-child hierarchy The hierarchies are initially created in the attribute view and later can be combined in the analytic view or calculation view for consumption in a report as per business requirements. Let's create both types of hierarchies in attribute views. Creating level hierarchy Each level represents a position in the hierarchy. For example, a time dimension might have a hierarchy that represents data at the month, quarter, and year levels. Each level above the base level contains aggregate values for the levels below it. Create a new attribute view (for your own practice, I would suggest you to create a new one). You can also use an existing one. Use the SNWD_PD EPM sample tables. In output view, mark the following as output: In the semantic node of the view, create new hierarchy as shown in the following screenshot and fill the details: Save and Activate the view. Now the hierarchy is ready to be used in an analytical view. Add a client and node key again as output to your attribute view that you just created, that is AT_LEVEL_HIERARCY_DEMO, as we will use these two fields in Create an analytical view. It should look like the following screenshot. 
Add the attribute view created in the preceding step and the SNWD_SO_I table to the data foundation: Join client to client and product guide to node key:  Save and activate. Go to MS Excel | All Programs | Microsoft Office | Microsoft Excel 2010 then go to Data tab | From Other Sources | From Data Connection Wizard. You will get a new popup for Data Connection Wizard | Other/Advanced | SAP HANA MDX Provider: You will be asked to provide the connection details, fill the details, and test the connection (these are the same details that you used while adding the system to SAP HANA Studio). Data Connection Wizard will now ask you to choose the analytical view (choose the one that you just created in the preceding step): The preceding steps will take you to an excel sheet and you will see data as per the choices that you chose in the Pivot table field list: Create parent-child hierarchy The parent-child hierarchy is a simple, two-level hierarchy where the child element has an attribute containing the parent element. These two columns define the hierarchical relationships among the members of the dimension. The first column, called the member key column, identifies each dimension member. The other column, called the parent column, identifies the parent of each dimension member. The parent attribute determines the name of each level in the parent-child hierarchy and determines whether the data for parent members should be displayed  Let's create a parent-child hierarchy using the following steps: Create an attribute view. Create a table that has the parent-child information: The following is the sample code and the insert statement: CREATE COLUMN TABLE "DEMO"."CCTR_HIE"( "CC_CHILD" NVARCHAR(4), "CC_PARENT" NVARCHAR(4)); insert into "DEMO"."CCTR_HIE" values('','') insert into "DEMO"."CCTR_HIE" values('C11','c1'); insert into "DEMO"."CCTR_HIE" values('C12','c1'); insert into "DEMO"."CCTR_HIE" values('C13','c1'); insert into "DEMO"."CCTR_HIE" values('C14','c2'); insert into "DEMO"."CCTR_HIE" values('C21','c2'); insert into "DEMO"."CCTR_HIE" values('C22','c2'); insert into "DEMO"."CCTR_HIE" values('C31','c3'); insert into "DEMO"."CCTR_HIE" values('C1','c'); insert into "DEMO"."CCTR_HIE" values('C2','c'); insert into "DEMO"."CCTR_HIE" values('C3','c'); We will put the preceding table into our data foundation of attribute view as follows: Make CC_CHILD as the key attribute. Now let's create new hierarchy as shown in the following screenshot: Save and activate the hierarchy. Create a new analytical view and add the HIE_PARENT_CHILD_DEMO view and the CCTR_COST table in data foundation. Join CCTR to CCTR_CILD with many is to one relationship. Make sure that in the semantic node, COST is set as a measure. Save and Activate the analytical view. Preview the data. As per the business need, we can use one of the two hierarchies along with attribute view or analytical view. Summary In this article, we took a deep dive into Full Text Search, fuzzy logic, and hierarchies concepts. We learned how to create and use text search and fuzzy logic. The parent-child and level hierarchies were discussed in detail with a hands-on approach on both. Resources for Article: Further resources on this subject: Sabermetrics with Apache Spark [article] Meeting SAP Lumira [article] Achieving High-Availability on AWS Cloud [article]

An Overview of Oozie

Packt
19 Oct 2015
5 min read
In this article by Jagat Singh, the author of the book Apache Oozie Essentials, we will see a basic overview of Oozie and its concepts in brief. (For more resources related to this topic, see here.)
Concepts
Oozie is a workflow scheduler system to run Apache Hadoop jobs. Oozie workflow jobs are Directed Acyclic Graph (DAG) (https://en.wikipedia.org/wiki/Directed_acyclic_graph) representations of actions. Actions tell the job what to do. Oozie supports running jobs of various types, such as Java, MapReduce, Pig, Hive, Sqoop, Spark, and DistCp. The output of one action can be consumed by the next action to create a chained sequence. Oozie has a client-server architecture: we install the server, which stores the jobs, and we use the client to submit jobs to the server. Let's get an idea of a few basic concepts of Oozie.
Workflow
A workflow tells Oozie 'what' to do. It is a collection of actions arranged in the required dependency graph. As part of a workflow definition, we write some actions and call them in a certain order. These come in various types for the tasks we can do as part of a workflow, for example, the Hadoop filesystem action, Pig action, Hive action, MapReduce action, Spark action, and so on.
Coordinator
A coordinator tells Oozie 'when' to do it. Coordinators let us run inter-dependent workflows as data pipelines based on some starting criteria. Most Oozie jobs are triggered at a given scheduled time interval, or when an input dataset is present for triggering the job. The following are important definitions related to coordinators:
Nominal time: The scheduled time at which the job should execute. For example, we process press releases every day at 8:00 PM.
Actual time: The real time when the job ran. In some cases, if the input data does not arrive, the job might start late. This type of data-dependent job triggering is indicated by a done-flag (more on this later). The done-flag gives the signal to start the job execution.
The general skeleton template of a coordinator is shown in the following figure:
Bundles
Bundles tell Oozie which things to do together as a group. For example, a set of coordinators that can be run together to satisfy a given business requirement can be combined as a bundle.
Book case study
One of the main use cases of Hadoop is ETL data processing. Suppose that we work for a large consulting company and have won a project to set up a Big Data cluster inside a customer's data center. At a high level, the requirements are to set up an environment that will satisfy the following flow:
We get data from various sources into Hadoop (file-based loads, Sqoop-based loads)
We preprocess it with various scripts (Pig, Hive, MapReduce)
We insert that data into Hive tables for use by analysts and data scientists
Data scientists write machine learning models (Spark)
We will be using Oozie as our processing scheduling system to do all of the above. In our architecture, we have one landing server, which sits outside as the front door of the cluster. All source systems send files to us via scp, and we regularly (for example, nightly, to keep it simple) push them to HDFS using the hadoop fs -copyFromLocal command. This script is cron driven. It has very simple business logic: it runs every night at 8:00 PM and moves all the files it sees on the landing server into HDFS.
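The book does not list this push script itself; the following is only a rough sketch of what such a nightly job might look like, written here in Python. The landing directory, the HDFS target path, and the choice of Python over a shell script are all assumptions for illustration; the only real requirement is that something scheduled by cron wraps the hadoop fs -copyFromLocal command.

#!/usr/bin/env python
# Hypothetical sketch of the nightly landing-server push described above.
# LANDING_DIR and HDFS_TARGET are made-up paths; adjust them to your environment.
import os
import subprocess

LANDING_DIR = "/data/landing"   # local directory where source systems scp their files
HDFS_TARGET = "/landing"        # HDFS directory that the Oozie workflow will read

def push_files():
    for name in os.listdir(LANDING_DIR):
        local_path = os.path.join(LANDING_DIR, name)
        if not os.path.isfile(local_path):
            continue
        # Copy the file into HDFS; check_call raises an error if the copy fails.
        subprocess.check_call(
            ["hadoop", "fs", "-copyFromLocal", local_path, HDFS_TARGET])
        # Remove the local copy only after it has landed safely in HDFS.
        os.remove(local_path)

if __name__ == "__main__":
    push_files()

A crontab entry along the lines of 0 20 * * * /opt/scripts/push_to_hdfs.py (a hypothetical path) would then run it every night at 8:00 PM, matching the schedule described above.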
The Oozie part then works as follows:
Oozie picks up the files and cleans them using a Pig script that replaces all the comma (,) delimiters with pipes (|). We will write the same code using both Pig and MapReduce.
We then push those processed files into a Hive table.
For a different source system, which is a database-based MySQL table, we do a nightly Sqoop import when the load on the database is light. We extract all the records that were generated on the previous business day, and that output is also inserted into Hive tables.
Analysts and data scientists write their magical Hive scripts and Spark machine learning models on those Hive tables.
We will use Oozie to schedule all of these regular tasks.
Node types
A workflow is composed of nodes; the logical DAG of nodes represents the 'what' part of the work done by Oozie. Each node does its specified work and, on success, moves to one node or, on failure, moves to another node. For example, on success it goes to the OK node and on failure it goes to the Kill node. Nodes in an Oozie workflow are of the following types.
Control flow nodes
These nodes are responsible for defining the start, the end, and the control flow of what to do inside the workflow. They can be one of the following:
Start node
End node
Kill node
Decision node
Fork and Join nodes
Action nodes
Action nodes represent the actual processing tasks, which are executed when called. These are of various types, for example, the Pig action, Hive action, and MapReduce action.
Summary
In this article, we looked at the concepts of Oozie in brief. We also learned about the types of nodes in Oozie.
Resources for Article: Further resources on this subject: Introduction to Hadoop [article] Hadoop and HDInsight in a Heartbeat [article] Cloudera Hadoop and HP Vertica [article]

SQL Server with PowerShell

Packt
19 Oct 2015
8 min read
In this article by Donabel Santos, author of the book, SQL Server 2014 with Powershell v5 Cookbook explains scripts and snippets of code that accomplish basic SQL Server tasks using PowerShell. She discusses simple tasks such as Listing SQL Server Instances and Discovering SQL Server Services to make you comfortable working with SQL Server programmatically. However, even if ever you explore how to create some common database objects using PowerShell, keep in mind that PowerShell will not always be the best tool for the task. There will be tasks that are best completed using T-SQL. It is still good to know what is possible in PowerShell and how to do them, so you know that you have alternatives depending on your requirements or situation. For the recipes, we are going to use PowerShell ISE quite a lot. If you prefer running the script from the PowerShell console rather run running the commands from the ISE, you can save the scripts in a .ps1 file and run it from the PowerShell console. (For more resources related to this topic, see here.) Listing SQL Server Instances In this recipe, we will list all SQL Server Instances in the local network. Getting ready Log in to the server that has your SQL Server development instance as an administrator. How to do it... Let's look at the steps to list your SQL Server instances: Open PowerShell ISE as administrator. Let's use the Start-Service cmdlet to start the SQL Browser service: Import-Module SQLPS -DisableNameChecking #out of the box, the SQLBrowser is disabled. To enable: Set-Service SQLBrowser -StartupType Automatic #sql browser must be installed and running for us #to discover SQL Server instances Start-Service "SQLBrowser" Next, you need to create a ManagedComputer object to get access to instances. Type the following script and run: $instanceName = "localhost" $managedComputer = New-Object Microsoft.SqlServer.Management.Smo.Wmi.ManagedComputer $instanceName #list server instances $managedComputer.ServerInstances Your result should look similar to the one shown in the following screenshot: Notice that $managedComputer.ServerInstances gives you not only instance names, but also additional properties such as ServerProtocols, Urn, State, and so on. Confirm that these are the same instances you see from SQL Server Management Studio. Open SQL Server Management Studio. Go to Connect | Database Engine. In the Server Name dropdown, click on Browse for More. Select the Network Servers tab and check the instances listed. Your screen should look similar to this: How it works... All services in a Windows operating system are exposed and accessible using Windows Management Instrumentation (WMI). WMI is Microsoft's framework for listing, setting, and configuring any Microsoft-related resource. This framework follows Web-based Enterprise Management (WBEM). The DISTRIBUTED MANAGEMENT TASK FORCE, INC. (http://www.dmtf.org/standards/wbem) defines WBEM as follows: A set of management and Internet standard technologies developed to unify the management of distributed computing environments. WBEM provides the ability for the industry to deliver a well-integrated set of standard-based management tools, facilitating the exchange of data across otherwise disparate technologies and platforms. 
In order to access SQL Server WMI-related objects, you can create a WMI ManagedComputer instance: $managedComputer = New-Object Microsoft.SqlServer.Management.Smo.Wmi.ManagedComputer $instanceName The ManagedComputer object has access to a ServerInstance property, which in turn lists all available instances in the local network. These instances however are only identifiable if the SQL Server Browser service is running. The SQL Server Browser is a Windows Service that can provide information on installed instances in a box. You need to start this service if you want to list the SQL Server-related services. There's more... The Services instance of the ManagedComputer object can also provide similar information, but you will have to filter for the server type SqlServer: #list server instances $managedComputer.Services | Where-Object Type –eq "SqlServer" | Select-Object Name, State, Type, StartMode, ProcessId Your result should look like this: Instead of creating a WMI instance by using the New-Object method, you can also use the Get-WmiObject cmdlet when creating your variable. Get-WmiObject, however, will not expose exactly the same properties exposed by the Microsoft.SqlServer.Management.Smo.Wmi.ManagedComputer object. To list instances using Get-WmiObject, you will need to discover what namespace is available in your environment: $hostName = "localhost" $namespace = Get-WMIObject -ComputerName $hostName -Namespace rootMicrosoftSQLServer -Class "__NAMESPACE" | Where-Object Name -like "ComputerManagement*" #see matching namespace objects $namespace #see namespace names $namespace | Select-Object -ExpandProperty "__NAMESPACE" $namespace | Select-Object -ExpandProperty "Name" If you are using PowerShell v2, you will have to change the Where-Object cmdlet usage to use the curly braces {} and the $_ variable: Where-Object {$_.Name -like "ComputerManagement*" } For SQL Server 2014, the namespace value is: ROOTMicrosoftSQLServerComputerManagement12 This value can be derived from $namespace.__NAMESPACE and $namespace.Name. Once you have the namespace, you can use this with Get-WmiObject to retrieve the instances. We can use the SqlServiceType property to filter. According to MSDN (http://msdn.microsoft.com/en-us/library/ms179591.aspx), these are the values of SqlServiceType: SqlServiceType Description 1 SQL Server Service 2 SQL Server Agent Service 3 Full-Text Search Engine Service 4 Integration Services Service 5 Analysis Services Service 6 Reporting Services Service 7 SQL Browser Service Thus, to retrieve the SQL Server instances, we need to provide the full namespace ROOTMicrosoftSQLServerComputerManagement12. We also need to filter for SQL Server Service type, or SQLServiceType = 1. The code is as follows: Get-WmiObject -ComputerName $hostName -Namespace "$($namespace.__NAMESPACE)$($namespace.Name)" -Class SqlService | Where-Object SQLServiceType -eq 1 | Select-Object ServiceName, DisplayName, SQLServiceType | Format-Table –AutoSize Your result should look similar to the following screenshot: Yet another way to list all the SQL Server instances in the local network is by using the System.Data.Sql.SQLSourceEnumerator class, instead of ManagedComputer. 
This class has a static method called Instance.GetDataSources that will list all SQL Server instances: [System.Data.Sql.SqlDataSourceEnumerator]: :Instance.GetDataSources() | Format-Table -AutoSize When you execute, your result should look similar to the following: If you have multiple SQL Server versions, you can use the following code to display your instances: #list services using WMI foreach ($path in $namespace) { Write-Verbose "SQL Services in:$($path.__NAMESPACE)$($path.Name)" Get-WmiObject -ComputerName $hostName ` -Namespace "$($path.__NAMESPACE)$($path.Name)" ` -Class SqlService | Where-Object SQLServiceType -eq 1 | Select-Object ServiceName, DisplayName, SQLServiceType | Format-Table –AutoSize } Discovering SQL Server Services In this recipe, we will enumerate all SQL Server Services and list their statuses. Getting ready Check which SQL Server services are installed in your instance. Go to Start | Run and type services.msc. You should see a screen similar to this: How to do it... Let's assume you are running this script on the server box: Open PowerShell ISE as administrator. Add the following code and execute: Import-Module SQLPS -DisableNameChecking #you can replace localhost with your instance name $instanceName = "localhost" $managedComputer = New-Object Microsoft.SqlServer.Management.Smo.Wmi.ManagedComputer $instanceName #list services $managedComputer.Services | Select-Object Name, Type, ServiceState, DisplayName | Format-Table -AutoSize Your result will look similar to the one shown in the following screenshot: Items listed in your screen will vary depending on the features installed and running in your instance Confirm that these are the services that exist in your server. Check your services window. How it works... Services that are installed on a system can be queried using WMI. Specific services for SQL Server are exposed through SMO's WMI ManagedComputer object. Some of the exposed properties are as follows: ClientProtocols ConnectionSettings ServerAliases ServerInstances Services There's more... An alternative way to get SQL Server-related services is by using Get-WMIObject. We will need to pass in the host name as well as the SQL Server WMI Provider for the ComputerManagement namespace. For SQL Server 2014, this value is ROOTMicrosoftSQLServerComputerManagement12. The script to retrieve the services is provided here. Note that we are dynamically composing the WMI namespace. The code is as follows: $hostName = "localhost" $namespace = Get-WMIObject -ComputerName $hostName -NameSpace rootMicrosoftSQLServer -Class "__NAMESPACE" | Where-Object Name -like "ComputerManagement*" Get-WmiObject -ComputerName $hostname -Namespace "$($namespace.__NAMESPACE)$($namespace.Name)" -Class SqlService | Select-Object ServiceName If you have multiple SQL Server versions installed and want to see just the most recent version's services, you can limit to the latest namespace by adding Select-Object –Last 1: $namespace = Get-WMIObject -ComputerName $hostName -NameSpace rootMicrosoftSQLServer -Class "__NAMESPACE" | Where-Object Name -like "ComputerManagement*" | Select-Object –Last 1 Yet another alternative but less accurate way of listing possible SQL Server related services is the following snippet of code: #alterative - but less accurate Get-Service *SQL* This uses the Get-Service cmdlet and filters base on the service name. This is less accurate because this grabs all processes that have SQL in the name, but may not necessarily be related to SQL Server. 
For example, if you have MySQL installed, it will get picked up as a process. Conversely, this will not pick up SQL Server-related services that do not have SQL in the name, such as ReportServer.
Summary
You will find that many of these scripted tasks can be accomplished using PowerShell and SQL Management Objects (SMO). SMO is a library that exposes SQL Server classes that allow programmatic manipulation and automation of many database tasks. For some tasks, we will also explore alternative ways of accomplishing the same goals using different native PowerShell cmdlets. Now that we have a gist of SQL Server 2014 with PowerShell, let's build a full-fledged e-commerce project with SQL Server 2014 with Powershell v5 Cookbook.
Resources for Article: Further resources on this subject: Exploring Windows PowerShell 5.0 [article] Working with PowerShell [article] Installing/upgrading PowerShell [article]

Introducing Test-driven Machine Learning

Packt
14 Oct 2015
19 min read
In this article by Justin Bozonier, the author of the book Test Driven Machine Learning, we will see how to develop complex software (sometimes rooted in randomness) in small, controlled steps also it will guide you on how to begin developing solutions to machine learning problems using test-driven development (from here, this will be written as TDD). Mastering TDD is not something the book will achieve. Instead, the book will help you begin your journey and expose you to guiding principles, which you can use to creatively solve challenges as you encounter them. We will answer the following three questions in this article: What are TDD and behavior-driven development (BDD)? How do we apply these concepts to machine learning, and making inferences and predictions? How does this work in practice? (For more resources related to this topic, see here.) After having answers to these questions, we will be ready to move onto tackling real problems. The book is about applying these concepts to solve machine learning problems. This article is the largest theoretical explanation that we will have with the remainder of the theory being described by example. Due to the focus on application, you will learn much more than what you can learn about the theory of TDD and BDD. To read more about the theory and ideals, search the internet for articles written by the following: Kent Beck—The father of TDD Dan North—The father of BDD Martin Fowler—The father of refactoring, he has also created a large knowledge base, on these topics James Shore—one of the author of The Art of Agile Development, has a deep theoretical understanding of TDD, and explains the practical value of it quite well These concepts are incredibly simple and yet can take a lifetime to master. When applied to machine learning, we must find new ways to control and/or measure the random processes inherent in the algorithm. This will come up in this article as well as others. In the next section, we will develop a foundation for TDD and begin to explore its application. Test-driven development Kent Beck wrote in his seminal book on the topic that TDD consists of only two specific rules, which are as follows: Don't write a line of new code unless you first have a failing automated test Eliminate duplication This as he noted fairly quickly leads us to a mantra, really the mantra of TDD: Red, Green, Refactor. If this is a bit abstract, let me restate it that TDD is a software development process that enables a programmer to write code that specifies the intended behavior before writing any software to actually implement the behavior. The key value of TDD is that at each step of the way, you have working software as well as an itemized set of specifications. TDD is a software development process that requires the following: Writing code to detect the intended behavioral change. Rapid iteration cycle that produces working software after each iteration. It clearly defines what a bug is. If a test is not failing but a bug is found, it is not a bug. It is a new feature. Another point that Kent makes is that ultimately, this technique is meant to reduce fear in the development process. Each test is a checkpoint along the way to your goal. If you stray too far from the path and wind up in trouble, you can simply delete any tests that shouldn't apply, and then work your code back to a state where the rest of your tests pass. There's a lot of trial and error inherent in TDD, but the same matter applies to machine learning. 
The software that you design using TDD will also be modular enough to be able to have different components swapped in and out of your pipeline. You might be thinking that just thinking through test cases is equivalent to TDD. If you are like the most people, what you write is different from what you might verbally say, and very different from what you think. By writing the intent of our code before we write our code, it applies a pressure to our software design that prevents you from writing "just in case" code. By this I mean the code that we write just because we aren't sure if there will be a problem. Using TDD, we think of a test case, prove that it isn't supported currently, and then fix it. If we can't think of a test case, we don't add code. TDD can and does operate at many different levels of the software under development. Tests can be written against functions and methods, entire classes, programs, web services, neural networks, random forests, and whole machine learning pipelines. At each level, the tests are written from the perspective of the prospective client. How does this relate to machine learning? Lets take a step back and reframe what I just said. In the context of machine learning, tests can be written against functions, methods, classes, mathematical implementations, and the entire machine learning algorithms. TDD can even be used to explore technique and methods in a very directed and focused manner, much like you might use a REPL (an interactive shell where you can try out snippets of code) or the interactive (I)Python session. The TDD cycle The TDD cycle consists of writing a small function in the code that attempts to do something that we haven't programmed yet. These small test methods will have three main sections; the first section is where we set up our objects or test data; another section is where we invoke the code that we're testing; and the last section is where we validate that what happened is what we thought would happen. You will write all sorts of lazy code to get your tests to pass. If you are doing it right, then someone who is watching you should be appalled at your laziness and tiny steps. After the test goes green, you have an opportunity to refactor your code to your heart's content. In this context, refactor refers to changing how your code is written, but not changing how it behaves. Lets examine more deeply the three steps of TDD: Red, Green, and Refactor. Red First, create a failing test. Of course, this implies that you know what failure looks like in order to write the test. At the highest level in machine learning, this might be a baseline test where baseline is a better than random test. It might even be predicts random things, or even simpler always predicts the same thing. Is this terrible? Perhaps, it is to some who are enamored with the elegance and artistic beauty of his/her code. Is it a good place to start, though? Absolutely. A common issue that I have seen in machine learning is spending so much time up front, implementing the one true algorithm that hardly anything ever gets done. Getting to outperform pure randomness, though, is a useful change that can start making your business money as soon as it's deployed. Green After you have established a failing test, you can start working to get it green. If you start with a very high-level test, you may find that it helps to conceptually break that test up into multiple failing tests that are the lower-level concerns. 
I'll dive deeper into this later on in this article but for now, just know that you want to get your test passing as soon as possible; lie, cheat, and steal to get there. I promise that cheating actually makes your software's test suite that much stronger. Resist the urge to write the software in an ideal fashion. Just slap something together. You will be able to fix the issues in the next step. Refactor You got your test to pass through all the manners of hackery. Now, you get to refactor your code. Note that it is not to be interpreted loosely. Refactor specifically means to change your software without affecting its behavior. If you add the if clauses, or any other special handling, you are no longer refactoring. Then you write the software without tests. One way where you will know for sure that you are no longer refactoring is that you've broken previously passing tests. If this happens, we back up our changes until our tests pass again. It may not be obvious but this isn't all that it takes for you to know that you haven't changed behavior. Read Refactoring: Improving the Design of Existing Code, Martin Fowler for you to understand how much you should really care for refactoring. By the way of his illustration in this book, refactoring code becomes a set of forms and movements not unlike karate katas. This is a lot of general theory, but what does a test actually look like? How does this process flow in a real problem? Behavior-driven development BDD is the addition of business concerns to the technical concerns more typical of TDD. This came about as people became more experienced with TDD. They started noticing some patterns in the challenges that they were facing. One especially influential person, Dan North, proposed some specific language and structure to ease some of these issues. Some issues he noticed were the following: People had a hard time understanding what they should test next. Deciding what to name a test could be difficult. How much to test in a single test always seemed arbitrary. Now that we have some context, we can define what exactly BDD is. Simply put, it's about writing our tests in such a way that they will tell us the kind of behavior change they affect. A good litmus test might be asking oneself if the test you are writing would be worth explaining to a business stakeholder. How this solves the previous may not be completely obvious, but it may help to illustrate what this looks like in practice. It follows a structure of given, when, then. Committing to this style completely can require specific frameworks or a lot of testing ceremony. As a result, I loosely follow this in my tests as you will see soon. Here's a concrete example of a test description written in this style Given an empty dataset when the classifier is trained, it should throw an invalid operation exception. This sentence probably seems like a small enough unit of work to tackle, but notice that it's also a piece of work that any business user, who is familiar with the domain that you're working in, would understand and have an opinion on. You can read more about Dan North's point of view in this article on his website at dannorth.net/introducing-bdd/. The BDD adherents tend to use specialized tools to make the language and test result reports be as accessible to business stakeholders as possible. In my experience and from my discussions with others, this extra elegance is typically used so little that it doesn't seem worthwhile. 
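To make that given/when/then sentence concrete, here is a rough sketch of how it might turn into a plain nose-style test, written in the loose style described above rather than with a dedicated BDD framework. The NaiveClassifier and EmptyDatasetError names are invented purely for illustration; they are not code from the book.

# Hypothetical illustration: the classifier and the exception are made-up stand-ins.
class EmptyDatasetError(Exception):
    pass

class NaiveClassifier(object):
    def train(self, dataset):
        # Training on nothing is a programming error, so refuse loudly.
        if not dataset:
            raise EmptyDatasetError("cannot train on an empty dataset")

def given_an_empty_dataset_when_the_classifier_is_trained_test():
    # given
    classifier = NaiveClassifier()
    empty_dataset = []
    # when
    try:
        classifier.train(empty_dataset)
        raised = False
    except EmptyDatasetError:
        raised = True
    # then
    assert raised, 'Then it should raise an error about the empty dataset.'

Reading the test name together with the assert message tells the whole story, which is exactly what the given/when/then wording is meant to achieve.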
The approach you will learn in the book will take a simplicity first approach to make it as easy as possible for someone with zero background to get up to speed. With this in mind, lets work through an example. Our first test Let's start with an example of what a test looks like in Python. The main reason for using this is that while it is a bit of a pain to install a library, this library, in particular, will make everything that we do much simpler. The default unit test solution in Python requires a heavier set up. On top of this, by using nose, we can always mix in tests that use the built-in solution where we find that we need the extra features. First, install it like this: pip install nose If you have never used pip before, then it is time for you to know that it is a very simple way to install new Python libraries. Now, as a hello world style example, lets pretend that we're building a class that will guess a number using the previous guesses to inform it. This is the first simplest example to get us writing some code. We will use the TDD cycle that we discussed previously, and write our first test in painstaking detail. After we get through our first test and have something concrete to discuss, we will talk about the anatomy of the test that we wrote. First, we must write a failing test. The simplest failing test that I can think of is the following: def given_no_information_when_asked_to_guess_test(): number_guesser = NumberGuesser() result = number_guesser.guess() assert result is None, "Then it should provide no result." The context for assert is in the test name. Reading the test name and then the assert name should do a pretty good job of describing what is being tested. Notice that in my test, I instantiate a NumberGuesser object. You're not missing any steps; this class doesn't exist yet. This seems roughly like how I'd want to use it. So, it's a great place to start with. Since it doesn't exist, wouldn't you expect this test to fail? Lets test this hypothesis. To run the test, first make sure your test file is saved so that it ends in _tests.py. From the directory with the previous code, just run the following: nosetests When I do this, I get the following result: Here's a lot going on here, but the most informative part is near the end. The message is saying that NumberGuesser does not exist yet, which is exactly what I expected since we haven't actually written the code yet. Throughout the book, we'll reduce the detail of the stack traces that we show. For now, we'll keep things detailed to make sure that we're on the same page. At this point, we're in a red state in the TDD cycle. Use the following steps to create our first successful test: Now, create the following class in a file named NumberGuesser.py: class NumberGuesser: """Guesses numbers based on the history of your input"" Import the new class at the top of my test file with a simple import NumberGuesser statement. I rerun nosetests, and get the following: TypeError: 'module' object is not callable Oh whoops! I guess that's not the right way to import the class. This is another very tiny step, but what is important is that we are making forward progress through constant communication with our tests. We are going through extreme detail because I can't stress this point enough; bear with me for the time being. 
Change the import statement to the following: from NumberGuesser import NumberGuesser Rerun nosetests and you will see the following: AttributeError: NumberGuesser instance has no attribute 'guess' The error message has changed, and is leading to the next thing that needs to be changed. From here, just implement what we think we need for the test to pass: class NumberGuesser: """Guesses numbers based on the history of your input""" def guess(self): return None On rerunning the nosetests, we'll get the following result: That's it! Our first successful test! Some of these steps seem so tiny so as to not being worthwhile. Indeed, overtime, you may decide that you prefer to work on a different level of detail. For the sake of argument, we'll be keeping our steps pretty small if only to illustrate just how much TDD keeps us on track and guides us on what to do next. We all know how to write the code in very large, uncontrolled steps. Learning to code surgically requires intentional practice, and is worth doing explicitly. Lets take a step back and look at what this first round of testing took. Anatomy of a test Starting from a higher level, notice how I had a dialog with Python. I just wrote the test and Python complained that the class that I was testing didn't exist. Next, I created the class, but then Python complained that I didn't import it correctly. So then, I imported it correctly, and Python complained that my guess method didn't exist. In response, I implemented the way that my test expected, and Python stopped complaining. This is the spirit of TDD. You have a conversation between you and your system. You can work in steps as little or as large as you're comfortable with. What I did previously could've been entirely skipped over, and the Python class could have been written and imported correctly the first time. The longer you go without talking to the system, the more likely you are to stray from the path to getting things working as simply as possible. Lets zoom in a little deeper and dissect this simple test to see what makes it tick. Here is the same test, but I've commented it, and broken it into sections that you will see recurring in every test that you write: def given_no_information_when_asked_to_guess_test(): # given number_guesser = NumberGuesser() # when guessed_number = number_guesser.guess() # then assert guessed_number is None, 'there should be no guess.' Given This section sets up the context for the test. In the previous test, you acquired that I didn't provide any prior information to the object. In many of our machine learning tests, this will be the most complex portion of our test. We will be importing certain sets of data, sometimes making a few specific issues in the data and testing our software to handle the details that we would expect. When you think about this section of your tests, try to frame it as Given this scenario… In our test, we might say Given no prior information for NumberGuesser… When This should be one of the simplest aspects of our test. Once you've set up the context, there should be a simple action that triggers the behavior that you want to test. When you think about this section of your tests, try to frame it as When this happens… In our test we might say When NumberGuesser guesses a number… Then This section of our test will check on the state of our variables and any return result if applicable. 
Again, this section should also be fairly straightforward, as there should be only a single action that causes a change in the object under test. The reason for this is that if it takes two actions to form a test, then it is very likely that we will just want to combine the two into a single action that we can describe in terms that are meaningful in our domain. A key example may be loading the training data from a file and training a classifier. If we find ourselves doing this a lot, then why not just create a method that loads data from a file for us? In the book, you will find examples where we'll have helper functions help us determine whether our results have changed in certain ways. Typically, we should view these helper functions as code smells. Remember that our tests are the first applications of our software. Anything that we have to build in addition to our code, to understand the results, is something that we should probably (there are exceptions to every rule) just include in the code we are testing.
Given, When, Then is not a strong requirement of TDD, because our previous definition of TDD only consisted of two things (all that it requires is a failing test first and the elimination of duplication). It's a small thing to be passionate about, and if it doesn't speak to you, just translate this back into Arrange, Act, Assert in your head. At the very least, consider it, as well as why these specific, very deliberate words are used.
Applied to machine learning
At this point, you may be wondering how TDD will be used in machine learning, and whether we use it on regression or classification problems. In every machine learning algorithm, there exists a way to quantify the quality of what you're doing. In linear regression, it's your adjusted R2 value; in classification problems, it's an ROC curve (and the area beneath it) or a confusion matrix, and more. All of these are testable quantities. Of course, none of these quantities have a built-in way of saying that the algorithm is good enough. We can get around this by starting our work on every problem by first building up a completely naïve and ignorant algorithm. The scores that we get for this will basically represent plain, old, random chance. Once we have built an algorithm that can beat our random chance scores, we just start iterating, attempting to beat the next highest score that we achieve. Benchmarking algorithms is an entire field in its own right that can be delved into more deeply. In the book, we will implement a naïve algorithm to get a random chance score, and we will build up a small test suite that we can then use to pit this model against another. This will allow us to have a conversation with our machine learning models in the same manner as we had with Python earlier.
For a professional machine learning developer, it's quite likely that an ideal metric to test is a profitability model that compares risk (monetary exposure) to expected value (profit). This can help us keep a balanced view of how much error, and what kind of error, we can tolerate. In machine learning, we will never have a perfect model, and we could search for the rest of our lives for the best model. By finding a way to work your financial assumptions into the model, we will have an improved ability to decide between competing models.
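The book builds this comparison up with its own data and models; purely as an illustration of the shape such a test can take (and not the author's code), here is a sketch that pits a random-chance baseline against a stand-in candidate model, assuming scikit-learn is installed for the roc_auc_score helper:

# Illustrative sketch only: the data and the "model" are invented stand-ins.
import random
from sklearn.metrics import roc_auc_score

def load_evaluation_data():
    # Stand-in data: each example is a single number, and the label is 1 when it is large.
    examples = [random.uniform(0, 1) for _ in range(200)]
    labels = [1 if x > 0.5 else 0 for x in examples]
    return examples, labels

def naive_scores(examples):
    # The completely naive and ignorant baseline: score every example at random.
    return [random.random() for _ in examples]

def candidate_scores(examples):
    # Stand-in for a real model: it simply uses the feature itself as the score.
    return list(examples)

def model_beats_naive_baseline_test():
    # given some labelled evaluation data
    examples, labels = load_evaluation_data()
    # when both the baseline and the candidate score the same examples
    baseline_auc = roc_auc_score(labels, naive_scores(examples))
    candidate_auc = roc_auc_score(labels, candidate_scores(examples))
    # then the candidate must beat random chance
    assert candidate_auc > baseline_auc, 'The model should outperform the naive baseline.'

As better models come along, the baseline in a test like this simply becomes the best score achieved so far, which is the iterative beat-your-last-score loop described above.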
Summary
In this article, you were introduced to TDD as well as BDD. With these concepts introduced, you have a basic foundation with which to approach machine learning. We saw that specifying behavior in the form of sentences makes for an easier-to-read set of specifications for your software. Building off that foundation, we started to delve into testing at a higher level. We did this by establishing concepts that we can use to quantify classifiers: the ROC curve and the AUC metric. Now that we've seen that different models can be quantified, it follows that they can be compared. Putting all of this together, we have everything we need to explore machine learning with a test-driven methodology.
Resources for Article: Further resources on this subject: Optimization in Python [article] How to do Machine Learning with Python [article] Modeling complex functions with artificial neural networks [article]

Transactions and Operators

Packt
13 Oct 2015
14 min read
In this article by Emilien Kenler and Federico Razzoli, authors of the book MariaDB Essentials, we will look briefly at transactions and operators. (For more resources related to this topic, see here.)
Understanding transactions
A transaction is a sequence of SQL statements that are grouped into a single logical operation. Its purpose is to guarantee the integrity of data. If a transaction fails, no change will be applied to the databases. If a transaction succeeds, all of its statements will succeed. Take a look at the following example:
START TRANSACTION;
SELECT quantity FROM product WHERE id = 42;
UPDATE product SET quantity = quantity - 10 WHERE id = 42;
UPDATE customer SET money = money - (SELECT price FROM product WHERE id = 42) WHERE id = 512;
INSERT INTO product_order (product_id, quantity, customer_id) VALUES (42, 10, 512);
COMMIT;
We haven't yet discussed some of the statements used in this example. However, they are not important for understanding transactions. This sequence of statements occurs when a customer (whose id is 512) orders a product (whose id is 42). As a consequence, we need to execute the following suboperations in our database:
Check whether the desired quantity of the product is available. If not, we should not proceed.
Decrease the available quantity of items for the product that is being bought.
Decrease the amount of money in the online account of our customer.
Register the order so that the product is delivered to our customer.
These suboperations form a more complex operation. When a session is executing this operation, we do not want other connections to interfere. Consider the following scenario:
Connection A checks how many items of the product with the ID 42 are available. Only one is available, but it is enough.
Immediately after, connection B checks the availability of the same product. It finds that one is available.
Connection A decreases the quantity of the product. Now, it is 0.
Connection B decreases the same number. Now, it is -1.
Both connections create an order. Two people will pay for the same product; however, only one is available.
This is something we definitely want to avoid. However, there is another situation that we want to avoid. Imagine that the server crashes immediately after the customer's money is deducted. The order will not be written to the database, so the customer will end up paying for something he will not receive.
Fortunately, transactions prevent both these situations. They protect our database writes in two ways:
During a transaction, relevant data is locked or copied. In both these cases, two connections will not be able to modify the same rows at the same time.
The writes will not be made effective until the COMMIT command is issued. This means that if the server crashes during the transaction, all the suboperations will be rolled back. We will not have inconsistent data (such as a payment for a product that will not be delivered).
In this example, the transaction starts when we issue the START TRANSACTION command. Then, any number of operations can be performed. The COMMIT command makes the changes effective. This does not mean that if a statement fails with an error, the transaction is always aborted. In many cases, the application will receive an error and will be free to decide whether the transaction should be aborted or not. To abort the current transaction, an application can execute the ROLLBACK command.
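The article stays at the SQL prompt, but the same transaction would typically be driven from application code. Purely as an illustration (not from the book), here is a sketch using the PyMySQL driver; the package itself and the host, credentials, and database name are assumptions:

# Hypothetical sketch: requires the PyMySQL package and a reachable MariaDB server.
import pymysql

conn = pymysql.connect(host="localhost", user="app", password="secret",
                       database="shop", autocommit=False)
try:
    with conn.cursor() as cur:
        cur.execute("START TRANSACTION")
        cur.execute("SELECT quantity FROM product WHERE id = 42")
        (quantity,) = cur.fetchone()
        if quantity < 10:
            conn.rollback()   # not enough stock: abort, nothing is changed
        else:
            cur.execute("UPDATE product SET quantity = quantity - 10 WHERE id = 42")
            cur.execute("UPDATE customer SET money = money - "
                        "(SELECT price FROM product WHERE id = 42) WHERE id = 512")
            cur.execute("INSERT INTO product_order (product_id, quantity, customer_id) "
                        "VALUES (42, 10, 512)")
            conn.commit()     # make all the writes effective at once
except Exception:
    conn.rollback()           # on any error, no partial changes are applied
    raise
finally:
    conn.close()

If the process dies between the UPDATE statements, nothing is committed, which is exactly the guarantee described above.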
A transaction can consist of only one statement. This perfectly makes sense because the server could crash in the middle of the statement's execution.
The autocommit mode
In many cases, we don't want to group multiple statements in a transaction. When a transaction consists of only one statement, sending the START TRANSACTION and COMMIT statements can be annoying. For this reason, MariaDB has the autocommit mode. By default, the autocommit mode is ON. Unless a START TRANSACTION command is explicitly used, the autocommit mode causes an implicit commit after each statement. Thus, every statement is executed in a separate transaction by default. When the autocommit mode is OFF, a new transaction implicitly starts after each commit, and the COMMIT command needs to be issued explicitly. To turn autocommit ON or OFF, we can use the @@autocommit server variable as follows:
MariaDB [mwa]> SET @@autocommit = OFF;
Query OK, 0 rows affected (0.00 sec)
MariaDB [mwa]> SELECT @@autocommit;
+--------------+
| @@autocommit |
+--------------+
| 0 |
+--------------+
1 row in set (0.00 sec)
Transaction limitations in MariaDB
Transaction handling is not implemented in the core of MariaDB; instead, it is left to the storage engines. Many storage engines, such as MyISAM or MEMORY, do not implement it at all. Some of the transactional storage engines are:
InnoDB
XtraDB
TokuDB
In a sense, Aria tables are partially transactional. Although Aria ignores commands such as START TRANSACTION, COMMIT, and ROLLBACK, each statement is somewhat of a transaction. In fact, if it writes, modifies, or deletes multiple rows, the operation completely succeeds or fails, which is similar to a transaction.
Only statements that modify data can be used in a transaction. Statements that modify a table structure (such as ALTER TABLE) implicitly commit the current transaction.
Sometimes, we may not be sure whether a transaction is active or not. Usually, this happens because we are not sure whether autocommit is set to ON, or because we are not sure whether the latest statement implicitly committed a transaction. In these cases, the @@in_transaction variable can help us. Its value is 1 if a transaction is active and 0 if it is not. Here is an example:
MariaDB [mwa]> START TRANSACTION;
Query OK, 0 rows affected (0.00 sec)
MariaDB [mwa]> SELECT @@in_transaction;
+------------------+
| @@in_transaction |
+------------------+
| 1 |
+------------------+
1 row in set (0.00 sec)
MariaDB [mwa]> DROP TABLE IF EXISTS t;
Query OK, 0 rows affected, 1 warning (0.00 sec)
MariaDB [mwa]> SELECT @@in_transaction;
+------------------+
| @@in_transaction |
+------------------+
| 0 |
+------------------+
1 row in set (0.00 sec)
InnoDB is optimized to execute a huge number of short transactions. If our databases are busy and performance is important to us, we should try to avoid big transactions, both in terms of the number of statements and of execution time. This is particularly true if we have several concurrent connections that read the same tables.
Working with operators
In our examples, we have used several operators, such as equals (=) and less-than and greater-than (<, >). Now, it is time to discuss operators in general and list the most important ones. In general, an operator is a sign that takes one or more operands and returns a result. Several groups of operators exist in MariaDB. In this article, we will discuss the main types:
Comparison operators
String operators
Logical operators
Arithmetic operators
Comparison operators
A comparison operator checks whether a certain relation holds between its operands. If the relationship exists, the operator returns 1; otherwise, it returns 0. For example, let's take the equality operator, which is probably the most used:
1 = 1 -- returns 1: the equality relationship exists
1 = 0 -- returns 0: no equality relationship here
In MariaDB, 1 and 0 are used in many contexts to indicate whether something is true or false. In fact, MariaDB does not have a Boolean data type, so TRUE and FALSE are merely used as aliases for 1 and 0:
TRUE = 1 -- returns 1
FALSE = 0 -- returns 1
TRUE = FALSE -- returns 0
In a WHERE clause, a result of 0 or NULL prevents a row from being shown. All numeric results other than 0, including negative numbers, are regarded as true in this context. Non-numeric values other than NULL need to be converted to numbers in order to be evaluated by the WHERE clause. Non-numeric strings are converted to 0, whereas numeric strings are treated as numbers. Dates are converted to nonzero numbers. Consider the following example:
WHERE 1 -- is redundant; it shows all the rows
WHERE 0 -- prevents all the rows from being shown
Now, let's take a look at the following MariaDB comparison operators:
= : equality (A = B)
!= : inequality (A != B)
<> : synonym for != (A <> B)
< : less than (A < B)
> : greater than (A > B)
<= : less than or equal to (A <= B)
>= : greater than or equal to (A >= B)
IS NULL : the operand is NULL (A IS NULL)
IS NOT NULL : the operand is not NULL (A IS NOT NULL)
<=> : the operands are equal, or both are NULL (A <=> B)
BETWEEN ... AND : the left operand is within a range of values (A BETWEEN B AND C)
NOT BETWEEN ... AND : the left operand is outside the specified range (A NOT BETWEEN B AND C)
IN : the left operand is one of the items in a given list (A IN (B, C, D))
NOT IN : the left operand is not in the given list (A NOT IN (B, C, D))
Here are a couple of examples:
SELECT id FROM product WHERE price BETWEEN 100 AND 200;
DELETE FROM product WHERE id IN (100, 101, 102);
Special attention should be paid to NULL values. Almost all the preceding operators return NULL if any of their operands is NULL. The reason is quite clear: as NULL represents an unknown value, any operation involving a NULL operand returns an unknown result. However, there are some operators specifically designed to work with NULL values. IS NULL and IS NOT NULL check whether the operand is NULL. The <=> operator is a shortcut for the following code:
a = b OR (a IS NULL AND b IS NULL)
String operators
MariaDB supports certain comparison operators that are specifically designed to work with string values. This does not mean that the other operators do not work well with strings. For example, A = B works perfectly if A and B are strings. However, some particular comparisons only make sense with text values. Let's take a look at them.
The LIKE operator and its variants
This operator is often used to check whether a string starts with a given sequence of characters, ends with that sequence, or contains the sequence. More generally, LIKE checks whether a string follows a given pattern.
Its syntax is: <string_value> LIKE <pattern> The pattern is a string that can contain the following wildcard characters: _ (underscore) means: This specifies any character %: This denotes any sequence of 0 or more characters There is also a way to include these characters without their special meaning: the _ and % sequences represent the a_ and a% characters respectively. For example, take a look at the following expressions: my_text LIKE 'h_' my_text LIKE 'h%' The first expression returns 1 for 'hi', 'ha', or 'ho', but not for 'hey'. The second expression returns 1 for all these strings, including 'hey'. By default, LIKE is case insensitive, meaning that 'abc' LIKE 'ABC' returns 1. Thus, it can be used to perform a case insensitive equality check. To make LIKE case sensitive, the following BINARY keyword can be used: my_text LIKE BINARY your_text The complement of LIKE is NOT LIKE, as shown in the following code: <string_value> NOT LIKE <pattern> Here are the most common uses for LIKE: my_text LIKE 'my%' -- does my_text start with 'my'? my_text LIKE '%my' -- does my_text end with 'my'? my_text LIKE '%my%' -- does my_text contain 'my'? More complex uses are possible for LIKE. For example, the following expression can be used to check whether mail is a valid e-mail address: mail LIKE '_%@_%.__%' The preceding code snippet checks whether mail contains at least one character, a '@' character, at least one character, a dot, at least two characters in this order. In most cases, an invalid e-mail address will not pass this test. Using regular expressions with the REGEXP operator and its variants Regular expressions are string patterns that contain a meta character with special meanings in order to perform match operations and determine whether a given string matches the given pattern or not. The REGEXP operator is somewhat similar to LIKE. It checks whether a string matches a given pattern. However, REGEXP uses regular expressions with the syntax defined by the POSIX standard. Basically, this means that: Many developers, but not all, already know their syntax REGEXP uses a very expressive syntax, so the patterns can be much more complex and detailed REGEXP is much slower than LIKE; this should be preferred when possible The regular expressions syntax is a complex topic, and it cannot be covered in this article. Developers can learn about regular expressions at www.regular-expressions.info. The complement of REGEXP is NOT REGEXP. Logical operators Logical operators can be used to combine truth expressions that form a compound expression that can be true, false, or NULL. Depending on the truth values of its operands, a logical operator can return 1 or 0. MariaDB supports the following logical operators: NOT; AND; OR; XOR The NOT operator NOT is the only logical operator that takes one operand. It inverts its truth value. If the operand is true, NOT returns 0, and if the operand is false, NOT returns 1. If the operand is NULL, NOT returns NULL. Here is an example: NOT 1 -- returns 0 NOT 0 -- returns 1 NOT 1 = 1 -- returns 0 NOT 1 = NULL -- returns NULL NOT 1 <=> NULL -- returns 0 The AND operator AND returns 1 if both its operands are true and 0 in all other cases. Here is an example: 1 AND 1 -- returns 1 0 AND 1 -- returns 0 0 AND 0 -- returns 0 The OR operator OR returns 1 if at least one of its operators is true or 0 if both the operators are false. Here is an example: 1 OR 1 -- returns 1 0 OR 1 -- returns 1 0 OR 0 -- returns 0 The XOR operator XOR stands for eXclusive OR. 
It is the least used logical operator. It returns 1 if exactly one of its operands is true, and 0 if both operands are true or both are false. Take a look at the following example:
1 XOR 1 -- returns 0
1 XOR 0 -- returns 1
0 XOR 1 -- returns 1
0 XOR 0 -- returns 0
A XOR B is the equivalent of the following expression:
(A OR B) AND NOT (A AND B)
Or:
(NOT A AND B) OR (A AND NOT B)
Arithmetic operators
MariaDB supports the operators that are necessary to execute all the basic arithmetic operations. The supported arithmetic operators are:
+ for addition
- for subtraction
* for multiplication
/ for division
Depending on the MariaDB configuration, remember that a division by 0 either raises an error or returns NULL. In addition, two more operators are useful for divisions:
DIV: This returns the integer part of a division, without any decimal part or remainder
MOD or %: This returns the remainder of a division
Here is an example:
MariaDB [(none)]> SELECT 20 DIV 3 AS int_part, 20 MOD 3 AS modulus;
+----------+---------+
| int_part | modulus |
+----------+---------+
| 6 | 2 |
+----------+---------+
1 row in set (0.00 sec)
Operator precedence
MariaDB does not blindly evaluate an expression from left to right. Every operator has a given precedence. An operator that is evaluated before another one is said to have a higher precedence. In general, arithmetic and string operators have a higher priority than logical operators. The precedence of arithmetic operators reflects their precedence in common mathematical expressions. It is very important to remember the precedence of logical operators (from the highest to the lowest):
NOT
AND
XOR
OR
MariaDB supports many operators, and we did not discuss all of them. Also, the exact precedence can vary slightly depending on the MariaDB configuration. The complete precedence can be found in the MariaDB KnowledgeBase, at https://mariadb.com/kb/en/mariadb/documentation/functions-and-operators/operator-precedence/. Parentheses can be used to force MariaDB to follow a certain order. They are also useful when we do not remember the exact precedence of the operators that we want to use, as shown in the following code:
(NOT (a AND b)) OR c OR d
Summary
In this article, you learned the basics of transactions and operators.
Resources for Article: Further resources on this subject: Set Up MariaDB [Article] Installing MariaDB on Windows and Mac OS X [Article] Building a Web Application with PHP and MariaDB – Introduction to caching [Article]

Securing Your Data

Packt
12 Oct 2015
6 min read
In this article by Tyson Cadenhead, author of Socket.IO Cookbook, we will explore several topics related to security in Socket.IO applications. These topics will cover the gambit, from authentication and validation to how to use the wss:// protocol for secure WebSockets. As the WebSocket protocol opens innumerable opportunities to communicate more directly between the client and the server, people often wonder if Socket.IO is actually as secure as something such as the HTTP protocol. The answer to this question is that it depends entirely on how you implement it. WebSockets can easily be locked down to prevent malicious or accidental security holes, but as with any API interface, your security is only as tight as your weakest link. In this article, we will cover the following topics: Locking down the HTTP referrer Using secure WebSockets (For more resources related to this topic, see here.) Locking down the HTTP referrer Socket.IO is really good at getting around cross-domain issues. You can easily include the Socket.IO script from a different domain on your page, and it will just work as you may expect it to. There are some instances where you may not want your Socket.IO events to be available on every other domain. Not to worry! We can easily whitelist only the http referrers that we want so that some domains will be allowed to connect and other domains won't. How To Do It… To lock down the HTTP referrer and only allow events to whitelisted domains, follow these steps: Create two different servers that can connect to our Socket.IO instance. We will let one server listen on port 5000 and the second server listen on port 5001: var express = require('express'), app = express(), http = require('http'), socketIO = require('socket.io'), server, server2, io; app.get('/', function (req, res) { res.sendFile(__dirname + '/index.html'); }); server = http.Server(app); server.listen(5000); server2 = http.Server(app); server2.listen(5001); io = socketIO(server); When the connection is established, check the referrer in the headers. If it is a referrer that we want to give access to, we can let our connection perform its tasks and build up events as normal. If a blacklisted referrer, such as the one on port 5001 that we created, attempts a connection, we can politely decline and perhaps throw an error message back to the client, as shown in the following code: io.on('connection', function (socket) { switch (socket.request.headers.referer) { case 'http://localhost:5000/': socket.emit('permission.message', 'Okay, you're cool.'); break; default: returnsocket.emit('permission.message', 'Who invited you to this party?'); break; } }); On the client side, we can listen to the response from the server and react as appropriate using the following code: socket.on('permission.message', function (data) { document.querySelector('h1').innerHTML = data; }); How It Works… The referrer is always available in the socket.request.headers object of every socket, so we will be able to inspect it there to check whether it was a trusted source. In our case, we will use a switch statement to whitelist our domain on port 5000, but we could really use any mechanism at our disposal to perform the task. For example, if we need to dynamically whitelist domains, we can store a list of them in our database and search for it when the connection is established. Using secure WebSockets WebSocket communications can either take place over the ws:// protocol or the wss:// protocol. 
They can be thought of as the WebSocket counterparts of the HTTP and HTTPS protocols, in the sense that one is secure and the other isn't. Secure WebSockets are encrypted by the transport layer, so they are safer to use when you handle sensitive data. In this recipe, you will learn how to force our Socket.IO communications to happen over the wss:// protocol for an extra layer of encryption.

Getting Ready…

In this recipe, we will need to create a self-signed certificate so that we can serve our app locally over the HTTPS protocol. For this, we will need an npm package called pem, which allows you to create a self-signed certificate that you can provide to your server. Of course, in a real production environment, we would want a true SSL certificate instead of a self-signed one. To install pem, simply call npm install pem --save. As our certificate is self-signed, your browser will probably show a security warning when you navigate to your secure server. Just click on the Proceed to localhost link, and you'll see your application load using the HTTPS protocol.

How To Do It…

To use the secure wss:// protocol, follow these steps:

1. First, create a secure server using the built-in Node https module. We can create a self-signed certificate with the pem package so that we can serve our application over HTTPS instead of HTTP, as shown in the following code:

var https = require('https'),
    pem = require('pem'),
    express = require('express'),
    app = express(),
    socketIO = require('socket.io');

// Create a self-signed certificate with pem
pem.createCertificate({ days: 1, selfSigned: true }, function (err, keys) {

  app.get('/', function (req, res) {
    res.sendFile(__dirname + '/index.html');
  });

  // Create an https server with the certificate and key from pem
  var server = https.createServer({
    key: keys.serviceKey,
    cert: keys.certificate
  }, app).listen(5000);

  var io = socketIO(server);

  io.on('connection', function (socket) {
    var protocol = 'ws://';

    // Check the handshake to determine if it was secure or not
    if (socket.handshake.secure) {
      protocol = 'wss://';
    }

    socket.emit('hello.client', {
      message: 'This is a message from the server. It was sent using the ' + protocol + ' protocol'
    });
  });
});

2. In your client-side JavaScript, specify secure: true when you initialize your WebSocket as follows:

var socket = io('//localhost:5000', { secure: true });

socket.on('hello.client', function (data) {
  console.log(data);
});

3. Now, start your server and navigate to https://localhost:5000. Proceed past the certificate warning, and you should see a message in your browser developer tools that reads: "This is a message from the server. It was sent using the wss:// protocol".

How It Works…

The protocol of our WebSocket is actually set automatically based on the protocol of the page that it sits on. This means that a page served over the HTTP protocol will send its WebSocket communications over ws:// by default, and a page served over HTTPS will default to using the wss:// protocol. However, by setting the secure option to true, we told the WebSocket to always use wss://, no matter what.

Summary

In this article, we gave you an overview of the topics related to security in Socket.IO applications.

Further resources on this subject:

Using Socket.IO and Express together
Adding Real-time Functionality Using Socket.io
Welcome to JavaScript in the full stack

Basics of Jupyter Notebook and Python

Packt Editorial Staff
11 Oct 2015
28 min read
In this article by Cyrille Rossant, coming from his book, Learning IPython for Interactive Computing and Data Visualization - Second Edition, we will see how to use IPython console, Jupyter Notebook, and we will go through the basics of Python. Originally, IPython provided an enhanced command-line console to run Python code interactively. The Jupyter Notebook is a more recent and more sophisticated alternative to the console. Today, both tools are available, and we recommend that you learn to use both. [box type="note" align="alignleft" class="" width=""]The first chapter of the book, Chapter 1, Getting Started with IPython, contains all installation instructions. The main step is to download and install the free Anaconda distribution at https://www.continuum.io/downloads (the version of Python 3 64-bit for your operating system).[/box] Launching the IPython console To run the IPython console, type ipython in an OS terminal. There, you can write Python commands and see the results instantly. Here is a screenshot: IPython console The IPython console is most convenient when you have a command-line-based workflow and you want to execute some quick Python commands. You can exit the IPython console by typing exit. [box type="note" align="alignleft" class="" width=""]Let's mention the Qt console, which is similar to the IPython console but offers additional features such as multiline editing, enhanced tab completion, image support, and so on. The Qt console can also be integrated within a graphical application written with Python and Qt. See http://jupyter.org/qtconsole/stable/ for more information.[/box] Launching the Jupyter Notebook To run the Jupyter Notebook, open an OS terminal, go to ~/minibook/ (or into the directory where you've downloaded the book's notebooks), and type jupyter notebook. This will start the Jupyter server and open a new window in your browser (if that's not the case, go to the following URL: http://localhost:8888). Here is a screenshot of Jupyter's entry point, the Notebook dashboard: The Notebook dashboard [box type="note" align="alignleft" class="" width=""]At the time of writing, the following browsers are officially supported: Chrome 13 and greater; Safari 5 and greater; and Firefox 6 or greater. Other browsers may work also. Your mileage may vary.[/box] The Notebook is most convenient when you start a complex analysis project that will involve a substantial amount of interactive experimentation with your code. Other common use-cases include keeping track of your interactive session (like a lab notebook), or writing technical documents that involve code, equations, and figures. In the rest of this section, we will focus on the Notebook interface. [box type="note" align="alignleft" class="" width=""]Closing the Notebook server To close the Notebook server, go to the OS terminal where you launched the server from, and press Ctrl + C. You may need to confirm with y.[/box] The Notebook dashboard The dashboard contains several tabs which are as follows: Files: shows all files and notebooks in the current directory Running: shows all kernels currently running on your computer Clusters: lets you launch kernels for parallel computing A notebook is an interactive document containing code, text, and other elements. A notebook is saved in a file with the .ipynb extension. This file is a plain text file storing a JSON data structure. A kernel is a process running an interactive session. When using IPython, this kernel is a Python process. 
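Because a notebook file is just JSON text, as noted above, you can peek inside one with nothing but the standard library. Here is a minimal sketch, not taken from the original article, assuming a notebook named example.ipynb (a hypothetical file name) in the current directory and the version 4 notebook format, where cells are stored under a top-level "cells" key:

import json

# Hypothetical file name: replace it with one of your own notebooks.
with open('example.ipynb') as f:
    nb = json.load(f)  # a notebook is a plain JSON document

# Count the cells by type (code, markdown, and so on).
counts = {}
for cell in nb['cells']:
    counts[cell['cell_type']] = counts.get(cell['cell_type'], 0) + 1
print(counts)

Printing counts shows, for example, how many code cells and how many Markdown cells the notebook contains.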
There are kernels in many languages other than Python. [box type="note" align="alignleft" class="" width=""]We follow the convention to use the term notebook for a file, and Notebook for the application and the web interface.[/box] In Jupyter, notebooks and kernels are strongly separated. A notebook is a file, whereas a kernel is a process. The kernel receives snippets of code from the Notebook interface, executes them, and sends the outputs and possible errors back to the Notebook interface. Thus, in general, the kernel has no notion of the Notebook. A notebook is persistent (it's a file), whereas a kernel may be closed at the end of an interactive session and it is therefore not persistent. When a notebook is re-opened, it needs to be re-executed. In general, no more than one Notebook interface can be connected to a given kernel. However, several IPython consoles can be connected to a given kernel. The Notebook user interface To create a new notebook, click on the New button, and select Notebook (Python 3). A new browser tab opens and shows the Notebook interface as follows: A new notebook Here are the main components of the interface, from top to bottom: The notebook name, which you can change by clicking on it. This is also the name of the .ipynb file. The Menu bar gives you access to several actions pertaining to either the notebook or the kernel. To the right of the menu bar is the Kernel name. You can change the kernel language of your notebook from the Kernel menu. The Toolbar contains icons for common actions. In particular, the dropdown menu showing Code lets you change the type of a cell. Following is the main component of the UI: the actual Notebook. It consists of a linear list of cells. We will detail the structure of a cell in the following sections. Structure of a notebook cell There are two main types of cells: Markdown cells and code cells, and they are described as follows: A Markdown cell contains rich text. In addition to classic formatting options like bold or italics, we can add links, images, HTML elements, LaTeX mathematical equations, and more. A code cell contains code to be executed by the kernel. The programming language corresponds to the kernel's language. We will only use Python in this book, but you can use many other languages. You can change the type of a cell by first clicking on a cell to select it, and then choosing the cell's type in the toolbar's dropdown menu showing Markdown or Code. Markdown cells Here is a screenshot of a Markdown cell: A Markdown cell The top panel shows the cell in edit mode, while the bottom one shows it in render mode. The edit mode lets you edit the text, while the render mode lets you display the rendered cell. We will explain the differences between these modes in greater detail in the following section. Code cells Here is a screenshot of a complex code cell: Structure of a code cell This code cell contains several parts, as follows: The Prompt number shows the cell's number. This number increases every time you run the cell. Since you can run cells of a notebook out of order, nothing guarantees that code numbers are linearly increasing in a given notebook. The Input area contains a multiline text editor that lets you write one or several lines of code with syntax highlighting. The Widget area may contain graphical controls; here, it displays a slider. 
The Output area can contain multiple outputs, here: Standard output (text in black) Error output (text with a red background) Rich output (an HTML table and an image here) The Notebook modal interface The Notebook implements a modal interface similar to some text editors such as vim. Mastering this interface may represent a small learning curve for some users. Use the edit mode to write code (the selected cell has a green border, and a pen icon appears at the top right of the interface). Click inside a cell to enable the edit mode for this cell (you need to double-click with Markdown cells). Use the command mode to operate on cells (the selected cell has a gray border, and there is no pen icon). Click outside the text area of a cell to enable the command mode (you can also press the Esc key). Keyboard shortcuts are available in the Notebook interface. Type h to show them. We review here the most common ones (for Windows and Linux; shortcuts for Mac OS X may be slightly different). Keyboard shortcuts available in both modes Here are a few keyboard shortcuts that are always available when a cell is selected: Ctrl + Enter: run the cell Shift + Enter: run the cell and select the cell below Alt + Enter: run the cell and insert a new cell below Ctrl + S: save the notebook Keyboard shortcuts available in the edit mode In the edit mode, you can type code as usual, and you have access to the following keyboard shortcuts: Esc: switch to command mode Ctrl + Shift + -: split the cell Keyboard shortcuts available in the command mode In the command mode, keystrokes are bound to cell operations. Don't write code in command mode or unexpected things will happen! For example, typing dd in command mode will delete the selected cell! Here are some keyboard shortcuts available in command mode: Enter: switch to edit mode Up or k: select the previous cell Down or j: select the next cell y / m: change the cell type to code cell/Markdown cell a / b: insert a new cell above/below the current cell x / c / v: cut/copy/paste the current cell dd: delete the current cell z: undo the last delete operation Shift + =: merge the cell below h: display the help menu with the list of keyboard shortcuts Spending some time learning these shortcuts is highly recommended. References Here are a few references: Main documentation of Jupyter at http://jupyter.readthedocs.org/en/latest/ Jupyter Notebook interface explained at http://jupyter-notebook.readthedocs.org/en/latest/notebook.html A crash course on Python If you don't know Python, read this section to learn the fundamentals. Python is a very accessible language and is even taught to school children. If you have ever programmed, it will only take you a few minutes to learn the basics. Hello world Open a new notebook and type the following in the first cell: In [1]: print("Hello world!") Out[1]: Hello world! Here is a screenshot: "Hello world" in the Notebook [box type="note" align="alignleft" class="" width=""]Prompt string Note that the convention chosen in this article is to show Python code (also called the input) prefixed with In [x]: (which shouldn't be typed). This is the standard IPython prompt. Here, you should just type print("Hello world!") and then press Shift + Enter.[/box] Congratulations! You are now a Python programmer. Variables Let's use Python as a calculator. In [2]: 2 * 2 Out[2]: 4 Here, 2 * 2 is an expression statement. This operation is performed, the result is returned, and IPython displays it in the notebook cell's output. 
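One related detail worth knowing: by default, only the value of the last expression in a cell is shown as the Out[] result, while anything passed to print() is always displayed. A minimal sketch of a single cell (the numbers are arbitrary):

2 * 2         # evaluated, but its value is not displayed
print(3 * 3)  # printed output: 9
5 * 5         # last expression in the cell: shown as the Out[] value, 25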
[box type="note" align="alignleft" class="" width=""]Division In Python 3, 3 / 2 returns 1.5 (floating-point division), whereas it returns 1 in Python 2 (integer division). This can be a source of errors when porting Python 2 code to Python 3. It is recommended to always use the explicit 3.0 / 2.0 for floating-point division (by using floating-point numbers) and 3 // 2 for integer division. Both syntaxes work in Python 2 and Python 3. See http://python3porting.com/differences.html#integer-division for more details.[/box]

Other built-in mathematical operators include +, -, and ** for exponentiation, among others. You will find more details at https://docs.python.org/3/reference/expressions.html#the-power-operator.

Variables form a fundamental concept of any programming language. A variable has a name and a value. Here is how to create a new variable in Python:

In [3]: a = 2

And here is how to use an existing variable:

In [4]: a * 3
Out[4]: 6

Several variables can be defined at once (this is called unpacking):

In [5]: a, b = 2, 6

There are different types of variables. Here, we have used a number (more precisely, an integer). Other important types include floating-point numbers to represent real numbers, strings to represent text, and booleans to represent True/False values. Here are a few examples:

In [6]: somefloat = 3.1415
        sometext = 'pi is about'  # You can also use double quotes.
        print(sometext, somefloat)  # Display several variables.
Out[6]: pi is about 3.1415

Note how we used the # character to write comments. Whereas Python discards the comments completely, adding comments in the code is important when the code is to be read by other humans (including yourself in the future).

String escaping

String escaping refers to the ability to insert special characters in a string. For example, how can you insert ' and ", given that these characters are used to delimit a string in Python code? The backslash is the go-to escape character in Python (and in many other languages too). Here are a few examples:

In [7]: print("Hello \"world\"")
        print("A list:\n* item 1\n* item 2")
        print("C:\\path\\on\\windows")
        print(r"C:\path\on\windows")
Out[7]: Hello "world"
        A list:
        * item 1
        * item 2
        C:\path\on\windows
        C:\path\on\windows

The special character \n is the new line (or line feed) character. To insert a backslash, you need to escape it, which explains why it needs to be doubled as \\. You can also disable escaping by using raw literals with an r prefix before the string, like in the last example above. In this case, backslashes are considered as normal characters. This is convenient when writing Windows paths, since Windows uses backslash separators instead of forward slashes like on Unix systems. A very common error on Windows is forgetting to escape backslashes in paths: writing "C:\path" may lead to subtle errors (for example, the \t in a path such as "C:\temp" would be interpreted as a tab character). You will find the list of special characters in Python at https://docs.python.org/3.4/reference/lexical_analysis.html#string-and-bytes-literals.

Lists

A list contains a sequence of items. You can concisely instruct Python to perform repeated actions on the elements of a list. Let's first create a list of numbers as follows:

In [8]: items = [1, 3, 0, 4, 1]

Note the syntax we used to create the list: square brackets [], and commas , to separate the items.
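As a small supplementary sketch (not part of the original text), here are a few other common operations on this list; none of them modify items, so the examples that follow still apply:

3 in items      # membership test: True, 3 is one of the elements
7 in items      # False
sorted(items)   # returns a new sorted list, [0, 1, 1, 3, 4]; items itself is unchanged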
The built-in function len() returns the number of elements in a list:

In [9]: len(items)
Out[9]: 5

[box type="note" align="alignleft" class="" width=""]Python comes with a set of built-in functions, including print(), len(), max(), functional routines like filter() and map(), and container-related routines like all(), any(), range(), and sorted(). You will find the full list of built-in functions at https://docs.python.org/3.4/library/functions.html.[/box]

Now, let's compute the sum of all elements in the list. Python provides a built-in function for this:

In [10]: sum(items)
Out[10]: 9

We can also access individual elements in the list, using the following syntax:

In [11]: items[0]
Out[11]: 1
In [12]: items[-1]
Out[12]: 1

Note that indexing starts at 0 in Python: the first element of the list is indexed by 0, the second by 1, and so on. Also, -1 refers to the last element, -2, to the penultimate element, and so on. The same syntax can be used to alter elements in the list:

In [13]: items[1] = 9
         items
Out[13]: [1, 9, 0, 4, 1]

We can access sublists with the following syntax:

In [14]: items[1:3]
Out[14]: [9, 0]

Here, 1:3 represents a slice going from element 1 included (this is the second element of the list) to element 3 excluded. Thus, we get a sublist with the second and third element of the original list. The first-included/last-excluded asymmetry leads to an intuitive treatment of overlaps between consecutive slices. Also, note that slicing a list returns a new list (a shallow copy of that portion of the original), not a view; changing elements in the sublist afterwards does not affect the original list.

Python provides several other types of containers:

Tuples are immutable and contain a fixed number of elements:

In [15]: my_tuple = (1, 2, 3)
         my_tuple[1]
Out[15]: 2

Dictionaries contain key-value pairs. They are extremely useful and common:

In [16]: my_dict = {'a': 1, 'b': 2, 'c': 3}
         print('a:', my_dict['a'])
Out[16]: a: 1
In [17]: print(my_dict.keys())
Out[17]: dict_keys(['c', 'a', 'b'])

There is no notion of order in a dictionary. However, the native collections module provides an OrderedDict structure that keeps the insertion order (see https://docs.python.org/3.4/library/collections.html).

Sets, like mathematical sets, contain distinct elements:

In [18]: my_set = set([1, 2, 3, 2, 1])
         my_set
Out[18]: {1, 2, 3}

A Python object is mutable if its value can change after it has been created. Otherwise, it is immutable. For example, a string is immutable; to change it, a new string needs to be created. A list, a dictionary, or a set is mutable; elements can be added or removed. By contrast, a tuple is immutable, and it is not possible to change the elements it contains without recreating the tuple. See https://docs.python.org/3.4/reference/datamodel.html for more details.

Loops

We can run through all elements of a list using a for loop:

In [19]: for item in items:
             print(item)
Out[19]: 1
         9
         0
         4
         1

There are several things to note here:

The for item in items syntax means that a temporary variable named item is created at every iteration. This variable contains the value of every item in the list, one at a time.
Note the colon : at the end of the for statement. Forgetting it will lead to a syntax error!
The statement print(item) will be executed for all items in the list.
Note the four spaces before print: this is called the indentation. You will find more details about indentation in the next subsection.
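As a short supplementary sketch, here is how the built-in sum() used earlier could be reproduced with such a loop, accumulating the result in a variable (for illustration only; the built-in function should be preferred):

total = 0
for item in items:
    total = total + item  # add the current element to the running total
print(total)              # the same value as sum(items)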
Python supports a concise syntax to perform a given operation on all elements of a list, as follows:

In [20]: squares = [item * item for item in items]
         squares
Out[20]: [1, 81, 0, 16, 1]

This is called a list comprehension. A new list is created here; it contains the squares of all numbers in the list. This concise syntax leads to highly readable and Pythonic code.

Indentation

Indentation refers to the spaces that may appear at the beginning of some lines of code. This is a particular aspect of Python's syntax. In most programming languages, indentation is optional and is generally used to make the code visually clearer. But in Python, indentation also has a syntactic meaning. Particular indentation rules need to be followed for Python code to be correct. In general, there are two ways to indent some text: by inserting a tab character (also referred to as \t), or by inserting a number of spaces (typically, four). It is recommended to use spaces instead of tab characters. Your text editor should be configured such that the Tab key on the keyboard inserts four spaces instead of a tab character. In the Notebook, indentation is automatically configured properly, so you shouldn't worry about this issue. The question only arises if you use another text editor for your Python code. Finally, what is the meaning of indentation? In Python, indentation delimits coherent blocks of code, for example, the contents of a loop, a conditional branch, a function, and other objects. Where other languages such as C or JavaScript use curly braces to delimit such blocks, Python uses indentation.

Conditional branches

Sometimes, you need to perform different operations on your data depending on some condition. For example, let's display all even numbers in our list:

In [21]: for item in items:
             if item % 2 == 0:
                 print(item)
Out[21]: 0
         4

Again, here are several things to note:

An if statement is followed by a boolean expression.
If a and b are two integers, the modulo operand a % b returns the remainder from the division of a by b. Here, item % 2 is 0 for even numbers, and 1 for odd numbers.
The equality is represented by a double equal sign == to avoid confusion with the assignment operator = that we use when we create variables.
Like with the for loop, the if statement ends with a colon :.
The part of the code that is executed when the condition is satisfied follows the if statement. It is indented. Indentation is cumulative: since this if is inside a for loop, there are eight spaces before the print(item) statement.

Python supports a concise syntax to select all elements in a list that satisfy certain properties. Here is how to create a sublist with only even numbers:

In [22]: even = [item for item in items if item % 2 == 0]
         even
Out[22]: [0, 4]

This is also a form of list comprehension.

Functions

Code is typically organized into functions. A function encapsulates part of your code. Functions allow you to reuse bits of functionality without copy-pasting the code. Here is a function that tells whether an integer number is even or not:

In [23]: def is_even(number):
             """Return whether an integer is even or not."""
             return number % 2 == 0

There are several things to note here:

A function is defined with the def keyword.
After def comes the function name. A general convention in Python is to only use lowercase characters, and separate words with an underscore _. A function name generally starts with a verb.
The function name is followed by parentheses, with one or several variable names called the arguments.
These are the inputs of the function. There is a single argument here, named number. No type is specified for the argument. This is because Python is dynamically typed; you could pass a variable of any type. This function would work fine with floating point numbers, for example (the modulo operation works with floating point numbers in addition to integers). The body of the function is indented (and note the colon : at the end of the def statement). There is a docstring wrapped by triple quotes """. This is a particular form of comment that explains what the function does. It is not mandatory, but it is strongly recommended to write docstrings for the functions exposed to the user. The return keyword in the body of the function specifies the output of the function. Here, the output is a Boolean, obtained from the expression number % 2 == 0. It is possible to return several values; just use a comma to separate them (in this case, a tuple of Booleans would be returned). Once a function is defined, it can be called like this: In [24]: is_even(3) Out[24]: False In [25]: is_even(4) Out[25]: True Here, 3 and 4 are successively passed as arguments to the function. Positional and keyword arguments A Python function can accept an arbitrary number of arguments, called positional arguments. It can also accept optional named arguments, called keyword arguments. Here is an example: In [26]: def remainder(number, divisor=2): return number % divisor The second argument of this function, divisor, is optional. If it is not provided by the caller, it will default to the number 2, as shown here: In [27]: remainder(5) Out[27]: 1 There are two equivalent ways of specifying a keyword argument when calling a function. They are as follows: In [28]: remainder(5, 3) Out[28]: 2 In [29]: remainder(5, divisor=3) Out[29]: 2 In the first case, 3 is understood as the second argument, divisor. In the second case, the name of the argument is given explicitly by the caller. This second syntax is clearer and less error-prone than the first one. Functions can also accept arbitrary sets of positional and keyword arguments, using the following syntax: In [30]: def f(*args, **kwargs): print("Positional arguments:", args) print("Keyword arguments:", kwargs) In [31]: f(1, 2, c=3, d=4) Out[31]: Positional arguments: (1, 2) Keyword arguments: {'c': 3, 'd': 4} Inside the function, args is a tuple containing positional arguments, and kwargs is a dictionary containing keyword arguments. Passage by assignment When passing a parameter to a Python function, a reference to the object is actually passed (passage by assignment): If the passed object is mutable, it can be modified by the function If the passed object is immutable, it cannot be modified by the function Here is an example: In [32]: my_list = [1, 2] def add(some_list, value): some_list.append(value) add(my_list, 3) my_list Out[32]: [1, 2, 3] The add() function modifies an object defined outside it (in this case, the object my_list); we say this function has side-effects. A function with no side-effects is called a pure function: it doesn't modify anything in the outer context, and it deterministically returns the same result for any given set of inputs. Pure functions are to be preferred over functions with side-effects. Knowing this can help you spot out subtle bugs. There are further related concepts that are useful to know, including function scopes, naming, binding, and more. 
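To make this difference concrete, here is a short supplementary sketch contrasting a function that mutates the list it receives with one that merely rebinds its local name; only the first has a visible side-effect:

def append_value(some_list, value):
    some_list.append(value)   # mutates the object that was passed in

def rebind(some_list, value):
    some_list = [value]       # rebinds the local name only; the caller is unaffected

my_list = [1, 2]
append_value(my_list, 3)
print(my_list)                # [1, 2, 3]
rebind(my_list, 99)
print(my_list)                # still [1, 2, 3]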
Here are a couple of links: Passage by reference at https://docs.python.org/3/faq/programming.html#how-do-i-write-a-function-with-output-parameters-call-by-reference Naming, binding, and scope at https://docs.python.org/3.4/reference/executionmodel.html Errors Let's discuss errors in Python. As you learn, you will inevitably come across errors and exceptions. The Python interpreter will most of the time tell you what the problem is, and where it occurred. It is important to understand the vocabulary used by Python so that you can more quickly find and correct your errors. Let's see the following example: In [33]: def divide(a, b): return a / b In [34]: divide(1, 0) Out[34]: --------------------------------------------------------- ZeroDivisionError Traceback (most recent call last) <ipython-input-2-b77ebb6ac6f6> in <module>() ----> 1 divide(1, 0) <ipython-input-1-5c74f9fd7706> in divide(a, b) 1 def divide(a, b): ----> 2 return a / b ZeroDivisionError: division by zero Here, we defined a divide() function, and called it to divide 1 by 0. Dividing a number by 0 is an error in Python. Here, a ZeroDivisionError exception was raised. An exception is a particular type of error that can be raised at any point in a program. It is propagated from the innards of the code up to the command that launched the code. It can be caught and processed at any point. You will find more details about exceptions at https://docs.python.org/3/tutorial/errors.html, and common exception types at https://docs.python.org/3/library/exceptions.html#bltin-exceptions. The error message you see contains the stack trace, the exception type, and the exception message. The stack trace shows all function calls between the raised exception and the script calling point. The top frame, indicated by the first arrow ---->, shows the entry point of the code execution. Here, it is divide(1, 0), which was called directly in the Notebook. The error occurred while this function was called. The next and last frame is indicated by the second arrow. It corresponds to line 2 in our function divide(a, b). It is the last frame in the stack trace: this means that the error occurred there. Object-oriented programming Object-oriented programming (OOP) is a relatively advanced topic. Although we won't use it much in this book, it is useful to know the basics. Also, mastering OOP is often essential when you start to have a large code base. In Python, everything is an object. A number, a string, or a function is an object. An object is an instance of a type (also known as class). An object has attributes and methods, as specified by its type. An attribute is a variable bound to an object, giving some information about it. A method is a function that applies to the object. For example, the object 'hello' is an instance of the built-in str type (string). The type() function returns the type of an object, as shown here: In [35]: type('hello') Out[35]: str There are native types, like str or int (integer), and custom types, also called classes, that can be created by the user. In IPython, you can discover the attributes and methods of any object with the dot syntax and tab completion. For example, typing 'hello'.u and pressing Tab automatically shows us the existence of the upper() method: In [36]: 'hello'.upper() Out[36]: 'HELLO' Here, upper() is a method available to all str objects; it returns an uppercase copy of a string. A useful string method is format(). 
This simple and convenient templating system lets you generate strings dynamically, as shown in the following example: In [37]: 'Hello {0:s}!'.format('Python') Out[37]: Hello Python The {0:s} syntax means "replace this with the first argument of format(), which should be a string". The variable type after the colon is especially useful for numbers, where you can specify how to display the number (for example, .3f to display three decimals). The 0 makes it possible to replace a given value several times in a given string. You can also use a name instead of a position—for example 'Hello {name}!'.format(name='Python'). Some methods are prefixed with an underscore _; they are private and are generally not meant to be used directly. IPython's tab completion won't show you these private attributes and methods unless you explicitly type _ before pressing Tab. In practice, the most important thing to remember is that appending a dot . to any Python object and pressing Tab in IPython will show you a lot of functionality pertaining to that object. Functional programming Python is a multi-paradigm language; it notably supports imperative, object-oriented, and functional programming models. Python functions are objects and can be handled like other objects. In particular, they can be passed as arguments to other functions (also called higher-order functions). This is the essence of functional programming. Decorators provide a convenient syntax construct to define higher-order functions. Here is an example using the is_even() function from the previous Functions section: In [38]: def show_output(func): def wrapped(*args, **kwargs): output = func(*args, **kwargs) print("The result is:", output) return wrapped The show_output() function transforms an arbitrary function func() to a new function, named wrapped(), that displays the result of the function, as follows: In [39]: f = show_output(is_even) f(3) Out[39]: The result is: False Equivalently, this higher-order function can also be used with a decorator, as follows: In [40]: @show_output def square(x): return x * x In [41]: square(3) Out[41]: The result is: 9 You can find more information about Python decorators at https://en.wikipedia.org/wiki/Python_syntax_and_semantics#Decorators and at http://www.thecodeship.com/patterns/guide-to-python-function-decorators/. Python 2 and 3 Let's finish this section with a few notes about Python 2 and Python 3 compatibility issues. There are still some Python 2 code and libraries that are not compatible with Python 3. Therefore, it is sometimes useful to be aware of the differences between the two versions. One of the most obvious differences is that print is a statement in Python 2, whereas it is a function in Python 3. Therefore, print "Hello" (without parentheses) works in Python 2 but not in Python 3, while print("Hello") works in both Python 2 and Python 3. 
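One common way to smooth over this particular difference is the built-in __future__ mechanism, which makes selected Python 3 behaviors available in Python 2. Here is a minimal sketch (the import must appear at the top of the module or cell):

from __future__ import print_function, division

print("Hello")  # the function form of print now also works in Python 2
print(3 / 2)    # 1.5 in both Python 2 and Python 3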
There are several non-mutually exclusive options to write portable code that works with both versions: futures: A built-in module supporting backward-incompatible Python syntax 2to3: A built-in Python module to port Python 2 code to Python 3 six: An external lightweight library for writing compatible code Here are a few references: Official Python 2/3 wiki page at https://wiki.python.org/moin/Python2orPython3 The Porting to Python 3 book, by CreateSpace Independent Publishing Platform at http://www.python3porting.com/bookindex.html 2to3 at https://docs.python.org/3.4/library/2to3.html six at https://pythonhosted.org/six/ futures at https://docs.python.org/3.4/library/__future__.html The IPython Cookbook contains an in-depth recipe about choosing between Python 2 and 3, and how to support both. Going beyond the basics You now know the fundamentals of Python, the bare minimum that you will need in this book. As you can imagine, there is much more to say about Python. Following are a few further basic concepts that are often useful and that we cannot cover here, unfortunately. You are highly encouraged to have a look at them in the references given at the end of this section: range and enumerate pass, break, and, continue, to be used in loops Working with files Creating and importing modules The Python standard library provides a wide range of functionality (OS, network, file systems, compression, mathematics, and more) Here are some slightly more advanced concepts that you might find useful if you want to strengthen your Python skills: Regular expressions for advanced string processing Lambda functions for defining small anonymous functions Generators for controlling custom loops Exceptions for handling errors with statements for safely handling contexts Advanced object-oriented programming Metaprogramming for modifying Python code dynamically The pickle module for persisting Python objects on disk and exchanging them across a network Finally, here are a few references: Getting started with Python: https://www.python.org/about/gettingstarted/ A Python tutorial: https://docs.python.org/3/tutorial/index.html The Python Standard Library: https://docs.python.org/3/library/index.html Interactive tutorial: http://www.learnpython.org/ Codecademy Python course: http://www.codecademy.com/tracks/python Language reference (expert level): https://docs.python.org/3/reference/index.html Python Cookbook, by David Beazley and Brian K. Jones, O'Reilly Media (advanced level, highly recommended if you want to become a Python expert) Summary In this article, we have seen how to launch the IPython console and Jupyter Notebook, the different aspects of the Notebook and its user interface, the structure of the notebook cell, keyboard shortcuts that are available in the Notebook interface, and the basics of Python. Introduction to Data Analysis and Libraries Hand Gesture Recognition Using a Kinect Depth Sensor The strange relationship between objects, functions, generators and coroutines